103 Errors in Mapping Indo-European Languages in Bouckaert et al., Part I

As our criticisms of Bouckaert et al. have been extremely harsh, we must justify them in some detail. I have accused the authors of erring “at every turn,” a charge that reeks of hyperbole. But even if that claim is exaggerated, it is still not too far from the mark. To demonstrate the extraordinary density of error in the Science article, the next few posts will dissect the authors’ base map of Indo-European languages (Figure S6 in their Supplementary Materials). This map, depicting the distribution of both modern and ancient Indo-European languages, forms a key input for their “explicit geographic model of language expansion” (Bouckaert et al., p. 957), as the locations of the sampled languages shown on this map are fed into the model in order to calculate the location of the PIE homeland. Many of the errors and inconsistencies found on their other maps stem from mistakes made in this initial figure.

The map in question shows the location of the 103 Indo-European languages analyzed. The brief caption notes that “colored polygons represent the geographic area assigned to each language based on Ethnologue.” This assertion is misleading at best. The Ethnologue does not consistently map modern languages, and it pays little attention to long-extinct ones such as Hittite. And where the Ethnologue does map, it typically does so in vastly greater detail than Bouckaert et al. Compare, for example, how the two sources depict the languages of what is now southern and central Pakistan in the paired figures to the left.

Regardless of the source (or sources) used, the map is highly inaccurate. To illustrate the cavalcade of error found in Bouckaert et al., I have isolated 103 miscues, some admittedly rather minor, but others highly significant. As recounting all of them would be tedious, I will simply note them in call-outs on expanded details from their “master map.” I have prepared twelve such enlarged maps, each focusing on a different part of the historically Indo-European-speaking world. I will post these maps sequentially over the next few days, discussing in the accompanying posts some of their more egregious errors. Today’s post will conclude with a consideration of South Asia; subsequent ones will move in a westward direction, terminating in the British Isles.

Before examining the portrayal of the Indian Subcontinent in Bouckaert et al., a few words are in order about their general approach to mapping. Analyzing their base map is no easy matter, as they do not follow conventional cartographic procedures. Their all-important polygons are often impossible to trace, obscured by the large, numbered circles used to label the 103 languages. Another perceptual problem stems from their use of overlays, with multiple extinct languages (in red) layered upon extant languages (in blue). The resulting color blends yield confusing intermediate shades. Note, on the detail posted to the left, the depictions of Luvian, Hittite, Classical Armenian, Kurdish, and modern Armenian. Determining which language is indicated in which places takes some patience.

A more intractable problem concerns the map’s temporal framing. The short explanation provided in the caption makes the issue seem simple: “Red areas indicate ancient languages and blue areas indicate modern languages.” Left unanswered is the time frame of “linguistic modernity.” In some places, the term is defined broadly, extending back hundreds of years. Cornwall, for example, is shown as inhabited by speakers of modern Cornish. Such a view is anachronistic, as Cornish had disappeared from most of the peninsula by 1700 and was essentially extinct before the modern revival movement began in the 20th century. (Today Cornish is estimated to have only “a few” native speakers.) Elsewhere, the mapping of “modern languages” refers to the late 20th century. The German zone, for example, fits only the post-WWII period, after millions of German speakers had been expelled from Pomerania, Silesia, and the Sudetenland. The map, to put it simply, plays fast and loose with time and space.

Even more problematic is the mapping of many languages on the basis of political rather than linguistic features. As was noted in an earlier post, all of the maps used in the study show signs of what I called “geopolitical contamination,” in which the boundaries of modern-day states incorrectly determine those of language groups, following Max Weinreich’s dictum that “a language is a dialect with an army and navy.” I was puzzled, for example, by the fact that Moldova was placed outside of the Indo-European realm in Figure S4, showcased on Quentin Atkinson’s website. The reason is readily apparent when one considers the map of the 103 language polygons (Figure S6). Here Romanian is depicted as almost exactly coincident with Romania. Moldova is fully excluded from this realm, even though the official “Moldovan Language” is differentiated from Romanian solely on political grounds. One can indeed identify a Moldovan subdialect of Romanian, but it spans the Romanian-Moldovan border. Moldova should thus have been placed within the Romanian polygon, yet it is instead depicted in the same manner as Hungary, giving the impression that it lies outside the Indo-European realm. The consequences of such a strategy are troubling for the contemporary world, but become positively pernicious when retroactively extended into the past, which is precisely what the Bouckaert model does. As a result, almost all of Moldova is ludicrously mapped as most likely never having been occupied by Indo-European speakers in Figure S4.

Such geopolitical contamination is clearly evident in the depiction of the languages of South Asia, posted here. Note that Bengali, often regarded as the world’s sixth most widely spoken language, is essentially limited to Bangladesh, its 80+ million speakers in the Indian state of West Bengal written out of the linguistic community. Even more unreasonably, Vedic Sanskrit is given the polygon of a modern political unit. The supposed territory of this ancient language is outlined and shaded in red in the map posted here. This area, it turns out, precisely fits the territorial extent of Punjab before it was partitioned by the British. That colonial-era Punjab would have no bearing on the distribution of Vedic Sanskrit, spoken some 3,000 years ago, should go without saying. It is also worth noting that the former Punjab included what is now the Indian Himalayan state of Himachal Pradesh, which features peaks 22,000 feet above sea level. It is safe to assume that such areas were never part of the Vedic Sanskrit realm.


Mapping Vedic Sanskrit is no easy task, but that is no excuse for using a modern geopolitical proxy. Careful studies show that the world of the Rig Veda was largely limited to what are now the Indian state and the Pakistani province of Punjab, along with the Vale of Peshawar and the Swat Valley. “Vedic India” in the larger sense extended from this region down the Ganges Valley through Bihar and southward to encompass Gujarat, as can be seen in the second map posted here. Either of these two areas could easily have been used for the Vedic Sanskrit polygon.


I will not comment further on the remaining errors and infelicities in the Bouckaert et al. portrayal of South Asia, as a number of them are noted on the map itself. I have also posted a fine Wikipedia map of the current distribution of the Indo-European languages of South Asia for comparative purposes. (Note that this Wikipedia map lumps a number of disparate dialects into single languages, such as Bihari.)

As we shall see in forthcoming posts, similar errors litter all other portions of the original language map employed by Bouckaert et al. As a result, it is difficult to avoid the conclusion that the authors simply do not have the level of geo-linguistic comprehension necessary for carrying out their task. I have taught the geography of modern languages at leading universities for twenty-five years, and I can peg the level of understanding demonstrated by students fairly accurately. That of Bouckaert et al. would clearly fall into the “B” range. Given the unfortunate realities of grade inflation, that means that more than half of my undergraduate students finish their terms with a better understanding of the distribution of languages than the authors of a supposedly path-breaking article on the origin and spread of the world’s largest language family published in one of the world’s leading scientific journals.