103 Errors in Mapping Indo-European Languages in Bouckaert et al. Concluded: Part V, Western Europe

By now, all of the cartographic failings of Bouckaert et al. have become familiar. On the map of France and neighboring areas, for example, we see the unreasonable elevation of minor dialects to the status of discrete languages (three forms of Breton make the list), the replacement of a non-Indo-European language with an Indo-European language (the Basque region is shown as French speaking), the improper use of political boundaries as linguistic boundaries (French is not shown as extending into Switzerland), the preferential classification of dialects as languages when they are associated with states (Walloon counts as a language, unlike the other equally distinctive langues d’oïl of northern France or the langues d’oc of southern France; Flemish counts as a language, unlike other equally distinctive forms of Dutch), and the simple geographical misplacement of languages (Romansh is placed in northwestern Italy rather than southern Switzerland). Of particular note in regard to the linguistic mapping of France is the fact that Corsica is completely obliterated by circle #48 (see the map of the Italian Peninsula in the previous post).


The mapping of the Iberian Peninsula is particularly simplistic. The authors have simply placed Portuguese in Portugal and Spanish, along with Catalan, in Spain. The fact that Galician in northwestern Spain is closer to Portuguese than to Spanish is ignored, and the Basque-speaking region is mapped as if it were Spanish speaking. The Balearic Islands are also neglected, as archipelagoes generally are in the authors’ land-biased approach.


The map of the British Isles severely misconstrues the Celtic tongues. Irish, for example, is shown as extending across all of the Republic of Ireland and as entirely absent from Northern Ireland. In actuality, Irish has long been largely limited to the western margin of the island, and as late as the early 20th century was still spoken in parts of what was to become the political unit of Northern Ireland. The mapping here, in other words, is yet again political rather than linguistic. By the same token, Welsh is placed in the coal-mining districts of southern Wales where it has been absent for generations, just as Cornish is depicted in areas where it has not been spoken for hundreds of years. The mapping of Scottish Gaelic is not bad, but the term used, “Scots Gaelic,” is off the mark. The proper term is “Scottish Gaelic,” as “Scots” refers to a different language altogether. Scots, or Lowland Scots, is usually regarded as a highly distinctive form of English, but some linguists regard it as a language in its own right (CNN has recently reported on the demise of one of its dialects).*




The mapping of extinct languages is also poorly executed. Old English is essentially restricted to the historical kingdom of Wessex, even though the language extended as far north as the Edinburgh region of what is now southeastern Scotland, and included dialects of Kent, Mercia, and Northumbria. Significantly, even the Wessex (West Saxon) dialect of Old English extended farther to the east than what Bouckaert et al. would allow for Old English in its entirety.

The language map of Bouckaert et al. that I have criticized over these past five posts is a cornerstone of their model, yet it is also wholly inadequate for the task. Many of the errors found here ramify through all of the maps that they have produced. But even if a serviceable map had been constructed, the model would still yield nonsense, as most of the assumptions upon which it is based are unwarranted, as we shall see in more detail in subsequent posts.

*Although I am no expert on this topic, I would argue that Lowland Scots is almost but not quite interintelligible with Standard English, especially in its spoken form, and thus deserves to be regarded as a separate language. Although I love the poetry of Robert Burns, I generally need translation. Take, for example, these verses from “Auld Lang Syne”:

In the Original Scots:

We twa hae run about the braes,

and pu’d the gowans fine;

But we’ve wander’d mony a weary fit,

sin auld lang syne.


We twa hae paidl’d i’ the burn,

frae morning sun till dine;

But seas between us braid hae roar’d

sin auld lang syne.


In Standard English:

We two have run about the slopes,

and picked the daisies fine;

But we’ve wandered many a weary foot,

since long long ago.


We two have paddled in the stream,

from morning sun till dinner time;

But seas between us broad have roared

since long long ago.


Or listen to the delightful poem “To a Mouse” on the ScotsIndependent website:


Wee, sleekit, cow’rin, tim’rous beastie,

O, what a panic’s in thy breastie!

Thou need na start awa sae hasty

Wi bickering brattle!

I wad be laith to rin an’ chase thee,

Wi’ murdering pattle.


103 Errors in Mapping Indo-European Languages in Bouckaert et al., Part IV (Central Europe)

(Continued) The main problems with the language map of eastern Central Europe in Bouckaert et al. have already been discussed; to wit, the depiction of “national” languages as coterminous with state boundaries. The authors do occasionally deviate from this norm, showing, for example, a tiny non-Romanian area in northwestern Romania. Note also that they show Latvian as failing to reach Latvia’s northwestern coast. This view is indeed historically accurate, as northern Courland was the land of the Livonians, a Finnic-speaking people. The last native speaker of Livonian, however, died in 2009; for decades before that, Livonian was severely endangered and most speakers were bilingual in Latvian or Russian. If the map purports to depict the present situation, it is flatly wrong here. If it depicts the relatively recent past, as it does for some areas, it is more on target. Unfortunately, no time specification is provided.

Such unspecified chronology is a more intractable problem for the depiction of extinct languages. Major languages of the distant past often experienced major geographical changes, sometimes literally moving en masse when their speakers migrated. The Goths, for example, probably originated in what is now Sweden, later crossed the Baltic into northern Central Europe, subsequently moved into the steppes north and northwest of the Black Sea, and eventually spread with victorious warrior bands over much of the Roman Empire; the final redoubt of the language was the Crimean Peninsula, where it persisted until the ninth century and perhaps until early modern times. Any Gothic language polygon would thus fit a specific place only at a specific time. Bouckaert et al. have apparently selected the period just after the movement of Gothic out of Scandinavia, although the area specified does not seem to match what (little) is known about the early relocation of the language (see the map to the left).

As mentioned in the previous post, the placing of Byelorussian (Belarusian) in a small corner of the Czech Republic is a careless transcription error. But the intended depiction, that of Eastern Czech, is still off base. Czech is not heavily differentiated into dialects. The truly distinctive forms of the language are halfway to Polish. Cieszyn Silesian and other Lach dialects are regarded by most Czech linguists as a Polish-influenced form of Czech and by most Polish linguists as a Czech-influenced form of Polish (politics do tend to intrude into linguistic discussions). Such dialects, however, are not on the map. What is (supposed to be) shown is “Eastern Czech,” placed in a small corner in the southeastern part of the Czech Republic. It is unclear what this designation refers to. Across the entire eastern half of the republic, one finds the Moravian dialect (or dialects), which are not strikingly different from standard Czech.

The linguistic depiction of the Italian Peninsula in Bouckaert et al. contains some curious features. This portion of the map is difficult to decipher, as extinct languages overlay extant languages, and much of the area is covered by the circular labels. It is still clear, however, that the mapping here remains inconsistent. Italian is shown as extending neither into the Po Valley in the north nor to Sicily in the south. Fair enough: the local dialects spoken (or spoken until recently) in those areas are markedly different from Standard Italian, based on the Tuscan dialect. Yet the authors place other parts of the peninsula with equally distinctive dialects, such as Apulia in the southeast, in the Italian language category. In regard to the extinct Indo-European languages mapped here, the major issue is why only Umbrian and Oscan were selected to accompany Latin.




Most of the problems found on the map of Germany and environs have already been discussed. Note, for example, how Luxembourgish makes the cut on political grounds, whereas other distinctive German dialects are ignored. Of special note here is the demarcation of two Lusatian (or Sorbian) languages, although only one is labeled on this map segment. These Slavic tongues of eastern Germany are distinctive, and mapping them as separate languages makes linguistic sense. But it is difficult to understand why these relatively minor languages, with 40,000 and 10,000 speakers respectively, have been added to the tally, whereas Iranian and Indic I-E languages with hundreds of thousands to tens of millions of speakers have been ignored.

The language mapping of Scandinavia shows, yet again, striking geopolitical influence. Here we have Danish blanketing Denmark, Riksmål (or the Norwegian “national language”) everywhere in Norway except the islands and Finnmark, and three separate Swedish languages covering all of Sweden except the islands, which remain unmarked. The straight east-west line that separates two supposedly distinct Swedish languages is a curious and highly unlikely feature.

But as one would expect, the continental Scandinavian languages do not actually correspond so well to national territories. Overall, the region is characterized by a dialect continuum so pronounced that some scholars regard all of the mainland North Germanic tongues as a single, regionally differentiated language. Swedish and Danish are almost interintelligible, and Norwegian is often regarded as a kind of bridge: as a common saying puts it, “Norwegian is Danish spoken in Swedish.” (Norwegian vocabulary is similar to that of Danish, whereas its phonology is more like that of Swedish.) But it is more complicated than that, as there is no single Norwegian language at any level. Local dialects cross the border with Sweden, but even in terms of official state recognition, Bokmål (“book language”) competes with Nynorsk (“New Norwegian”), and neither of these two variants is exactly the same as the standardized but non-official Riksmål (“national language”) and Høgnorsk (“High Norwegian”) forms. The differences between Bokmål and Nynorsk are not purely lexical (e.g. Bokmål pike ‘girl’ vs. Nynorsk jente ‘girl’), but concern grammatical patterns too (e.g. Bokmål does not distinguish masculine and feminine genders, whereas Nynorsk does). In a sense, the differences between Bokmål and Nynorsk are more pronounced than those between Bokmål and Danish (e.g. the Danish word for ‘girl’ is pige, and most dialects of Danish and its standardized form do not distinguish masculine and feminine genders). The contention among these different language varieties is at once political, cultural, and historical, tied up with Norway’s former subordination to Denmark. Norwegian linguistic nationalists have often wanted to purge specifically Danish elements from the language, whereas linguistic traditionalists would like to preserve them.

Legacies of geopolitical change are also evident in the Scania region of southern Sweden. The dialects of Sweden’s far south are close to those of Denmark, so close, in fact, that some scholars place them within an “East Danish” category. Significantly, Scania was part of the Kingdom of Denmark until it was lost to the rising power of Sweden in 1658; it did not become an integral part of Sweden, however, until 1719, at which point a policy of linguistic “Swedenization” was initiated. “East Danish” is thus considered by some to be more a historical than a linguistic category.

One of the oddest features of the mapping strategies employed by Bouckaert et al. is their reluctance to include islands within the territories of any language. In some cases, island groups are appended to mainland polygons, as can be seen here in the depiction of Danish (in the same manner, the Hebrides are mapped as Scottish-Gaelic speaking). Most often, however, islands and archipelagos are simply ignored, as one can see here in the cases of Norway’s Lofoten and Sweden’s Gotland and Öland. Had Gotland been considered, I wonder whether it would have been mapped as Gutnish speaking. Gutnish, a disappearing dialect, is distinctive, and is sometimes said to be a direct descendant of ancient Gothic.

The mapping of Old Norse as coinciding with Iceland is also untenable. When Old Norse was spoken in Iceland, it was also spoken in Norway, Sweden, Denmark, northern Scotland, and pockets of the western British Isles.


103 Errors in Mapping Indo-European Languages in Bouckaert et al., Part III: From Western Russia to the Balkan Peninsula

(Continued) The most glaring error in the linguistic map of western Russia and environs by Bouckaert et al. concerns the labeling of Belarus. The number “22,” placed in the center of the country, is listed as signifying “Czech E,” which presumably means “eastern Czech.” As the authors have correspondingly appended the label “Byelorussian” to a small area in the eastern Czech Republic, the error is obviously one of transposition. Such mistakes can occur inadvertently, although the fact that this one has gone undetected indicates a troubling failure to engage in routine proofreading.

A much deeper problem is indicated by the intentional mapping. Note how the polygons indicating the Belarusian and Ukrainian languages correspond precisely to the present-day territories of Belarus and Ukraine respectively. Such exact political-linguistic correspondence is rare, and when it is encountered it generally indicates a recent history of state-led linguistic repression or ethnic cleansing, which should be taken into account in any historical consideration of linguistic geography. In the case of Belarus and Ukraine, however, the current distribution of the national languages does not even come close to fitting precisely within the geographical bodies of the respective countries.

Belarusian is widely spoken in Belarus but it is not the country’s majority language and it is dominant only in the west and the south, as can be seen on the Wikipedia map posted here. Even in these areas, Belarusian is losing ground among the young, and is thus classified as a “threatened language.” The threat stems from Russian, which, according to the 2009 national census, is spoken at home by 72 percent of the people of Belarus. Identifying the Belarusian language with the national territory of Belarus is—yet again—a political rather than a linguistic statement.

Placing the Ukrainian language precisely within the territorial bounds of Ukraine is an even more egregious error. The fact that eastern Ukraine and the Crimean Peninsula are mostly Russian-speaking areas is well known, as it is mentioned almost every time that Ukrainian elections are discussed. According to the Constitution of the Autonomous Republic of Crimea, Russian rather than Ukrainian serves as the “language of interethnic communication.” Moreover, government duties in Crimea are fulfilled mainly in Russian, hence it is a de facto official language. The issue of whether Russian should be made co-official in other areas of Eastern and Southern Ukraine that are already de facto Russian-speaking is hotly debated on the parliamentary level. Before WWII, moreover, the linguistic map of the region was far more complex than it is now, an observation that holds true for most of eastern and central Europe. The southern Crimea, for example, was then dominated by people speaking Crimean Tatar, a language in the Turkic family.

The depiction of European Russia is little better. In this case, political boundaries are not slavishly followed, as large areas of northern Russia are correctly shown as non-Russian speaking. But many northern regions that are Russian-speaking, such as Saint Petersburg, are oddly excluded from the realm. Conversely, sizable areas in eastern European Russia are mapped as Russian-speaking when in actuality they are inhabited by peoples speaking Uralic and Turkic languages. It is admittedly difficult to map such languages as (Volga) Tatar, Mari, and Udmurt, as they are not spoken in geographically contiguous areas but rather form archipelagos in a Russian sea. But do such technical challenges warrant the exclusion of such languages? More than six million citizens of the Russian Federation speak Tatar as their first language, and mapping them as if they were Russian speakers fails to give them the recognition that they deserve. The Udmurt language, spoken by about half a million people, was recently propelled into the public spotlight in Russia and the rest of Europe when a band of Udmurt-speaking (and -singing) grandmothers won second place at the Eurovision Song Contest.

Such mapping difficulties are by no means limited to western Russia. In many parts of the Indo-European realm, languages are interspersed, forming complex amalgams. As mentioned above, such mixtures were much more intricate before the horrors of the Second World War and its immediate aftermath. Depicting such areas as linguistically uniform, as Bouckaert et al. routinely do, thus results in intrinsic distortions. Such distortions, moreover, seem to be a necessary feature of their basic methodology, as they depict every language within a discrete and uniform polygon. Linking together languages whose speakers are scattered in separate communities over large areas into single bounded spaces results in such absurdities as the gerrymandered Kurdistan mentioned in the previous post.

Such procrustean tendencies reach a laughable extreme in the depiction of the Romani language (that of the so-called Gypsies), seen on the map of the Balkans posted to the left. Romani, labeled 74, is impossible to locate precisely, as the area indicated is covered by circle 16 in western Bulgaria. Presumably, a small, discrete Romani polygon lies below this numerical tag. To restrict the Romani language to this area is beyond absurd. Romani, like the Roma people who (sometimes) speak it, is dispersed over most of Europe. Bouckaert et al., however, do not even manage to adequately locate the language’s center of gravity, as far more people speak Romani in Romania than in Bulgaria. Mapping Romani is, of course, an extraordinarily difficult task, as the linguistic community is not only scattered widely, but its members often relocate. As a result, most cartographers simply indicate the numbers and percentages of Romani speakers (or Roma people more generally) found in different countries.

The rest of the map is not much better. Although the authors differentiate four separate Albanian languages, they depict the northern half of Albania as non-Albanian speaking. They also limit Serbo-Croatian to Serbia and Montenegro, excluding Croatia and Bosnia. Here the categories used and the map itself fail to correspond; what the map shows is the political-linguistic construct of Serbian (plus Montenegrin), used since the break-up of Yugoslavia, whereas the label turns back to the Yugoslavian idea of a single Serbo-Croatian language, which also encompasses Bosnian and Croatian. From a linguistic standpoint, Serbo-Croatian works best, as all of its politically standardized forms are mutually intelligible to some degree. But by the same token, Bulgarian and Macedonian, shown here as separate languages, are similarly interintelligible. The underlying problem here is the lack of uniformity in the treatment of different languages: if they have four Albanian languages as well as separate languages in Bulgaria and Macedonia, they should have separated Serbian, Croatian, Bosnian, and Montenegrin; or, better still, they should have differentiated the non-political dialectal divisions of Serbo-Croatian: Chakavian, Kajkavian, Western Shtokavian, Eastern Shtokavian, and Torlakian.

Finally, the mapping of Greek, both ancient and modern, is bizarrely idiosyncratic. On what possible basis could the authors limit ancient Greek to Athens and its vicinity? The implicit argument here is that only Attic Greek was Greek, with the other Hellenic polities speaking non-Greek languages, a nonsensical idea. And yet they don’t even manage to map Attic Greek properly, leaving out the islands on which it was spoken. One can only conclude that the authors are incompetent at mapping languages, a cornerstone of their approach.


103 Errors in Mapping Indo-European Languages in Bouckaert et al., Part I

As our criticisms of Bouckaert et al. have been extremely harsh, we must justify them in some detail. I have accused the authors of erring “at every turn,” a charge that reeks of hyperbole. But even if that claim is exaggerated, it is still not too far from the mark. To demonstrate the extraordinary density of error in the Science article, the next few posts will dissect the authors’ base map of Indo-European languages (Figure S6 in their Supplementary Materials). This map, depicting the distribution of both modern and ancient Indo-European languages, forms a key input for their “explicit geographic model of language expansion” (Bouckaert et al., p. 957), as the locations of the sampled languages shown on this map are fed into the model in order to calculate the location of the PIE homeland. Many of the errors and inconsistencies found on their other maps stem from mistakes made in this initial figure.

The map in question shows the location of the 103 Indo-European languages analyzed. The brief caption notes that “colored polygons represent the geographic area assigned to each language based on Ethnologue.” This assertion is misleading at best. The Ethnologue does not consistently map modern languages, and it pays little attention to long-extinct ones such as Hittite. And where the Ethnologue does map, it typically does so in vastly greater detail than Bouckaert et al. Compare, for example, how the two sources depict the languages of what is now southern and central Pakistan in the paired figures to the left.

Regardless of the source (or sources) used, the map is highly inaccurate. To illustrate the cavalcade of error found in Bouckaert et al., I have isolated 103 miscues, some admittedly rather minor, but others highly significant. As recounting all of them would be tedious, I will simply note them in call-outs on expanded details from their “master map.” I have prepared twelve such enlarged maps, each focusing on a different part of the historically Indo-European-speaking world. I will post these maps sequentially over the next few days, discussing in the accompanying posts some of their more egregious errors. Today’s post will conclude with a consideration of South Asia; subsequent ones will move in a westward direction, terminating in the British Isles.

Before examining the portrayal of the Indian Subcontinent in Bouckaert et al., a few words are in order about their general approach to mapping. Analyzing their base map is no easy matter, as they do not follow conventional cartographic procedures. Their all-important polygons are often impossible to trace, obscured by the large, numbered circles used to label the 103 languages. Another perceptual problem stems from their use of overlays, with multiple extinct languages (in red) layered upon extant languages (in blue). The resulting color blends yield confusing intermediate shades. Note on the detail posted to the left the depictions of Luvian, Hittite, Classical Armenian, Kurdish, and modern Armenian. Determining which language is indicated in which places takes some patience.

A more intractable problem concerns the map’s temporal framing. The short explanation provided in the caption makes the issue seem simple: “Red areas indicate ancient languages and blue areas indicate modern languages.” Left unanswered is the time frame of “linguistic modernity.” In some places, the term is defined broadly, extending back hundreds of years. Cornwall, for example, is shown as inhabited by speakers of modern Cornish. Such a view is anachronistic, as Cornish had disappeared from most of the peninsula by 1700, and was essentially extinct before the modern revival movement began in the 20th century. (Today Cornish is estimated to have only “a few” native speakers.) Elsewhere, the mapping of “modern languages” refers to the late 20th century. The German zone, for example, fits only the post-WWII period, after millions of German speakers had been expelled from Pomerania, Silesia, and Sudetenland. The map, to put it simply, plays fast and loose with time and space.

Even more problematic is the mapping of many languages on the basis of political rather than linguistic features. As was noted in an earlier post, all of the maps used in the study show signs of what I called “geopolitical contamination,” in which the boundaries of modern-day states incorrectly determine those of language groups, following Max Weinreich’s dictum that “a language is a dialect with an army and navy.” I was puzzled, for example, by the fact that Moldova was placed outside of the Indo-European realm in Figure S4, showcased on Quentin Atkinson’s website. The reason is readily apparent when one considers the map of the 103 language polygons (Figure S6). Here Romanian is depicted as almost exactly coincident with Romania. Moldova is fully excluded from this realm, even though the official “Moldovan Language” is differentiated from Romanian solely on political grounds. One can indeed identify a Moldovan subdialect of Romanian, but it spans the Romanian-Moldovan border. Moldova should thus have been placed within the Romanian polygon, yet it is instead depicted in the same manner as Hungary, giving the impression that it lies outside the Indo-European realm. The consequences of such a strategy are troubling for the contemporary world, but become positively pernicious when retroactively extended into the past, which is precisely what the Bouckaert model does. As a result, almost all of Moldova is ludicrously mapped as most likely never having been occupied by Indo-European speakers in Figure S4.








Such geopolitical contamination is clearly evident in the depiction of the languages of South Asia, posted here. Note that Bengali, often regarded as the world’s sixth most widely spoken language, is essentially limited to Bangladesh, its 80+ million speakers in the Indian state of West Bengal written out of the linguistic community. Even more unreasonably, Vedic Sanskrit is given the polygon of a modern political unit. The supposed territory of this ancient language is outlined and shaded in red in the map posted here. This area, it turns out, precisely fits the territorial extent of Punjab before it was partitioned by the British. That colonial-era Punjab would have no bearing on the distribution of Vedic Sanskrit, spoken some 3,000 years ago, should go without saying. It is also worth noting that the former Punjab included what is now the Indian Himalayan state of Himachal Pradesh, which features peaks 22,000 feet above sea level. It is safe to assume that such areas were never part of the Vedic Sanskrit realm.


Mapping Vedic Sanskrit is no easy task, but that is no excuse for using a modern geopolitical proxy. Careful studies show that the world of the Rig Veda was largely limited to what are now the Indian state and the Pakistani province of Punjab, along with the Vale of Peshawar and Swat Valley. “Vedic India” in the larger sense extended from this region down the Ganges Valley through Bihar and southward to encompass Gujarat, as can be seen in the second map posted here. Either of these two areas could easily have been used for the Vedic Sanskrit polygon.


I will not comment further on the remaining errors and infelicities in the Bouckaert et al. portrayal of South Asia, as a number of them are noted on the map itself. I have also posted a fine Wikipedia map of the current distribution of the Indo-European languages of South Asia for comparative purposes. (Note that this Wikipedia map lumps a number of disparate dialects into single languages, such as Bihari.)

As we shall see in forthcoming posts, similar errors litter all other portions of the original language map employed by Bouckaert et al. As a result, it is difficult to avoid the conclusion that the authors simply do not have the level of geo-linguistic comprehension necessary for carrying out their task. I have taught the geography of modern languages at leading universities for twenty-five years, and I can peg the level of understanding demonstrated by students fairly accurately. That of Bouckaert et al. would clearly fall into the “B” range. Given the unfortunate realities of grade inflation, that means that more than half of my undergraduate students finish their terms with a better understanding of the distribution of languages than the authors of a supposedly path-breaking article on the origin and spread of the world’s largest language family published in one of the world’s leading scientific journals.



The Misleading and Inconsistent Language Selection in Bouckaert et al.

To successfully model the spread and divergence of a language family, one must select languages for one’s data set in a comprehensive, balanced, and consistent manner. Results will be skewed if large numbers of languages are excluded from analysis, if some regions and linguistic branches are covered much more thoroughly than others, or if both dialects and languages are selected based on different criteria in different parts of the world. Bouckaert et al., unfortunately, do all of this and more. The authors favor certain areas and linguistic sub-families, minimizing others. Biases relating to preservation and examination seem to guide most such decisions. Most extinct Indo-European languages that are well documented, such as Old English and Old Norse, are included in the analysis, whereas those that are poorly known, such as all of the Scythian languages of the hypothesized proto-Indo-European homeland in the Pontic Steppes, are simply ignored. Likewise, living languages that have been intensively studied get preference over those that have not received similar scrutiny. Selecting and ignoring languages in such a manner may be convenient for formal modeling, but deep and systematic distortions result.

One of the more vexing issues in linguistics is the differentiation of languages from dialects. As in biological taxonomy, “lumpers” argue endlessly with “splitters.” Whether one accepts either position is immaterial for formal analysis, but one must maintain consistency. Bouckaert et al., however, shift wildly from fine splitting to gross lumping. Their treatment of Albanian exemplifies the former approach, as they divide it into four separate languages (listed as Albanian C, Albanian K, Albanian G, and Albanian Top). Albanian is indeed divided into Gheg and Tosk, which can easily count as separate languages, but no other dialects approach such status in most divisional schemes. The split-happy Ethnologue, however, does count two minor Albanian dialects in Italy and Greece, linguistically indistinct from Tosk in Albania, as separate languages, an approach that Bouckaert et al. chose to follow. In several other parts of Europe they adopt a similar method, classifying Breton as three separate languages, Sardinian as three, and the minor Slavic tongue of Lusatian (also known as Upper Sorbian) as two. But elsewhere in Europe they reject such fine divisions. They take Serbo-Croatian, for example, as a single language (yet oddly give it the ISO code for its Bosnian dialect [BOS]). They also regard German as one tongue; if they had remained consistent and followed the Ethnologue here, they would have included such languages as Bavarian, Mainfränkisch (East Franconian), Pfalzisch, Upper Saxon, and Swabian. In South Asia and the Iranian zone, the authors’ “lumping” tendency reaches an extreme. They count Hindi as a single language despite its pronounced dialectal variation (even Wikipedia discusses the “Hindi languages”). They do the same with Lahnda, a dialect continuum that encompasses, according to the Ethnologue, eight separate languages.

Bigger problems for Bouckaert et al. are encountered in their basic enumeration of the Indo-European languages of Asia. Whereas the comprehensive Wikipedia family tree for the Iranian branch of Indo-European includes more than fifty extant languages, the selective approach of Bouckaert et al. considers only nine. The authors are even more remiss when it comes to the Indo-Aryan languages of northern South Asia. Punjabi, widely regarded as the world’s tenth most widely spoken language with more than 100 million speakers*, is nowhere to be seen. Whereas the authors list only fifteen extant I-E languages in South Asia, the Ethnologue counts more than 200. A few of the major Indo-Aryan languages discounted by Bouckaert et al. include Rajasthani (20 million** speakers), Bhili (1.5 million), Sylheti (10 million), Garhwali (3 million), Kutchi (2 million), Awadhi (38 million), Kannauji (6 million), and Bhojpuri (38 million). Yet in one part of the region, they abruptly switch to an idiosyncratic splitting approach, differentiating the Waziri dialect from Pashto, which they oddly call “Afghan.” The major split in this language, the north/south divide between “Pashto” and “Pakhto,” however, remains invisible.

By including European I-E languages much more readily than non-European ones, the authors evince a form of Eurocentrism. The same tendency is encountered in their treatment of extinct languages. For western and central Europe, nine dead languages are listed, including Old Irish, Old High German, Old English, and Old Prussian. Fair enough. But for northern South Asia, an area of roughly similar territorial extent and historical population levels, only Vedic Sanskrit makes the list. The many extinct Prakrit languages are excluded without reason. Here preservation bias cannot be the culprit, as a number of these languages are relatively well known. Even Pali, a semi-living language owing to its liturgical position in the Theravada Buddhist community, is inexplicably left off the map.

The Bouckaert model stumbles even more sharply in regard to extinct Iranian languages. Only two are included: Old Persian and Avestan. Major Eastern Iranian languages that were once important literary vehicles, such as Sogdian, Bactrian, Khotanese and Khwarezmian, are simply disregarded. So too are the less well-known Scythian languages of the steppe zone.*** As noted in previous posts, had the Scythian languages been included in the model, the geographical patterns generated would likely have been quite different. Although one could argue that the Scythian languages are not known well enough to have been used, such an argument amounts to an admission that preservation bias compromises the approach. The failure to include well-known Sogdian, on the other hand, cannot be attributed to preservation bias, and is perhaps rooted instead in carelessness, ignorance, or the simple desire to mold the data in order to reach pre-established conclusions.

As the supplementary materials make clear, the authors of the study are fully aware that they have excluded a number of Indo-European languages, both living and dead. Yet in an interview with Isabelle Boni for the general public, co-author Quentin Atkinson maintains that “we compare these words across all Indo-European languages” (emphasis added). Such a statement is careless and misleading at best.

*Admittedly, Western Punjabi is sometimes counted as one of the Lahnda languages, but not Eastern Punjabi.

** The 20 million figure used here assumes that Marwari is counted as a separate language, as it is in Bouckaert et al.

***It is also notable that the Indo-European Thracian language(s), along with the other Paleo-Balkan languages, are likewise ignored.


Why the Indo-European Debate Matters—And Matters Deeply

As expected, we have received a few complaints from friends, acquaintances, and Facebook-followers in regard to the current Indo-European series. “Why get so exercised over a single article,” some ask, reminding us that science is a self-correcting endeavor that will eventually winnow away the chaff. Others question the entire enterprise, wondering why we would care so much about such an obscure topic.

We agree that science is, in the long run, a self-correcting undertaking, which gives it vast power. But self-correction does not come automatically; it takes work, which we are happy to provide. And in the short term, counterfeit research can do great harm, as the Lysenko Affair in the Soviet Union so well demonstrated. We also find it deeply troubling that a nonsensical article would not only be accepted for publication in one of the world’s premier scientific journals, but would immediately be trumpeted in the mass media for “solving” one of the key mysteries of human pre-history. The episode uncovers a whiff of corruption in the scientific-journalist establishment that needs a blast of fresh air.

In regard to the second set of complaints, we must reject them outright. The Indo-European issue is not obscure, trivial, or unrelated to pressing issues of our day. In fact, it is difficult to locate a single topic of historical debate that has been more ideologically fraught and politically laden over the past 150 years than that of Indo-European origin and expansion.

Indo-European studies took on a heavy ideological burden in the late 1800s, a development that would indirectly lead to the most hideous examples of genocide and mass-murder that the world has ever witnessed. The supposedly superior “Aryans” of Nazi mythology were none other than the speakers of Proto-Indo-European (PIE). Nazi propagandists conjured their own wildly off-base theories about I-E origins, but their fantasies had roots in the scholarly endeavors of German philologists. And while Nazism was militarily crushed and its ideological foundations pulverized, the movement refuses to die. Indeed, it seems to be experiencing something of a revival in eastern Germany, Hungary, and—of all places—Russia. On numerous occasions, I have found myself directed by Google to the odious “Stormfront” website while searching for images and ethnographic descriptions of various Eurasian ethnic groups. The Aryan myth also continues to feed racially troubling ideologies outside of Europe, particularly in Iran and northern India.

Even scholars who have sought to undermine the noxious notion of the Aryan Herrenvolk have occasionally generated their own benign but still fantasy-laden counter-narratives. The key figure here is the late Lithuanian-American archeologist Marija Gimbutas, noted for placing the I-E homeland in the Pontic Steppes. Gimbutas’s scientific research was solid, and we suspect that she was largely correct in locating the PIE homeland. But in seeking to turn the Nazi view on its head, she went too far—and some of her lay followers went much too far. In the feminist retelling of the tale that she inspired, the Aryans become the Kurgans, a uniquely violent, male-dominated people who destroyed the peaceful, gender-equitable if not matriarchal civilization of “Old Europe.” In Riane Eisler’s 1988 treatise, The Chalice and the Blade: Our History, Our Future, the Kurgan conquests are seen as ushering in a global age of male domination and mass violence. The work was a bestseller, blurbed by noted anthropologist Ashley Montagu as the “most important book since Darwin’s Origin of Species.”

Eisler’s global vision failed from the outset: as male domination characterized almost all historically known human societies, it cannot be attributed to a single ancient people located in one particular part of the Earth. Recent research has also tended to undermine many of her more specific claims. The Old Europeans were probably not as peaceful and female-centered as they had been portrayed, and the PIE speakers and their immediate descendants were probably not so insistently androcentric. Certainly the early Indo-European speakers were no strangers to violence and domination, but how do we account for the female Scythian skeletons from the Kurgan homeland tricked out in military gear? Perhaps Herodotus was on to something when he wrote of Amazon tribes in the area. More to the point, we now understand that the early Indo-European speakers could not have simply invaded Old Europe and subjugated its inhabitants, as they lacked the state-level forms of military organization necessary for wide conquests. As Anthony shows so well in The Horse, the Wheel, and Language, the process was almost certainly one of gradual incursions, marked by both social predation and mutualism, that allowed the militarily advantaged, semi-pastoral, equestrian I-E speakers to slowly spread their forms of speech. And while their languages did indeed expand over vast areas, they did not simply replace pre-existing tongues. Almost everywhere, older linguistic elements survived. Major non-I-E substrates characterize such I-E subfamilies as Germanic and Greek. A huge problem for both Nazi ideology and the Gimbutas/Eisler thesis is the fact that most of the Germanic root words pertaining to war are non-Indo-European. The mysteries here remain deep.

Considering the misuses to which the issue of I-E origins has been put, it is understandable that some people would want to reject the idea that the original speakers were war-like horse-riders from some remote, northern homeland. All such troublesome interpretations would vanish if I-E expansion could instead be linked to the gradual movement of simple farmers from the Near Eastern agricultural heartland into the sparsely settled lands of Mesolithic Europe. But if the evidence indicates otherwise, as it most assuredly does, the result is merely another myth. Scientific responsibility demands the search for truth, even if the truth leads into uncomfortable areas.

Regardless of the complications introduced by ideological distortions, investigations of I-E origins and expansion have a huge bearing on the study of human prehistory. Indo-European, after all, is by far the world’s largest language family when counted by the number of speakers. Linguistic evidence about the family’s spread tells us much of significance about the historical development of a vast section of the Earth’s surface over many centuries, even millennia. Studies of human prehistory depend crucially on three lines of evidence: those derived from archeological digs; from genetic studies; and from linguistics. Over the past decade, much progress has been made in bridging linguistic and archeological evidence, as demonstrated by David Anthony’s The Horse, the Wheel, and Language. To the extent that the burgeoning genetic investigations of Y- and mitochondrial DNA lineages can be incorporated into this linguistic-archeological nexus, a much richer understanding of the prehistoric human past awaits. For a path-breaking interdisciplinary foray into this territory, see Andrew Shryock and Daniel Lord Smail, Deep History: The Architecture of Past and Present.

Such developments, however, risk being cut short if the field of historical linguistics continues to languish. Further progress will depend not only on linguists carrying out their own research, but also on their passing down their knowledge and techniques to future generations of students. Such lines of intellectual transmission, however, are threatened by cutbacks in linguistics departments, as well as by the assaults on the field mounted by interlopers who have somehow managed to convince many scientists that linguistic evidence is of little account when it comes to studying the history of languages. To the extent that the Anatolian hypothesis gains ground among archeologists and geneticists on the basis of the recent Science article, our collective knowledge of the past will take a sharp step backwards.

The most troubling aspect of the affair, however, is not the threats that it poses but rather the revelations that it makes about the integrity of the scientific and journalistic establishments. A scholarly journal such as Science is duty-bound to vet any potential contribution through established experts. Yet I have a difficult time imagining that the article in question was subjected to proper peer review by any qualified specialist in the field in which it sits: Indo-European historical linguistics. Either the article was never sent to a competent linguistics reviewer, or the resulting review was irresponsibly ignored. And yet this is not the first time that a preposterous article on historical linguistics has appeared in Science (and also in Nature), as we shall see in future posts. Have the editors of this august journal decided that the discipline of linguistics has somehow failed, and that its field of historical inquiry should therefore be handed over to epidemiologists and computational modelers? If so, on what possible grounds was this decision reached? Unless such questions can be answered, I have a difficult time avoiding the conclusion that the editors of Science have betrayed the basic canons of academic responsibility.

While contemplating these issues, I am continually reminded of the Sokal Hoax, an episode that revealed the vacuity of postmodernist literary theory and “science studies” in the mid-1990s. This affair came to my attention when I was participating in the conference on “The Flight from Science and Reason” organized by the New York Academy of Sciences. A rumor began to circulate among the attendees that a noted physicist and mathematician with solid leftist political credentials was perpetrating a prank that would debunk Social Text, perhaps the leading journal of poststructuralist theory, and in so doing deflate the pretension of those who sought to undermine science in the name of human liberation. Sokal’s article, entitled “Transgressing the Boundaries: Towards a Transformative Hermeneutics of Quantum Gravity,” argues that since science is merely a social construct, quantum gravity, especially as interpreted through the new-age lens of “morphogenetic fields,” can have progressive implications for political action. The paper was accepted and duly published, despite the fact that it was, as its author soon admitted, “a pastiche of Left-wing cant, fawning references, grandiose quotations, and outright nonsense . . . structured around the silliest quotations [by postmodernist academics] he could find about mathematics and physics.” Sokal designed the hoax as a kind of test of the allegations made by Paul Gross and Norman Levitt in their book Higher Superstition: The Academic Left and Its Quarrels With Science. As he discovered, even the most palpable nonsense imaginable could be published in Social Text so long as it sounded good and flattered the editors’ ideological preconceptions.

While the Sokal Affair was a purposive hoax, the members of the Bouckaert team evidently believe that their article constitutes a contribution to knowledge. But what the authors think about their own work is of no significance, as the arguments they make must stand on their own. Had Alan Sokal actually believed that the “construction” of quantum gravity could be a politically progressive act, would his article have been any less nonsensical? The current authors have thus perpetrated an unwitting hoax, but the end results should be no less embarrassing for the editors of Science than the Sokal Affair was for those of Social Text. Bouckaert et al. begin by improperly framing the problem, and then go on to err at every turn. It is not so much that the article’s conclusions are incorrect, but rather that every assumption it makes, every technique it employs, and virtually every “fact” that it marshals is either incorrect, inappropriate, or misleading. Yet this work was published in one of the world’s most prestigious scientific journals. Something here smells rather fishy.

But if the mere publication of the article in Science raises questions about intellectual integrity, its immediate celebration in the pages of the New York Times points to a deeper mire. Science publishes hundreds of articles each year, a tiny fraction of which are ever mentioned in the New York Times, let alone showcased in the newspaper’s main section. Yet the Times has gone out of its way on more than one occasion to trumpet “contributions” to linguistic history from members of the Bouckaert team, specifically Quentin Atkinson. Evidently, the editors of the supposed newspaper-of-record in the United States have concluded that the work of these scholars constitutes one of the most important scientific stories of the past decade. On what possible basis could such an assessment have been rationally made?

Journalists, like academics, are expected to adhere to certain standards of professional behavior. Unless they are writing for the editorial pages or are explicitly employed in “advocacy journalism,” reporters are expected to remain as objective as possible, not letting their own interests, political predilections, or friendship and kin networks direct their work. Such guidelines are impossible to follow to the letter, and as a result complete objectivity is a mere ideal. But such an ideal is still supposed to influence behavior in self-respecting media outlets, eliminating the excesses of partisanship. In the present case, however, all such ethical fetters seem to have been removed. Nicholas Wade’s reporting on this issue has been non-objective in the extreme. One can only speculate as to why Wade has been determined to act as Quentin Atkinson’s pocket journalist, ever ready to proclaim his latest clumsy foray into linguistics as a scientific breakthrough on par with plate tectonics.

To appreciate the level of corruption revealed by the Bouckaert Affair, imagine that a parallel series of events occurred in a different walk of life, such as business. Imagine, for example, that an established financial firm with a reasonably good reputation decided to apply its mathematical models to an unrelated business, one in which neither the leaders nor the employees of the company had any experience. Being ignorant of their new field, they made a number of naïve and ultimately untenable assumptions about how it operates, and thus when they applied their favored methods, unexpected breakdowns occurred. Soon the firm began to hemorrhage money. But rather than admit to their failure, the managers instead crowed about their success, hiding their mounting losses in misleading accounting sheets and obscurely written reports. Yet even as the company began to collapse, its reputation strengthened and its stock-market valuation rose. Such gains, it turns out, stemmed from glowing reports on its new venture in the business media, most notably the New York Times. The most substantive Times piece on the venture appeared not in the paper’s business pages, but in its main news section, gaining it a particularly wide readership. The fact that it was written by the former editor of its business section, a person widely regarded as one of the country’s leading economic journalists, helped propel the story. For a while, it appeared as if the firm could do no wrong. And then …

In the world of commerce, such a story would end with the quick death of the firm, as well as that of its business model. To the extent that any company making consistent losses will eventually fail, business—like science—is a self-correcting enterprise. Failure in business, however, is generally more pressing than it is in science, as rather more money and power are typically at stake. Intrinsic error can linger in science for decades, as demonstrated by the prolonged resistance of geologists to the ever-mounting evidence for continental drift. In a field as marginal as Indo-European studies, well-funded pseudo-scientific works could withstand invalidation by under-funded scholars for many years. In the popular imagination, moreover, erroneous ideas can escape correction altogether, lodging so firmly as to be all but irremovable by evidence. Examples include the widely known non-facts that the Eskimo languages have a multitude of words for snow, and that Europeans before Columbus thought that the world was flat. The Indo-European Affair, in short, matters, and matters deeply. I find it cause for deep concern, and as a result I will continue to write about it.

But after one more post, the current series on Indo-European origins will go on hiatus for a few weeks. Both Asya and I must travel for a short period, so blogging in general will be light for the next week or so.