The Misleading and Inconsistent Language Selection in Bouckaert et al.

To successfully model the spread and divergence of a language family, one must select languages for one’s data set in a comprehensive, balanced, and consistent manner. Results will be skewed if large numbers of languages are excluded from analysis, if some regions and linguistic branches are covered much more thoroughly than others, or if languages and dialects are distinguished by different criteria in different parts of the world. Bouckaert et al., unfortunately, do all of these things and more. The authors favor certain areas and linguistic sub-families while minimizing others. Biases relating to preservation and examination seem to guide most such decisions. Most extinct Indo-European languages that are well documented, such as Old English and Old Norse, are included in the analysis, whereas those that are poorly known, such as all of the Scythian languages of the hypothesized proto-Indo-European homeland in the Pontic Steppes, are simply ignored. Likewise, living languages that have been intensively studied get preference over those that have not received similar scrutiny. Selecting and ignoring languages in such a manner may be convenient for formal modeling, but deep and systematic distortions result.

One of the more vexing issues in linguistics is the differentiation of languages from dialects. As in biological taxonomy, “lumpers” argue endlessly with “splitters.” Which position one accepts is immaterial for formal analysis, but one must maintain consistency. Bouckaert et al., however, shift wildly from fine splitting to gross lumping. Their treatment of Albanian exemplifies the former approach, as they divide it into four separate languages (listed as Albanian C, Albanian K, Albanian G, and Albanian Top). Albanian is indeed divided into Gheg and Tosk, which can easily count as separate languages, but no other dialects approach such status in most divisional schemes. The split-happy Ethnologue, however, does count two minor Albanian dialects in Italy and Greece, both linguistically indistinct from Tosk in Albania, as separate languages, an approach that Bouckaert et al. chose to follow. In several other parts of Europe they adopt a similar method, classifying Breton as three separate languages, Sardinian as three, and the minor Slavic tongue of Lusatian (also known as Upper Sorbian) as two. But elsewhere in Europe they reject such fine divisions. They take Serbo-Croatian, for example, as a single language, yet oddly give it the ISO code for its Bosnian dialect [BOS]. They also regard German as one tongue; if they had remained consistent and followed the Ethnologue here, they would have included such languages as Bavarian, Mainfränkisch (East Franconian), Pfalzisch, Upper Saxon, and Swabian. In South Asia and the Iranian zone, the authors’ “lumping” tendency reaches an extreme. They count Hindi as a single language despite its pronounced dialectal variation (even Wikipedia discusses the “Hindi languages”). They do the same with Lahnda, a dialect continuum that encompasses, according to the Ethnologue, eight separate languages.

Bigger problems arise in the basic enumeration by Bouckaert et al. of the Indo-European languages of Asia. Whereas the comprehensive Wikipedia family tree for the Iranian branch of Indo-European includes more than fifty extant languages, the selective approach of Bouckaert et al. considers only nine. The authors are even more remiss when it comes to the Indo-Aryan languages of northern South Asia. Punjabi, often ranked as the world’s tenth most widely spoken language with more than 100 million speakers*, is nowhere to be seen. Whereas the authors list only fifteen extant I-E languages in South Asia, the Ethnologue counts more than 200. A few of the major Indo-Aryan languages discounted by Bouckaert et al. include Rajasthani (20 million** speakers), Bhili (1.5 million), Sylheti (10 million), Garhwali (3 million), Kutchi (2 million), Awadhi (38 million), Kannauji (6 million), and Bhojpuri (38 million). Yet in one part of the region, they abruptly switch to an idiosyncratic splitting approach, differentiating the Waziri dialect from Pashto, which they oddly call “Afghan.” The major split in this language, the north/south divide between “Pashto” and “Pakhto,” however, remains invisible.

By including European I-E languages much more readily than non-European ones, the authors evince a form of Eurocentrism. The same tendency is encountered in their treatment of extinct languages. For western and central Europe, nine dead languages are listed, including Old Irish, Old High German, Old English, and Old Prussian. Fair enough. But for northern South Asia, an area of roughly similar territorial extent and historical population levels, only Vedic Sanskrit makes the list. The many extinct Prakrit languages are excluded without reason. Here preservation bias cannot be the culprit, as a number of these languages are relatively well known. Even Pali, a semi-living language owing to its liturgical position in the Theravada Buddhist community, is inexplicably left off the map.

The Bouckaert model stumbles even more sharply in regard to extinct Iranian languages. Only two are included: Old Persian and Avestan. Major Eastern Iranian languages that were once important literary vehicles, such as Sogdian, Bactrian, Khotanese and Khwarezmian, are simply disregarded. So too are the less well-known Scythian languages of the steppe zone.*** As noted in previous posts, had the Scythian languages been included in the model, the geographical patterns generated would likely have been quite different. Although one could argue that the Scythian languages are not known well enough to have been used, such an argument amounts to an admission that preservation bias compromises the approach. The failure to include well-known Sogdian, on the other hand, cannot be attributed to preservation bias, and is perhaps rooted instead in carelessness, ignorance, or the simple desire to mold the data in order to reach pre-established conclusions.

As the supplementary materials make clear, the authors of the study are fully aware that they have excluded a number of Indo-European languages, both living and dead. Yet in an interview with Isabelle Boni for the general public, co-author Quentin Atkinson maintains that “we compare these words across all Indo-European languages” (emphasis added). Such a statement is careless and misleading at best.

*Admittedly, Western Punjabi is sometimes counted as one of the Lahnda languages, but not Eastern Punjabi.

**The 20 million figure used here assumes that Marwari is counted as a separate language, as it is in Bouckaert et al.

***It is also notable that the Indo-European Thracian language(s), along with the other Paleo-Balkan languages, are likewise ignored.

