
The Misleading and Inconsistent Language Selection in Bouckaert et al.

Submitted on September 28, 2012 – 9:09 pm
To successfully model the spread and divergence of a language family, one must select languages for one’s data set in a comprehensive, balanced, and consistent manner. Results will be skewed if large numbers of languages are excluded from analysis, if some regions and linguistic branches are covered much more thoroughly than others, or if both dialects and languages are selected based on different criteria in different parts of the world. Bouckaert et al., unfortunately, do all of this and more. The authors favor certain areas and linguistic sub-families, minimizing others. Biases relating to preservation and examination seem to guide most such decisions. Most extinct Indo-European languages that are well documented, such as Old English and Old Norse, are included in the analysis, whereas those that are poorly known, such as all of the Scythian languages of the hypothesized proto-Indo-European homeland in the Pontic Steppes, are simply ignored. Likewise, living languages that have been intensively studied get preference over those that have not received similar scrutiny. Selecting and ignoring languages in such a manner may be convenient for formal modeling, but deep and systematic distortions result.

One of the more vexing issues in linguistics is the differentiation of languages from dialects. As in biological taxonomy, “lumpers” argue endlessly with “splitters.” Which position one accepts is immaterial for formal analysis, but one must maintain consistency. Bouckaert et al., however, shift wildly from fine splitting to gross lumping. Their treatment of Albanian exemplifies the former approach, as they divide it into four separate languages (listed as Albanian C, Albanian K, Albanian G, and Albanian Top). Albanian is indeed divided into Gheg and Tosk, which can easily count as separate languages, but no other dialects approach such status in most divisional schemes. The split-happy Ethnologue, however, does count two minor Albanian dialects in Italy and Greece (linguistically indistinct from Tosk in Albania) as separate languages, an approach that Bouckaert et al. chose to follow. In several other parts of Europe they adopt a similar method, classifying Breton as three separate languages, Sardinian as three, and the minor Slavic tongue of Lusatian (also known as Upper Sorbian) as two. But elsewhere in Europe they reject such fine divisions. They take Serbo-Croatian, for example, as a single language (yet oddly give it the ISO code for its Bosnian dialect [BOS]). They also regard German as one tongue; if they had remained consistent and followed the Ethnologue here, they would have included such languages as Bavarian, Mainfränkisch (East Franconian), Pfalzisch, Upper Saxon, and Swabian. In South Asia and the Iranian zone, the authors’ “lumping” tendency reaches an extreme. They count Hindi as a single language despite its pronounced dialectal variation (even Wikipedia discusses the “Hindi languages”). They do the same with Lahnda, a dialect continuum that encompasses, according to the Ethnologue, eight separate languages.

Bigger problems for Bouckaert et al. are encountered in their basic enumeration of the Indo-European languages of Asia. Whereas the comprehensive Wikipedia family tree for the Iranian branch of Indo-European includes more than fifty extant languages, the selective approach of Bouckaert et al. considers only nine. The authors are even more remiss when it comes to the Indo-Aryan languages of northern South Asia. Punjabi, widely regarded as the world’s tenth most widely spoken language with more than 100 million speakers*, is nowhere to be seen. Whereas the authors list only fifteen extant I-E languages in South Asia, the Ethnologue counts more than 200. A few of the major Indo-Aryan languages discounted by Bouckaert et al. include Rajasthani (20 million** speakers), Bhili (1.5 million), Sylheti (10 million), Garhwali (3 million), Kutchi (2 million), Awadhi (38 million), Kannauji (6 million), and Bhojpuri (38 million). Yet in one part of the region, they abruptly switch to an idiosyncratic splitting approach, differentiating the Waziri dialect from Pashto, which they oddly call “Afghan.” The major split in this language, the north/south divide between “Pashto” and “Pakhto,” however, remains invisible.
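The scale of the under-sampling described above can be made concrete with a little arithmetic. The following is only a rough sketch in Python: the counts are the ones quoted in this post, the reference totals ("more than fifty", "more than 200") are treated as lower bounds, and the names `SAMPLES`, `coverage`, and `coverage_report` are hypothetical, introduced purely for illustration.

```python
# Counts quoted in the paragraph above. The reference totals are given as
# "more than fifty" and "more than 200", so 50 and 200 are lower bounds and
# the computed coverage figures are therefore upper bounds.
SAMPLES = {
    "Iranian (extant)": (9, 50),
    "South Asian I-E (extant)": (15, 200),
}


def coverage(included: int, reference_total: int) -> float:
    """Return the share of the reference languages that the sample includes."""
    if reference_total <= 0:
        raise ValueError("reference_total must be positive")
    return included / reference_total


def coverage_report(samples: dict) -> dict:
    """Map each group name to its coverage share."""
    return {name: coverage(inc, tot) for name, (inc, tot) in samples.items()}


if __name__ == "__main__":
    for name, share in coverage_report(SAMPLES).items():
        print(f"{name}: at most {share:.1%} of listed languages included")
```

On these figures, the data set covers at most 18 percent of the extant Iranian languages in the Wikipedia tree and at most 7.5 percent of the South Asian Indo-European languages counted by the Ethnologue.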

By including European I-E languages much more readily than non-European ones, the authors evince a form of Eurocentrism. The same tendency is encountered in their treatment of extinct languages. For western and central Europe, nine dead languages are listed, including Old Irish, Old High German, Old English, and Old Prussian. Fair enough. But for northern South Asia, an area of roughly similar territorial extent and historical population levels, only Vedic Sanskrit makes the list. The many extinct Prakrit languages are excluded without reason. Here preservation bias cannot be the culprit, as a number of these languages are relatively well known. Even Pali, a semi-living language owing to its liturgical position in the Theravada Buddhist community, is inexplicably left off the map.

The Bouckaert model stumbles even more sharply in regard to extinct Iranian languages. Only two are included: Old Persian and Avestan. Major Eastern Iranian languages that were once important literary vehicles, such as Sogdian, Bactrian, Khotanese and Khwarezmian, are simply disregarded. So too are the less well-known Scythian languages of the steppe zone.*** As noted in previous posts, had the Scythian languages been included in the model, the geographical patterns generated would likely have been quite different. Although one could argue that the Scythian languages are not known well enough to have been used, such an argument amounts to an admission that preservation bias compromises the approach. The failure to include well-known Sogdian, on the other hand, cannot be attributed to preservation bias, and is perhaps rooted instead in carelessness, ignorance, or the simple desire to mold the data in order to reach pre-established conclusions.

As the supplementary materials make clear, the authors of the study are fully aware that they have excluded a number of Indo-European languages, both living and dead. Yet in an interview with Isabelle Boni for the general public, co-author Quentin Atkinson maintains that “we compare these words across all Indo-European languages” (emphasis added). Such a statement is careless and misleading at best.

*Admittedly, Western Punjabi is sometimes counted as one of the Lahnda languages, but not Eastern Punjabi.

** The 20 million figure used here assumes that Marwari is counted as a separate language, as it is in Bouckaert et al.

***It is also notable that the Indo-European Thracian language(s), along with the other Paleo-Balkan languages, are likewise ignored.



  • The data have been manipulated too. I could examine the list for Vedic Sanskrit. Often, words have been chosen from the Vedic in such a way that these would be scored “not cognate” in the computerized analysis. One simple example of such manipulation is the meaning “warm”: for Vedic Sanskrit they have listed uṣṇa, instead of the Vedic word gharma- (cognate to PIE *ghwer-). Other such manipulations in compilation are: for “sea” Vedic samudras and not mīra; for “sky” Vedic daus and not nabha. The list of such manipulations is very long, and the examples here are not exhaustive.

    • Thanks, Premendra! This is getting better and better! We haven’t gone over all the lists they used, so I’m wondering if similar problems arise for other languages as well… In any case, since they never define “cognates” I wonder whether they might have missed some or counted non-cognate look-alikes as cognates too…

      • Yes, thanks as well. My next post will take on their mapping of Vedic Sanskrit, which is truly bizarre, as they portray it as covering precisely the area of pre-partition Punjab!

  • Paolo

    For completeness, please note that the Italian variant of Tosk that is mentioned by Ethnologue (aae) is hardly intelligible with Tosk. See for examples.

    • Thanks for this comment, Paolo! It is my understanding that the Italian variant of Tosk (aae) is somewhat distinctive lexically (with unsurprising lexical borrowings from Italian, Sicilian, and Greek), but grammatically fairly similar to Tosk in Albania. While there is no strict measure as to how much difference is needed to qualify as a separate language, most linguists treat it as a dialect (a member of the Tosk dialect group) rather than a full-fledged language. Its separate mention in the Ethnologue is due mostly to sociolinguistic rather than purely linguistic reasons, as its speakers are for the most part literate in Italian rather than Albanian. As with any other source of data, it is important to keep in mind what the goal of that data collection is, and with the Ethnologue, sociolinguistic factors outweigh linguistic ones (hence their splitting tendency). While the issue is surely controversial, Martin’s point in the post still stands: Bouckaert et al. do not spell out (and I doubt they actually have) any clear criteria as to which varieties are included in their list and which ones are not. If the Italian variant of Tosk is included, many other comparable varieties should be as well, for balance, and they aren’t.

    • Excellent point. But as Asya notes, I have no problem regarding “Italian Tosk” as a separate language, provided that other equally divergent dialects of other languages get the same treatment.

  • 1. Marwari is actually the basis for the Rajasthani standard: In other words, you’d have a better case to split Bagri, Mewari and so on from “Rajasthani” than Marwari. If you split Marwari from Rajasthani, then what is Rajasthani? It’s like splitting “Tuscan” from “Italian”.

    2. If you’re using the criterion of mutual intelligibility, there’s no justification for treating Serbian and Croatian as separate languages. There’d be a better case to treat Kajkavski, Čakavski and Štokavski as languages – both standard Serbian and Croatian are based on a particular dialect of Štokavski. Kajkavski is actually closer to Standard Slovene than it is to Standard Croatian, and the Torlak Bulgarian-Macedonian-Serbian transitional varieties are harder for standard-language-speaking Serbs to understand than Croatian.

    It makes no philological sense to treat “Croatian” as a language with Štokavian, Kajkavian and Čakavian dialects, and then Serbian with Torlak and the same Štokavian dialect. Not to mention Bosnian and Montenegrin, which entirely belong to a subvariety of Štokavian (Ijekavian). The Serbian, Croatian, Bosnian and Montenegrin languages only make sense as sociolinguistic constructs.

    • Thank you for your comments. I agree with you absolutely on “Serbo-Croatian”. They should have looked at dialects, to the extent that they are different enough at the level of the Swadesh 200-word list… As for “Tuscan” vs. “Italian”, they are not identical, but again, some uniformity as to what units are chosen for analysis is needed: all dialects, all languages, but not a weird mixture of the two.

    • Good point about Marwari — thanks for providing it. But I would also note that the authors have mapped “Marwari” as essentially coincident with Rajasthani. So why not call it that? Would one ever label all Italian dialects as “Tuscan?”

      And thanks as well for your comments on Serbo-Croatian, which are spot-on. I hope that I have handled this issue better in today’s post.