The Malformed Language Tree of Bouckaert and His Colleagues

Submitted on September 10, 2012 – 4:46 pm
In the previous GeoCurrents post, we examined the first element in the research of Bouckaert and his colleagues: identifying and comparing cognate sets, which is a prerequisite for the second step, namely constructing the linguistic family tree and putting a time scale on it. In this post, we will focus on the problems surrounding their diagram of the branching Indo-European languages; the next post will focus on dating issues. But first let’s consider how their methodology works.

Essentially, the method they employ is based on comparing cognate sets and calculating the number of shared cognates, which allows the grouping of languages into subsets based on the assumption that the more cognates a given pair of languages shares, the closer their relationship. A phylogenetic tree is constructed to represent those relative degrees of relatedness. The number of shared cognates also allows the researchers to estimate relative dating of splits on the tree: if language A and language B share 97% of cognates (i.e. only 3 out of 100 items are not cognates), whereas language C and language D share merely 85% of cognates (i.e. 15 out of 100 items are not cognates), the split between C and D is taken to be five times older than the split between A and B. After relative dating has been so established, absolute timing is determined by factoring in known dates of specific historical events that are thought to be associated with splits on the tree. A caveat must be made here: any dating of the branching patterns of a linguistic tree presupposes that splits between separate languages are discrete events that happen at a certain point in time. In actuality, they are not, as language divergence is a gradual process. Certain historical dates can be assumed as approximate divergence dates, but only to a limited extent. For example, 1492 CE—the year Jews were expelled from Spain—can be taken as the divergence date between Spanish and Balkan Ladino.* Prior to the expulsion of the Jews from Spain, their language—though containing some borrowings from Hebrew—was virtually indistinguishable grammatically from that of the Christian Spaniards (see also chapter 12 of Languages of the World: An Introduction). Another example of a historical event used to estimate the time of a divergence on the tree is the split of Dutch and Afrikaans, dated to the establishment of the Cape Colony (though Afrikaans did not derive from Dutch in general but rather from certain specific dialects of that language).
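The relative-dating arithmetic in the 97% versus 85% example can be written out as a minimal sketch. It assumes, as the simplified description does, that the age of a split is taken to be proportional to the share of non-cognate items; the function names are hypothetical and are not taken from Bouckaert et al.

```python
def non_cognate_share(shared_pct: float) -> float:
    """Fraction of list items that are NOT cognate, given the % of shared cognates."""
    return (100.0 - shared_pct) / 100.0


def relative_age_ratio(shared_pct_older: float, shared_pct_younger: float) -> float:
    """How many times older the first split is than the second.

    Assumption: age is simply proportional to the non-cognate share.
    """
    younger = non_cognate_share(shared_pct_younger)
    if younger == 0:
        raise ValueError("languages sharing 100% of cognates give no usable baseline")
    return non_cognate_share(shared_pct_older) / younger


# C and D share 85% of cognates; A and B share 97%.
ratio = relative_age_ratio(85, 97)
print(round(ratio, 2))
```

The ratio only orders the splits relative to one another; turning it into calendar dates still requires the historical calibration points discussed above.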

In the ideal situation, the procedure described above can produce a reasonably serviceable model of language spread and divergence through time. But things are rarely as smooth in practice. As a brief demonstration of the potential pitfalls encountered by such a model, let’s consider how it works—and how it breaks down—in regard to a small selection of lexical items from seven languages spoken in Vanuatu and nearby regions of Papua New Guinea, shown in the image on the left. Colored cells in the table represent words in each lexical set that appear to be similar, whether through common descent (true cognates) or borrowing (which we cannot distinguish at this point, as is the case in regard to many of the supposed cognates used in Bouckaert et al.’s article as well). As can be seen, the Motu, Sowa, Mota, and Raga words for ‘wind’ are similar, but the Hiw, Waskia, and Amara counterparts are not. Mota and Raga have the most similar items, eight each. Next come Sowa with six similar items, Hiw with five, and Motu with four. Waskia and Amara appear to have no shared vocabulary, at least not in this selection.

Based on this data alone, we can hypothesize that Mota and Raga are the most closely related languages, with Sowa a more distant “cousin”, Hiw more distant yet, and Motu least closely related of all; Waskia and Amara, on the other hand, have to be treated as unrelated to each other and to the rest of the languages under consideration. (Relative dating of the splits can also be established, but we will not do so here.)
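The naive inference in the paragraph above amounts to sorting languages by their count of similar items. A minimal sketch follows, using only the counts reported in the text; the function name `naive_ranking` is hypothetical, and treating a zero count as "unrelated" is exactly the simplifying assumption that a fuller analysis overturns.

```python
# Counts of similar-looking items per language, as reported in the text.
similar_items = {
    "Mota": 8, "Raga": 8, "Sowa": 6, "Hiw": 5, "Motu": 4,
    "Waskia": 0, "Amara": 0,
}


def naive_ranking(counts):
    """Split languages into a closeness ranking and an 'unrelated' remainder.

    Languages with at least one similar item are ordered from most to fewest
    items; languages with none are set aside as unrelated.
    """
    related = sorted(
        (lang for lang, n in counts.items() if n > 0),
        key=lambda lang: -counts[lang],
    )
    unrelated = [lang for lang, n in counts.items() if n == 0]
    return related, unrelated


related, unrelated = naive_ranking(similar_items)
print(related)
print(unrelated)
```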


But if one examines additional vocabulary elements along with grammatical properties of these languages—as well as their known histories—a very different picture is revealed, schematized in the chart on the left. While Hiw, Sowa, Mota, and Raga are indeed grouped together into the East Vanuatu subset, Motu is more closely related to Amara than to the first four languages. Both Motu and Amara belong to the Western Oceanic grouping, although they belong to distinct subsets within that grouping, which accounts in part for their failure to share any similar words in our sample. Together with other clusters of languages, East Vanuatu and Western Oceanic are members of the Oceanic branch of the Austronesian family. The Oceanic branch includes over 500 languages, among them the Polynesian tongues discussed in an earlier GeoCurrents post. Waskia, the seventh language in our set, belongs to the Trans-New Guinea language family, completely unrelated to Austronesian. The discrepancy between the tree produced by comparing potential cognates and the tree established by a more thorough analysis highlights an essential point: while words are an important element of any language, grammatical patterns—that is, how words are put together—are equally important. We shall return to this crucial point below.

As with our Austronesian/Papuan model, the lexicostatistical methods employed by Bouckaert et al. produce a tree that has a questionable configuration. For example, the authors group Armenian and Tocharian together, with the split of the hypothesized Tocharo-Armenian proto-language dated to approximately 3200 BCE (5,200 years ago). As noted by Alexei Kassian of the Institute of Linguistics, Russian Academy of Sciences in his critique, “nobody has ever proposed such a grouping, which also directly contradicts not only traditional linguistic arguments, but also formal lexicostatistics as well”. The study also groups Frisian with Flemish and Dutch, rather than English, which many Germanic scholars find objectionable. This grouping is probably due to the fact that Frisian and Dutch have heavily borrowed from one another. Thus, Frisian being grouped together with Flemish and Dutch suggests that a large number of borrowings were mistaken by Bouckaert et al. for cognates. As we shall see below, many other odd configurations on Bouckaert et al.’s tree likely derive from the same problem. It should also be pointed out that the authors of the Science article make a critical mistake in that they do not distinguish shared innovation from shared retention. It has become a cornerstone of historical linguistics that only shared innovations have probative value when it comes to language classification. Otherwise, a variety of English which uses stone has to be considered closer to German (which uses the cognate stein) than to a variety of English which uses (the French loanword) rock. However, because stone and stein are shared archaisms, they are of no worth for classification.

Without going through each split on the tree one by one, I will point out some of the most glaring errors, starting with those concerning the Romance languages. According to Bouckaert et al.’s tree, Romanian was the first language to separate from Latin, while Sardinian is grouped together with Italian, Romansch, and Ladin in a significantly later split (we will return to the Romanian problem in the next GeoCurrents post). Most traditional classifications, such as the one schematized in this Wikipedia chart, treat Sardinian as the first language to have branched off the Romance sub-tree, and for good reason; Sardinian has a substantial non-Indo-European substrate, which probably indicates that Sardinian Latin began to diverge from classical Latin as soon as the language was imposed on the island in 238 BCE. In regard to the tongues of the Italian Peninsula, Friulian is usually grouped with Romansch, Ladin, and other speech varieties spoken in Northern Italy, which are more closely related to Gallo-Romance (including French) than to standard Italian, as the latter is based on the Tuscan dialect, belonging to the South Romance grouping. Among the commonalities that Northern Italian dialects share with (Standard) French are grammatical effects resulting from contacts with the Germanic languages of the Franks, Burgundians, and Longobards, such as using subject-auxiliary inversion to form yes/no questions (illustrated by English Have you…? or French Avez-vous…?) and the obligatory nature of a subject in every sentence (illustrated by the ungrammaticality of *Rains in English or the corresponding *Pleut in French). Northern Italian dialects share these patterns with English and French; southern Italian dialects do not.

Another unsupportable configuration of the Science model concerns the internal grouping of the Slavic languages. According to the classification employed by Bouckaert et al., Byelorussian and Polish are sibling tongues. They contend that Byelorussian is more distantly related to Ukrainian and Russian than it is to Polish, but also that Polish is more closely related to Ukrainian and Russian than it is to Czech, Slovak, or Lusatian. Such a notion contradicts the well-established classification of Slavic languages into South Slavic (including Slovenian, Serbo-Croatian, Bulgarian, and Macedonian), East Slavic (including Russian, Ukrainian, and Byelorussian), and West Slavic (including Czech, Slovak, Lusatian, and Polish!). This established classification scheme is based not merely on cognates, but also on phonology and grammar. For example, one of the sound patterns that characterizes East Slavic languages—but not Polish—is so-called pleophony (in Russian, polnoglasie), that is, having an extra vowel/syllable in words like the Russian moloko, Byelorussian malako, and Ukrainian moloko for ‘milk’ (all stressed on the last syllable). Polish, like other West Slavic and South Slavic languages, does not have this feature; hence, the Polish mleko, the Czech mleko, the Slovak mlieko, the Slovenian, Serbo-Croatian, and Macedonian mleko, and the Bulgarian mljako. Additional arguments—which are numerous—for classifying Polish as a West Slavic language, more closely related to Czech or Slovak than to Byelorussian, Ukrainian, or Russian, are discussed in detail in Sussex & Cubberley (2006). Even earlier glottometric approaches similar to those employed by Bouckaert et al. failed to group Polish with Byelorussian.** Bouckaert et al. appear to have hypothesized a close tie between Polish and Byelorussian due to a high degree of “horizontal transmission” (i.e. borrowing) that characterized the two languages for centuries. From the 14th century to the late 18th century, the Byelorussian lands were politically tied to Poland, first as part of the Grand Duchy of Lithuania, in personal union with Poland, and later as part of the Polish–Lithuanian Commonwealth. Significant ethnic and linguistic mixing characterized these lands as recently as 100 years ago.

In regard to the higher level of Slavic classification, Bouckaert et al. group West Slavic and East Slavic languages together, separating them from the South Slavic languages. While this classification is supported by many patterns in sound structure, morphology, and word order, other historically based patterns support different classification schemes. Some scholars group East and South Slavic together as opposed to West Slavic; others draw a distinction between North Slavic (including Polish, Sorbian, and the three East Slavic languages) and South Slavic (including, perhaps confusingly, the traditional South Slavic languages as well as Czech and Slovak). Confusion here is generated by the fact that some linguistic features cut across internal Slavic boundaries: for instance, fixed-stress languages include those of the West Slavic branch and Macedonian. Some of these cross-cutting features are due to religious affiliations: languages whose speakers tend to be Orthodox Christians favor Greek-based lexis, while languages whose speakers are non-Orthodox “often show a greater preference for indigenous or Western lexis” (Sussex & Cubberley 2006, p. 9). The use of a Cyrillic or a Latin-based alphabet also lines up with religious affiliation. Other grammatical similarities that cut across the traditional Slavic language-family categories are due to extensive borrowing, as has been demonstrated in the case of the Balkan sprachbund (i.e. an area where linguistic features are shared across family boundaries). Two Slavic languages found in the Balkan sprachbund area—Bulgarian and Macedonian—share a number of features with non-Slavic neighboring languages such as Romanian and Albanian, most notably their reliance on suffixed articles. Compare the Bulgarian grad-at ‘the city’ with the Romanian counterpart oraş-ul (the hyphen is used to show the boundary between the root and the suffixed article): although the suffixed articles are themselves different—Bulgarian -at vs. Romanian -ul—the use of such suffixed articles in general is distinctive, limited in Europe to the Balkans, Scandinavia, and Lithuania. The difficulties encountered in trying to fix the higher-order classification of the Slavic languages indicate that a tree-based model may not be the ideal tool to represent language relatedness. As a result, newer network- and wave-based models are gaining ground in historical linguistics.

The internal groupings within the Indo-Aryan branch of the Indo-European family proposed by Bouckaert et al. are similarly unexpected. As can be seen in the figure to the left, which takes a segment of their tree and color-codes each language to represent the established classification scheme, hardly any of the traditional groupings are reproduced by Bouckaert et al.’s algorithm. The only cluster of languages that corresponds to the schema created by scholars of the Indic languages consists of Assamese, Oriya, and Bengali, which are all members of the Eastern Zone. Bihari, also traditionally classified as a member of this group, finds itself loosely related to Hindi and Urdu (both Central Zone languages), as well as to Lahnda, a Northwestern Zone language. Singhalese, which is typically treated as a member of a different grouping altogether (not color-coded here), is linked by Bouckaert et al. together with Kashmiri, another Northwestern Zone language, more closely related to Lahnda and Sindhi. (Many linguists classify Kashmiri as a Dardic language, which would place it on a highly distinct branch of the Indo-Aryan languages.) Breaking from all established classification schemes, Bouckaert et al. group Sindhi with Marwari, a member of the Central Zone, and Lahnda with Urdu. They further claim that Urdu and Hindi are not particularly closely related, their split dating to about 1200 CE. In actuality, Hindi and Urdu are basically mutually intelligible, and as a result many people consider them to be different forms of the same language. We know from unassailable historical sources, moreover, that these two languages broke from the Hindustani dialect continuum only in the 19th century. Similar mistakes are found elsewhere on the tree. Bouckaert et al., for example, treat Gujarati and Marathi as relatively closely related, but scholars with actual expertise in this area argue that Gujarati is more closely related to Hindi, Urdu, and Marwari than it is to Marathi.

Yet the biggest oddity of the modeled linguistic tree involves its treatment of Romani, the language of the Gypsies, or Roma people, whose sociolinguistic aspects are discussed in more detail in an earlier GeoCurrents post. According to Bouckaert et al., Romani was the first Indo-Aryan language to split off from the rest of the tree, around 1500 BCE (3,500 years ago). The improbability of this date was noted by biologist and blogger Razib Khan. What Khan does not mention, however, is that linguistic analysis alone can demonstrate its absurdity. Here one needs to examine the evolution of the Romani grammatical gender system and of the corresponding systems in other Indo-Aryan languages. Earlier forms of these languages had three genders—masculine, feminine, and neuter—inherited from proto-Indo-European. But for reasons that need not concern us here, the Indo-Aryan languages, including Hindustani and others, lost the neuter gender. From written sources we know that this change occurred some time around 1000 CE. Once the neuter gender was lost, the formerly neuter nouns were reassigned to either the masculine or feminine gender, seemingly at random. Modern Romani too has only two genders, masculine and feminine. As in Hindustani, and hence modern Hindi, the majority of the formerly neuter nouns became masculine in Romani. Crucially, gender reassignment occurred in the same manner in Romani as in Hindi. For instance, agni ‘fire’, which was neuter in Prakrit, the ancestral language of modern Indo-Aryan languages, became the feminine āga ‘fire’ in Hindi and likewise the feminine jag ‘fire’ in Romani. Such parallel changes apply to hundreds of formerly neuter nouns; thus, it is statistically all but impossible that they could all be reassigned to the same gender in Hindi and Romani purely by chance. The simplest explanation is that Romani separated from the languages of northern India after the loss of the neuter gender around 1000 CE and after the reassignment of the formerly neuter nouns, which happened only once, with both Hindi and Romani inheriting the novel forms. As a result, Romani could not have branched off from the languages of northern India before the 11th century CE.

Dating the Romani split to a period 2,500 years later than the one proposed by Bouckaert et al. receives further support from genetic studies, which place the “founding event” (that is, the Roma exodus from India) “approximately 32-40 generations ago”. Assuming 25-30 years per generation, this figure nicely matches the 1000 CE date derived from linguistic studies (see Morar et al. 2004). Bouckaert et al. probably classify Romani as the “odd man out” among the Indo-Aryan languages due to the fact that it picked up much of its vocabulary from languages it came into contact with during its journey from India to Europe: Armenian, Turkish, Persian, Kurdish, and especially Greek. Among the many Greek loanwords in Romani are drom ‘road’ from the Greek drómos ‘road’, zumin ‘soup’ from the Greek zumí ‘soup’, xoli ‘anger’ from the Greek xolí ‘anger’, as well as grammatical loanwords like pale ‘again’ from the Greek pale ‘again’, komi ‘still’ from the Greek akómi ‘still’ and numerals efta ‘seven’, oxto ‘eight’ and enja ‘nine’.
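The generation arithmetic behind the genetic estimate can be checked with a short sketch. The generation counts and generation lengths come from the text; counting back from 2004, the publication year of Morar et al., is an assumption made here purely for illustration, and the variable names are hypothetical.

```python
def years_ago_range(gen_min, gen_max, years_min, years_max):
    """Return (fewest, most) years elapsed for a range of generation counts and lengths."""
    return gen_min * years_min, gen_max * years_max


# 32-40 generations at 25-30 years per generation, as quoted in the text.
fewest, most = years_ago_range(32, 40, 25, 30)

REFERENCE_YEAR = 2004  # assumption: count back from the study's publication year
earliest_ce = REFERENCE_YEAR - most
latest_ce = REFERENCE_YEAR - fewest

print(f"{fewest}-{most} years ago, i.e. roughly {earliest_ce}-{latest_ce} CE")
```

Under these assumptions the window brackets the 1000 CE date obtained from the gender-reassignment argument.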

Yet again, we see that accepting the model proposed by Bouckaert and his colleagues requires one to believe not just three but actually dozens of impossible things before breakfast (with apologies to Lewis Carroll). We shall examine more of their linguistic failings in tomorrow’s post.



* After 1492, Ladino in the Balkans was wholly isolated from Spanish (making it a perfect separation point), whereas Ladino in North Africa (esp. Morocco) remained in contact with Spanish.

**Both Polish and Byelorussian were shown by these earlier studies to have close ties to Ukrainian, but lesser connection to each other, as reported in Sussex & Cubberley (2006, p. 474).




Morar, Bharti; David Gresham; Dora Angelicheva; Ivailo Tournev; Rebecca Gooding; Velina Guergueltcheva; Carolin Schmidt; Angela Abicht; Hanns Lochmuller; Attila Tordai; Lajos Kalmar; Melinda Nagy; Veronika Karcagi; Marc Jeanpierre; Agnes Herczegfalvi; David Beeson; Viswanathan Venkataraman; Kim Warwick Carter; Jeff Reeve; Rosario de Pablo; Vaidutis Kucinskas and Luba Kalaydjieva (2004) “Mutation history of the Roma/Gypsies”. American Journal of Human Genetics 75(4): 596-609.

Pereltsvaig, Asya (2012) Languages of the World: An Introduction. Cambridge University Press.

Sussex, Roland and Paul Cubberley (2006) The Slavic Languages. Cambridge University Press.



  • Trond Engen

    You say: As a result, newer network- and wave-based models are gaining ground in historical linguistics. But isn’t that a statement that might need some qualifying? The point (as this utter layman understands it) is that on a fairly shallow timescale, with related languages staying in contact, preserving a high degree of (local) mutual intelligibility, centers of innovation and sociolinguistic prestige will shift, and network- and wave-models make more sense, while on a deeper scale, when languages start to behave independently, the tree model is better suited. I mean, there are people out there who invoke these newer models as a reason to reject the whole damn Indo-Arborean family.

    • Asya Pereltsvaig

      You are absolutely right that no simplistic model can represent such a complex matter perfectly… and different aspects of this complex reality are better represented with one or the other model.

  • Trond Engen

    That’s not to say that the wave model is irrelevant for languages that aren’t closely related. It comes in handy again for sprachbund effects, calquing and culture words, but in these cases such effects have to be weeded out to see the trees underneath.

  • Tom D

    While many of your criticisms thus far have been pretty well-founded, here I think you misunderstand the methods used in Bouckaert et al. (2012).

    You describe a very standard model used in lexicostatistics: get a cognate set, determine the percentage of shared cognates or some other measure–which we can more generally call “distance”, and then group the languages with the shortest distances as closer to one another than languages with longer distances.

    This is not at all the method that Bouckaert et al. (2012) use. They use Bayesian maximum likelihood methods (hereafter Bayesian ML methods), which differ fundamentally from distance-based methods like lexicostatistics. The method basically works as follows.

    We have an initial simplifying assumption: we aren’t interested in anything other than cognacy. If, for instance, a given pair of cognate words across two languages have sound changes, this is unimportant to us. We could include this information if we wanted to, but to make our lives easier, we have simply decided to code characters (a possible cognate set) as 0 or 1.

    For instance, in Japonic, we have two cognate sets for the word ‘to sell’: one with the Ishigaki word kasïn, another with the Japanese word uɾu and with the Oogami word ʋː. We’d end up with a character set for words like kasïn as 100 (languages are Ishigaki, Japanese, and Oogami, respectively), and one for words like uɾu as 011. If we want to find the likelihood of Proto-Japanese-Oogami having a word like uɾu, it would clearly be 1 from our observed data–it is very likely that their common ancestor shared the same trait as them. If we go back to the common ancestor of Japanese, Oogami, and Ishigaki, it is less clear–a probability of only 0.5.
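The 0/1 coding described in this comment can be sketched as follows. The word forms and the language order are taken from the comment; the function name `binary_characters` and the dictionary layout are hypothetical, and the sketch covers only the coding step, not any inference.

```python
# Language order follows the comment: Ishigaki, Japanese, Oogami.
languages = ["Ishigaki", "Japanese", "Oogami"]

# Which cognate set each language's word for 'to sell' belongs to,
# labelled by a representative form from the comment.
cognate_set = {"Ishigaki": "kasïn", "Japanese": "uɾu", "Oogami": "uɾu"}


def binary_characters(languages, cognate_set):
    """Code each cognate set as a string of 0/1, one digit per language."""
    labels = list(dict.fromkeys(cognate_set[lang] for lang in languages))
    return {
        label: "".join("1" if cognate_set[lang] == label else "0" for lang in languages)
        for label in labels
    }


characters = binary_characters(languages, cognate_set)
print(characters)
```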

    However, what we’ve done so far was a simple example. Oogami could have very well borrowed ʋː from Japanese. While we know this isn’t the case, it can fool these methods. However, they have proven to be surprisingly robust to undetected borrowing in linguistic applications, as well as its analog, horizontal gene transfer, in biological applications (see Currie et al. 2010 for a discussion relevant to linguistics). We also haven’t included a model other than “the most parsimonious tree is the most likely tree.”–that is to say, the tree with the fewest changes in state of cognate sets to go from the ultimate common ancestor to all of the daughters is probably the correct tree.

    It is very important to remember that we’re still dealing with cognates all the way down the tree. Like I said, we can actually compute the likelihood that Proto-Oogami-Japanese has a form like uɾu (P = 1) versus a form like kasïng (P = 0).

    Bouckaert et al. contend that we can do better than that (as do biologists who’ve developed models for DNA, RNA, etc. evolution): we can develop a simple model for cognate change–really, an abstract, simple model for language change. We know that some words evolve faster than others, so we want cognates to be replaced in those words quicker than in other words.

    I’ll spare some of the details, but, for instance, we know based on the rest of the Japonic languages, that kasïng is an innovation specific only to the Yaeyama Islands. Therefore, we could say ‘to sell’ evolves very slowly. Given this data–we hypothesize it evolves very slowly, and kasïng cognates are only found in the Yaeyama Islands–it then becomes much more likely that the Proto-Japonic form was something like uɾu, so instead of a probability of 0.5, it might end up being something more like 0.9. We can then say that, given our data and given our model, we’re 90% certain that a word cognate to uɾu was the Proto-Japonic form for ‘to sell’.

    I hope I’ve made it pretty clear that this is not just “counting cognates” or however one wishes to describe lexicostatistics–these are fundamentally different methods.

    The attraction of using these models is that they can possibly let us do things like infer dates (by calibrating branch lengths to actual times), make geographic inferences (as we’ve seen here), etc.

    • Asya Pereltsvaig

      Thanks, Tom, for explaining Bouckaert et al.’s methods in such great detail and so clearly. I was simplifying when I said that they “count cognates”, but what you explain isn’t that far from that, as the probabilities thus calculated are based on the number of items that look alike across various subsets of languages. It is absolutely true that sound changes etc. are not taken into account (while considering innovations rather than retentions would give a better model). Also, you say that “these methods… have proven to be surprisingly robust to undetected borrowing in linguistic applications” — but I am surprised at how many of the obvious mis-calculations on the tree “scream borrowing” (I’ve mentioned the Polish-Belorussian issue, the Romani problem, and the Romanian problem in my critique, and I am sure there are others).

      • Tom D

        “I was simplifying when I said that they “count cognates”, but what you explain isn’t that far from that, as the probabilities thus calculated are based on the number of items that look alike across various subsets of languages.”

        I think I’m the one who seems to have oversimplified here, and looking back at it, I’ve ended up saying something I didn’t mean.

        The probabilities are not calculated based on the number of items that look alike. (In fact, they are calculated based on the number of trees in the set of trees the methods generate which look alike.) I know this isn’t exactly what I said above, so I’ll explain some more.

        These methods actually infer the character state going back through the tree, and at every node (i.e., the point where two branches join). This means they really are able to distinguish a shared innovation from a shared retention. One of the methods they used to analyze this was the stochastic Dollo model. As this is the simplest one, I’ll go over it (most of the following is a paraphrase of Felsenstein 2004).

        In a naive kind of Dollo parsimony, there are two possible states for a cognate set in a given language: 0 (“absent”) or 1 (“present”). 0 is ancestral, while 1 is complex. In parsimony-based methods in general, we want to find the tree with the fewest changes. 0 is allowed to change state into 1 once and only once. 1 can change back into 0 as many times as we care to have it. There are multiple ways to do this computationally–but the simplest and most intuitive is to simply assign penalties to changes of state. We heavily penalize a change from 0 > 1, but only lightly penalize a change from 1 > 0.

        How this method distinguishes shared innovations from retentions is kind of “hidden” in the algorithm, but they certainly do distinguish them. Let’s take my Japonic example again. Using this simple algorithm, let’s look and see which is more likely to be innovative: uru or kasïng.

        Both Japanese and Oogami have a form cognate to uru, while Ishigaki does not. And neither Japanese nor Oogami has a form cognate to kasïng, while Ishigaki does. Again, we’d put these into binary characters: uru 110 and kasïng 001.

        If going from 0 > 1 is costly and going from 1 > 0 is not, our algorithm will spit out that the state of Proto-Japonic for the cognate set uru is 1, and that Ishigaki lost it (going from 1 > 0). Other reconstructions, like that PJ didn’t have it and Japanese and Oogami gained it, are much more penalized and are dispreferred.

        The kasïng cognate set actually shows how we really need more than just three languages for this sort of model. Depending on how we penalize things, we might have the incorrect tree, where PJ has a cognate for kasïng, and then both Japanese and Oogami lost it. There are two ways to prevent this. The first is to use more than three languages. For instance, Hachijoo has a form cognate with uru. We would then have more evidence that Ishigaki is innovative. Also, in a more basic sort of parsimony, where we’re only counting numbers of changes, we would certainly be pointed towards the correct reconstruction: since one change is less than two changes, we can see that it’s simply that Ishigaki is innovative. So even something as simple as counting the number of changes between possible trees can give us a (naive) way to distinguish shared innovations from retentions.
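The penalty scheme described in this comment can be illustrated with a small weighted-parsimony sketch (a Sankoff-style cost calculation). It is only an approximation of the idea: it is not strict Dollo parsimony, not the stochastic Dollo model, and not what BEAST does. The penalty values (5 for a gain, 1 for a loss), the unresolved three-way tree, and all names are assumptions chosen for illustration.

```python
import math


def min_costs(node, leaf_states, gain, loss):
    """Return [cost if node is 0, cost if node is 1] for the subtree at node.

    A leaf is a language name (str); an internal node is a tuple of children.
    gain is the penalty for 0 > 1, loss is the penalty for 1 > 0.
    """
    if isinstance(node, str):
        return [0.0 if leaf_states[node] == s else math.inf for s in (0, 1)]
    change = {(0, 0): 0.0, (0, 1): gain, (1, 0): loss, (1, 1): 0.0}
    children = [min_costs(child, leaf_states, gain, loss) for child in node]
    return [
        sum(min(change[(s, t)] + costs[t] for t in (0, 1)) for costs in children)
        for s in (0, 1)
    ]


# Unresolved tree: no prior information about how the three languages group.
star = ("Japanese", "Oogami", "Ishigaki")
uru = {"Japanese": 1, "Oogami": 1, "Ishigaki": 0}
kasing = {"Japanese": 0, "Oogami": 0, "Ishigaki": 1}

# Costly gains, cheap losses: the ancestor is reconstructed as having uru.
print(min_costs(star, uru, gain=5, loss=1))
# Same penalties mislead for kasïng: "present in the ancestor" is cheaper.
print(min_costs(star, kasing, gain=5, loss=1))
# Plain change counting prefers a single Ishigaki innovation.
print(min_costs(star, kasing, gain=1, loss=1))
```

The lower of the two numbers in each result marks the preferred ancestral state, which reproduces the three cases walked through in the comment: uru retained, kasïng wrongly reconstructed under heavy gain penalties, and kasïng correctly treated as an innovation when changes are simply counted.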

        We then need to lay these two trees on top of one another. But, since we don’t have any real information that Japanese and Oogami are different, all we can really do is just make a split between Ishigaki and Japanese-Oogami.

        I should hope that it is obvious that borrowings will wreak havoc on naive parsimony methods. Another issue is if we don’t know what a particular cognate should be (and have no way of asking), like with an extinct language. This is why they have largely been abandoned by biologists, and why Bouckaert et al. use a modified version of the above, as well as Bayesian maximum likelihood (Bayesian ML) methods.

        The Bayesian ML methods are more complex–enough that I won’t explain them here, but they let us do more (like separating “rate of change” and “length of time” from a more general “amount of change”, usually visualized as branch length). They too operate using character state changes, and they too distinguish shared innovations from retentions.

        About the probabilities: these stochastic Dollo and Bayesian ML methods, unlike a naive Dollo parsimony method, generate a large set of trees, because they account for uncertainties, conflicts, and test out slightly different variations of the models you give them. The probabilities given represent the number of trees that agree with a particular subgrouping, not the percentage of cognate similarities. For instance, by default, BEAST, the program they use to do these analyses, spits out 10,000,000 trees. If we’d say Ishigaki and Japanese-Oogami split off from one another and we’re 90% sure about it, it’s not because they share 90% of their cognates, but because 90% of the trees in this set of 10,000,000 trees agree that this is the correct grouping. Further, in my model above, it wasn’t because they share 90% of anything. It is how likely it is, given the data and the model, that that is the correct subgrouping.
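The point about probabilities being shares of sampled trees, rather than shares of cognates, can be shown with a toy sketch. The ten-tree sample below is invented purely for illustration and is not output from BEAST or from any real analysis; each tree is reduced to the set of subgroupings (clades) it contains, and the function name `clade_support` is hypothetical.

```python
def clade_support(trees, clade):
    """Fraction of sampled trees that contain the given subgrouping."""
    clade = frozenset(clade)
    return sum(clade in tree for tree in trees) / len(trees)


japanese_oogami = frozenset({"Japanese", "Oogami"})
ishigaki_japanese = frozenset({"Ishigaki", "Japanese"})

# Toy sample (made up): nine trees group Japanese with Oogami, one does not.
sample = [{japanese_oogami}] * 9 + [{ishigaki_japanese}]

support = clade_support(sample, {"Japanese", "Oogami"})
print(support)
```

Here the support value reflects only how many trees in the sample contain the grouping; nothing in the calculation refers to a percentage of shared cognates.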

        Sorry for oversimplifying before; I hope this makes things a little clearer.

        • Asya Pereltsvaig

          Thank you for your detailed explanations. So if their model is supposed to identify borrowings, why does it get most off-track when a high (or especially a low) degree of borrowing is involved?

          • Tom D

            Well, the model does not and cannot identify borrowings per se; it’s more that the phylogenetic signal (the relatedness inferred from everything else) is still strong enough in spite of the borrowings screwing it up. Other methods can be used to try and identify borrowings (see Claire Bowern’s recent article on Tasmanian languages), but they didn’t employ them here.

            I mentioned this kind of offhandedly, but one of the articles discussing borrowing I was thinking about was Greenhill et al. (2009), where they did a simulation study (i.e., fake data which lets us know what the true tree is and where the borrowings are) to look at the potential effects of borrowing on these sorts of analyses. They found that for the sorts of word lists they had been using, borrowing of up to 20% of the vocabulary was realistic, and that we more or less find the true tree at even up to a 30% rate of borrowing.

            They also found that the older the borrowing, the more problematic (as we might expect). Interestingly, they found that with higher levels of older borrowings, attempts to convert the amount of change into a date (and rate of change) would actually end up with a younger-than-expected proto-language.

            This tree looks very different than some of their previous analyses (like in Gray and Atkinson’s original article in Nature). There are a couple of possibilities for why this is.

            First is that they used different data which hadn’t been scrutinized properly to weed out borrowings. But unless someone goes through all of their data, we won’t know. As the mantra goes, “any errors remain [their] own”, but in this case, it could well have been somebody else who screwed up, since their data is from already published word lists and databases. And this would be a monumental, if not impossible, task even for the most seasoned Indo-Europeanist, because do we really know all of the borrowings throughout the history of IE languages? I doubt it…

            Second is the methods themselves. They limit themselves to basic vocabulary, but for very good reason. These lists are easy to obtain and are available even for some of the extinct and ancient IE languages. While basic vocabulary can certainly be borrowed, it appears from initial investigations to be borrowed at a lower rate. If we didn’t limit ourselves to this, we might resolve some issues better (particularly Sardinian), but we might get lots more problems (especially for extinct and ancient languages, like Hittite, as well as languages where we know there’s a lot of borrowing, like Romani).

            There are plenty of other issues that could be at play, but here I think it’s more than likely the data.

            A final thing to keep in mind is that this may actually be the true tree (or a close approximation thereof), but only for the history of IE vocabulary. I have reservations about the accuracy of this tree; it likely isn’t a close approximation of the true tree. But including other sorts of data, like sound changes, could tell us something different, and grammatical changes could tell us something different again. Like including extra vocabulary, we could do these things, but unlike vocabulary, we face a serious issue of how to code them, even if we try to go beyond binary “presence-absence” character states.

            In the end, vocabulary seems to be a nice fit with the tools we have and the things we want to do with our data and our tools. And, it’s interesting that their earlier work on IE (and Austronesian, etc.) actually gets so much right, even without the additional evidence from grammatical changes and sound changes.

          • Asya Pereltsvaig

            Thank you for your continuing comments, Tom! And also for including the links to the articles you mention — this is really helpful to us and to the other readers! A couple of points:

            1) A model that “actually gets so much right” or “isn’t a close approximation of the true tree” should not be advertised as “decisive support for an Anatolian origin” or the “final solution”, or whatever… (now of course the press did most of this advertising, but as we’ve made clear elsewhere in our critique, our problem is BOTH with the actual paper and the way it has been publicized in the media).

            2) As regards borrowings, it is probably possible to weed them out computationally, without recourse to etymologies of each and every word in each and every language. But to do so, sound changes (and possibly grammatical changes) need to be taken into account. In fact, I give my students in the very introductory linguistics course (i.e. with not a single linguistics prerequisite) an exercise exactly like that. They manage. Professional linguists should too. At least for some of the data, of course.

            3) It’s too easy to say that the cognate data is bad and it’s not theirs, so they are off the hook. I don’t agree with that, though. First, I don’t believe in using somebody else’s data and not having responsibility for it. Everybody is allowed to publish bad work, and many people do. The peer-review process is supposed to take care of it, but it does fail at times. It doesn’t give other scholars a carte blanche for reusing bad data without critical assessment. Or would these authors take the blame if someone else uses their faulty results and builds upon them? I doubt it. Anyway, there’s a bigger problem here: it is not that this or that word fed into their model was incorrectly analyzed by some other linguist. Such problems, if minor, would be easy fixes. The larger issue is that their model NECESSITATES the use of such data that is likely to be incomplete, faulty, etc. When phonological or grammatical changes are considered, we are dealing with a SYSTEM, not just a list of idiosyncratic items, so there is less room for error… But as you say, applying this model to phonological or grammatical changes is far too complicated.

          • Dragos

            In my opinion it’s not only the data, but also the models, which are not good enough. All the IE trees suggested by the models of Gray, Atkinson et al. have only binary branches splitting off in a similar, and often unrealistic, manner. For example, in the Romance group the divergence followed the quick expansion of the Roman empire (which did not start immediately; not every regional variation of Latin was a “Romance” one, see J. N. Adams’ The Regional Diversification of Latin 200 BC – AD 600). None of the trees is able to capture such a phenomenon, because the amount of change found in each language is naively correlated with a branch length. That’s why I find these sophisticated algorithms not much better than the standard lexicostatistical methods. There are some improvements though: the change/time ratio is not fixed and calibration is possible, but this is not enough. To get a realistic tree one has to add pre-conditions to various clades, because the models don’t know the real history of the speakers. And while we can fix the trees for the historically attested languages, we have no information for the prehistoric ones. Eventually there may be language families which can be described successfully by such trees, but how do we know this is not just chance, and how do we know the IE family is one of them?

            I already tried TraitLab and I described one of my experiments here: I also performed another one with sound changes, but I can’t find the dataset anymore. I used 8 Romance languages and I coded about 15 binary characters based on known sound changes (e.g. prothetic e in Old Spanish and French, the rhotacism of intervocalic l in Balkan Romance, etc.). There was no visible improvement. Moreover, as I added supplementary conditions to each clade (e.g. we can consider languages such as French, Provençal, Spanish different enough 1000 years BP), the root was going further back in time. There were also sub-grouping problems and I think there are two main reasons for that: a) innovations may occur independently; b) if some innovations spread through a central group of languages, it doesn’t mean the languages on the periphery form a separate group.
            Fixing groups and dates by excessive calibration raises another question: if we know most of the groups and the dates, why would we need such a model? I am not impressed by “hey, our model correctly suggested the Italo-Celtic group”. Only if a model can identify and date all the known clades can I trust its claims about the origin of IE languages.

          • Asya Pereltsvaig

            Thanks for sharing these excellent points, Dragos. What I find fascinating is that in the earlier work, published a few years back in Nature, they used pretty much the same model (minus geography) but the tree looks subtly different. They used a different set of languages, for sure, but even for the same languages, there are some things that were wrong in the Nature paper and have been fixed in this latest one, and a few things that were right in that paper but now they are getting them wrong (and a few things that were wrong then and are still wrong now). So it’s not that the model improved much, from that point of view. Which indeed brings us to the question you asked at the end: if we already know what the tree should look like, why do we need their model at all? But here’s what I think: we do know quite well what certain parts of the tree should look like (e.g. Polish shouldn’t be part of the East Slavic cluster), so we can validate the model when (if!) it produces a tree that matches those benchmark points. Once the model can adequately match the things that we do know, we may assume that it produces a good result for things we don’t know or are not sure about (like the age and location of PIE), but only then!

          • Dragos

            I also agree their models did not improve much. In the recent papers they sought to answer some of the criticism, but they missed many important objections and they remained firm in their questionable assumptions.

            They may claim some topology errors have small effects in determining the age of the rest of the nodes. However, their failure to predict the rates of change is obvious across their trees. As you noted already, they claim Romani split off more than 3500 years ago and evolved slowly, whereas the opposite is true. In several other cases they suggest the literary languages changed faster than the vernaculars: Latin is represented by a thick line, the Romance languages by thin ones; Old Church Slavonic is thick, the contemporary Slavic languages/dialects are thin; Ancient Greek (the Attic-Ionic dialects?) is thick, the “other” Greek (dialects) is thin. And we’re supposed to believe Koine branched off Ancient Greek in 1400 BC!? Or conversely that Polish and Belarusian differentiated only 500 years ago, Portuguese and Spanish 500 years ago, Italian (I assume the Tuscan dialect) and Friulian 600 years ago, and so on? If anything, these examples show such methods are not capable of estimating dates.

          • Asya Pereltsvaig

            Yes, there are major errors in both the classification and the dating…

  • Jack

    Oh dear god please stop using jpg images. Are you aware that jpg is a terribly lossy format? Hence why the pictures look awful and blurry. Please use ‘png’s, like the rest of the internet.

    • Kevin Morton

      Like the rest of the Internet? Based on a sample of a million images from Google, around 90% of images on the Internet are JPG. While JPG is lossy, like you say, its compression is pretty darn efficient for many types of images and has contributed greatly to servers across the world not getting clogged up with millions of weighty image files.

      That said, 90% is probably a far larger share than JPG deserves. JPG is great for images that have photo-like colors (photographs, most artwork, some maps), but shouldn’t normally be used for images with a lot of text that needs to retain its sharpness.

      So you’re right, PNG should be used for the tree diagram images in this particular article (though part of the problem lies in resolution as well). In any case, I thought I would at least qualify your “like the rest of the internet” quip.

      Source for the 90% figure:

      • Asya Pereltsvaig

        Thanks, Kevin! Very good point in general; however, in this particular case the images I get from the Supplementary Materials PDF are already blurry and not very good in resolution, alas! Their file can be downloaded here (if anyone is interested):

        • Kevin Morton

          Indeed, if the original you are retrieving is not great quality, re-saving as a PNG is not going to do you much good.

          • Asya Pereltsvaig

            Another instance of the “garbage in garbage out” principle… :)

  • Mark

    Your linguistic classification is not correct.

    See Slavic languages:
