
Shared Innovations Are More Important Than Shared Retentions

October 12, 2012

One linguistic phenomenon that can cause numerous errors in constructing a phylogenetic tree using lexicostatistical methods is borrowing across languages, as discussed in earlier GeoCurrents posts. In this post, we focus on another factor that likewise leads to a misshapen tree: using shared retentions rather than shared innovations as testimony for intermediate nodes on the tree. As correctly pointed out by Jaakko Häkkinen in his response to Bouckaert et al.’s article, “such retentions cannot reliably testify for an intermediary proto-language, because the highest retention rate can sometimes be found in the opposite ends of a language family”. To put it differently, an innovation in one language can make its sister and its cousin on the true divergence tree look more similar and hence more closely related than they actually are, as schematized in the chart on the left.

Below, I will exemplify this phenomenon with four Slavic languages: Russian, Belarusian, Ukrainian, and Polish. The first three are typically said to form the East Slavic branch, while the last belongs to the West Slavic branch (the third branch, South Slavic, will not be discussed here). As I show below, Russian is in several respects the innovating language within the East Slavic group, which makes the other two East Slavic languages, Belarusian and Ukrainian, appear more closely related to Polish than they are to Russian, as schematized on the left. It should be remembered, however, that the labels “innovating” and “conservative” apply to these languages only with respect to the specific phenomena under discussion; in many other respects, Belarusian, Ukrainian, or Polish may be the “innovating languages” and Russian the more conservative tongue.

Let’s first consider the lexicon, as it is the domain that Bouckaert et al. rely on.* The table on the left lists the names of the twelve calendar months in the four languages. For ease of presentation, the cognates shared by Polish with Belarusian or Ukrainian (or both) are highlighted in different colors; the cognates shared by Polish with Russian (and in one instance, with Belarusian) are given in boldface. Based on these data alone, one would be fully justified in classifying Polish as most closely related to Belarusian and (somewhat less closely perhaps) to Ukrainian. Russian, by contrast, would be seen as a more distant cousin. As it so happens, this is exactly the tree that Bouckaert et al. provide in their paper (Supplementary Materials, Figure S1); the relevant detail of their tree is reproduced below:

However, this view results from an incorrect interpretation of the data. Rather than testifying to a closer link between Belarusian/Ukrainian and Polish than between them and Russian, these data reflect the fact that Russian adopted the month names of the Julian calendar, while the other three languages generally retained the original Slavic terms. (The Julian term for ‘May’ has intruded into the otherwise non-Julian systems of Belarusian and Polish, and Polish marzec ‘March’ is also Julian in origin.) As discussed by Sussex and Cubberley (2006: 476), the earlier Slavic names for months “show etymologies … reflecting various aspects of flora, fauna, climate and activity”. For example, the term for February derives from ‘bitter, fierce’, in reference to the typically cold weather of the month. The term for ‘July’ comes from ‘linden tree’; interestingly, Russian has the word lipa for ‘linden tree’ but does not preserve the month name based on it. Likewise, ‘September’ is the ‘heather’ month, while ‘November’ is the ‘leaf-falling’ month. The other month names that are not shared among the three languages—Belarusian, Ukrainian, and Polish—may come from different roots, but they too have weather- or activity-describing etymologies: for example, the name for ‘August’ in Ukrainian and Polish comes from the word for ‘sickle’ (cf. Russian serp ‘sickle’), while in Belarusian it derives from the root for ‘reaping’. Similarly, the names for ‘October’ in Belarusian and Polish derive from two different words for ‘flax’, while the Ukrainian term comes from the root for ‘yellow’. Crucially for our argument, the cognates shared by Belarusian, Ukrainian, and Polish are shared retentions, not shared innovations; lexicostatistical methods often take this sort of data—mistakenly!—to be evidence of especially close common descent.
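
To see concretely how such shared retentions mislead a purely lexical comparison, here is a minimal sketch (in Python) of the calculation that lexicostatistical approaches are ultimately built on: the proportion of meaning slots in which two languages fall into the same cognate class. The cognate-class codes below are invented stand-ins for the twelve month slots, not a careful coding of the real data; “J” marks a Julian borrowing, the other letters mark inherited Slavic roots.

from itertools import combinations

# language -> one cognate-class code per meaning slot (the twelve "months");
# the codes are illustrative only, not actual cognacy judgments
coding = {
    "Russian":    ["J"] * 12,   # Julian names throughout
    "Belarusian": ["A", "B", "C1", "D", "J", "F", "G", "H1", "I", "K1", "L", "M"],
    "Ukrainian":  ["A2", "B", "C", "D2", "E", "F", "G", "H", "I", "K2", "L", "M"],
    "Polish":     ["A", "B", "J", "D", "J", "F", "G", "H", "I", "K3", "L", "M"],
}

def shared(a, b):
    """Fraction of slots in which two languages show the same cognate class."""
    return sum(x == y for x, y in zip(coding[a], coding[b])) / len(coding[a])

for a, b in combinations(coding, 2):
    print(f"{a:>10} ~ {b:>10}: {shared(a, b):.2f}")

On figures like these, Polish comes out “closest” to Belarusian and Ukrainian and “farthest” from Russian, even though every one of the shared classes is a retention; a method that reads such percentages as evidence of subgrouping will draw exactly the wrong tree.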

Instead, the more typical—and much better supported—classification of the four languages in question groups Belarusian and Ukrainian together with Russian rather than with Polish (see chart on the left). Numerous phonological, morphological, and even syntactic phenomena support this classification (the interested reader is referred to Sussex & Cubberley 2006 for a more detailed discussion); here, I will mention only two phonological phenomena that unify the three East Slavic languages (Russian, Belarusian, and Ukrainian) in contrast to West Slavic languages such as Polish (and often in contrast to the South Slavic languages as well). The first such phenomenon is so-called pleophony. As a result of a complex series of changes, East Slavic languages ended up with the sequences -oro- and -olo- (in roots of words), whereas West Slavic languages have the corresponding -ro- and -lo-. Compare, for example, the Russian korova ‘cow’ and zoloto ‘gold’ to Polish krowa and złoto. Importantly, Ukrainian and Belarusian follow the Russian pleophony pattern: for example, ‘cow’ in Ukrainian is korova and in Belarusian karova; ‘gold’ in Ukrainian is zoloto and in Belarusian zolata (generally, Ukrainian does not reduce vowels the way Russian does, while Belarusian does reduce vowels and reflects the vowel reduction in spelling as well; as a result, Russian words are spelled like their Ukrainian counterparts, while their pronunciation is closer to their Belarusian counterparts).

Another phonological pattern that groups the three East Slavic languages in contrast to Polish (and other West Slavic languages) is the treatment of the nasal vowels inherited from Proto-Slavic: in the East Slavic languages these vowels have lost their nasal quality, whereas Polish has retained nasal vowels. The back nasal vowels, essentially the short and long nasal o-sounds, have been replaced in East Slavic by /u/, as in ruka ‘hand’ and zub ‘tooth’ (shared by all three East Slavic languages). In Polish, by contrast, these appear as the nasal vowels spelled ę and ą, marked in the orthography by hooks under the corresponding vowel letters, as in ręka (pronounced roughly /renka/) and ząb (pronounced roughly /zomp/). Similarly, the short and long nasal e-sounds have turned into /a/ in East Slavic, as in p’at’ ‘five’ and r’ad ‘row’ (subsequently, in Belarusian the “soft” r-sound has become “hard”, as in rad ‘row’). The corresponding forms in Polish again feature nasal vowels, as in pięć ‘five’ (pronounced roughly /pjenč/) and rząd ‘row’ (pronounced roughly /žont/). Once again, Belarusian and Ukrainian pattern with Russian rather than with Polish.

To return to vocabulary issues, the month names discussed above are not the only area where Russian deviates from Belarusian or Ukrainian. Generally speaking, Russian has borrowed more heavily than its East Slavic brethren from Finnic-speaking neighbors to the north, Turkic- and Iranian-speaking neighbors to the east and south, as well as from Western European languages. A significant proportion of its lexicon also consists of words borrowed from Old Church Slavonic (OCS). Despite the various movements in favor of the vernacular, these words, often originally belonging to the higher registers, remained in the language and can often be identified by their phonological characteristics, particularly where they exhibit combinations not found in native Russian words. Belarusian and Ukrainian, unlike Russian, have gone further towards adapting these words to native phonological patterns, to the extent that they have such words at all. Phonological traits that reveal the OCS origin of certain Russian words include the lack of pleophony (discussed above), as well as the sequence /ra/ instead of the more common East Slavic /ro/, both of which are exemplified by the Russian nagrada ‘reward’ vs. the Ukrainian nahoroda. Other OCS-isms in Russian show /žd/ and /šč/, as in odežda ‘clothes’ and osveščenie ‘illumination’ vs. the corresponding Ukrainian odeža and osvičennja. Another notable Old Church Slavonic feature in Russian is the use of the verbal prefix {iz-} in place of {vy-}, as in the OCS borrowing izgonjat’ ‘to banish’, contrasting both with the more colloquial, native Russian vygonjat’ ‘to chase out’ and with the Ukrainian vyhanjaty ‘to drive out, banish’.

These examples of lexical innovation in Russian, whether the borrowing of the Julian month names or the adoption of OCS lexis, exemplify another important drawback of lexicostatistical methods: the lexical level alone is not very reliable for determining language relatedness. As Häkkinen puts it, “a word could equally well be a later loanword than an inherited word”. The main reason for this unreliability of the lexical level is that “a sound change can be seen in numerous words, while words are single, separate units. A word appears, disappears or gets replaced independently from all other words, but sound change affects the whole vocabulary”. As scientists, we linguists look for systematic phenomena, hence our preference for grammatical (read: phonological, morphological, or syntactic) patterns over idiosyncratic words.

 

____________

*The pattern schematized in the diagrams above arises not only with respect to lexical innovations: phonological, morphological, and syntactic changes too may make a sister of an innovating language appear more similar to its cousin. For example, the application of the First Germanic Sound Shift (also known as Grimm’s Law) in Proto-Germanic makes Latin and Irish appear more closely related to Russian, Lithuanian, and Sanskrit than they are to Germanic languages such as English, Dutch, and Icelandic. Thus, Latin, Irish, Russian, Lithuanian, and Sanskrit all retain the PIE /d/, as in the words for ‘ten’: decem (Latin), deich (Irish), desjat’ (Russian), dešimt (Lithuanian), and daśan (Sanskrit), vs. the innovative /t/ in English ten, Dutch tien, and Icelandic tíu. The correct phylogenetic tree would have the Romance (Latin) and Celtic (Irish) languages grouped with Germanic rather than with Balto-Slavic (Russian, Lithuanian) or Indo-Iranian (Sanskrit). Similarly, the complete loss of the nominative-accusative distinction in English (except with pronouns) makes Germanic languages that retain this morphological distinction on nouns or articles, such as Icelandic (hattur vs. hatt ‘hat’) and German (der Tisch vs. den Tisch ‘the table’), appear more similar to each other than either is to English. Again, the correct phylogenetic tree has English more closely related to German than to Icelandic.

 

Sources:

Häkkinen, Jaakko (2012) “Problems in the method and interpretations of the computational phylogenetics based on linguistic data: An example of wishful thinking: Bouckaert et al. 2012”.

Sussex, Roland and Paul Cubberley (2006) The Slavic Languages (Cambridge Language Surveys). Cambridge University Press.

 


  • Tom D

    As I have mentioned before, it is very unfortunate that you all and Häkkinen assume the methods employed by Bouckaert et al. are simply a re-hash of lexicostatistics. The methods employed by Bouckaert et al. are perfectly capable of distinguishing shared innovations from shared retentions.

    While their tree is certainly flawed in some regards, you would need to do a lot more to prove that these methods cannot distinguish shared innovations from shared retentions, in my mind.

    First, you would need to look at the actual data used by Bouckaert et al., rather than data they did not use. While month names are a perfectly good example, they did not include month names in their analysis. Certainly we could guess that something similar might be contaminating their data, but without actual evidence, all we have is speculation. If we did find this to be the case in their actual data, it would certainly show that their data is problematic.

    Then you would need to show, using simulated data (where you know what is borrowed and what is not), that these methods in general cannot distinguish shared innovations from shared retentions. As I mentioned above, since we already know these methods can and do distinguish between shared innovations and shared retentions, this would really be impossible.

    Finally, a bit of a round-about comment. If these methods couldn’t distinguish shared innovations from shared retentions, why would biologists bother using them? In this regard, evolutionary biology faces the same issues as historical linguistics–both subgroup taxa (in the former, species; in the latter, varieties of languages) only on the basis of shared innovations. I should note that biologists certainly still do use, with proper caution, methods similar to lexicostatistics, like UPGMA, but such methods are most certainly not what is employed here.

    • Jaska

      Excellent writing again from Asya!

      Tom, you didn’t understand: the Swadesh lists they use (this is what makes the method lexicostatistical) are already based on retentions, not innovations! The Swadesh lists and similar basic word lists contain the words that should be the most stable, which means that they are mostly retentions; only in some branches are some meanings innovations.

      If you read the critique thoroughly and many times, you will see the evidence for
      1. the innovations being more reliable than retentions for tracing the taxonomy (true divergence): retentions cannot PROVE the divergence, because shared retentions may also be caused by mere conservativeness;
      2. the phonological level being more reliable than the lexical level for tracing the taxonomy (true divergence) – remember Samoyed?

      • Tom D

        Using Swadesh lists or other lists of “basic” vocabulary certainly does not make a method lexicostatistics. It would be perfectly fine to use a list of “basic” vocabulary to start off doing a reconstruction of two language families using the comparative method. In fact, this is what Lyle Campbell recommends in his book Historical Linguistics.

        Lexicostatistics, as defined by Swadesh and others, is a method that involves the following: once you’ve collected a list of basic vocabulary, determine what’s cognate with what. From there, for each pair of languages in your data, convert the number of shared cognates into a percentage. That percentage will show you how similar a pair of languages ought to be.

        Another thing to clear up is what exactly glottochronology is. Many people use “lexicostatistics” and “glottochronology” interchangeably. However, glottochronology adds an extra layer on top of this: Swadesh developed an equation to take the percentage of shared vocabulary between a pair of languages and convert it into an actual time depth. In theory, this would let us date the divergence of the various parts of a language family.
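
        To make that concrete: Swadesh’s classic formula is t = log(c) / (2 log(r)), where t is the separation time in millennia, c the proportion of shared cognates on the list, and r the assumed retention rate per millennium. A minimal sketch (in Python, assuming the traditional figure of roughly 0.86 for the 100-item list) might look like this:

import math

def glottochronology_age(shared_fraction, retention_rate=0.86):
    """Swadesh's t = log(c) / (2 * log(r)), returning a time depth in millennia."""
    return math.log(shared_fraction) / (2 * math.log(retention_rate))

# e.g. two languages sharing 70% of a basic-vocabulary list
print(round(glottochronology_age(0.70), 2))   # roughly 1.18 millennia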

        Of course, both of these methods are flawed for a variety of reasons. One of the more fundamental ones is that glottochronology, as originally proposed, assumes a constant, average rate of lexical replacement. Bob Blust wrote an article in 2000 showing that the rates of lexical replacement, even in just basic vocabulary, differ so much across languages that this assumed average rate obscures and even screws up any attempt at dating.

        The fundamental problem with these methods was noticed in biology, but, as far as I’m aware, has never been noticed in linguistics: in converting real data (cognate sets) into distance measures (percentages of shared cognates), you lose data. Not in the way that they cannot distinguish shared innovations from shared retentions (and they cannot), but that by not reconstructing changes of state at levels beyond just pairs of languages, you actually lose information (in the information theory sense). This means these methods will never be as accurate or as reliable as the comparative method or other sorts of computational methods which utilize the comparative method.

        These methods belong to a more general family called distance-based methods. I mentioned another one above, UPGMA, which is used in biology and is very similar to glottochronology. All of them share these flaws. Biologists, it seems, can tolerate some errors in the final products and are willing to use these methods where appropriate. Historical linguists, or so it seems, cannot tolerate any errors and will throw out anything that produces them.
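
        For readers unfamiliar with UPGMA, a bare-bones sketch of its clustering step is given below; the distance matrix is invented to echo the retention-biased month-name example from the post, not taken from any real lexicostatistical study.

from itertools import combinations

def upgma(labels, dist):
    """Repeatedly join the two closest clusters; the distance from a merged
    cluster to any other cluster is the size-weighted average of the
    distances of its two parts."""
    d = dict(dist)
    size = {lab: 1 for lab in labels}        # cluster -> number of leaves
    tree = {lab: lab for lab in labels}      # cluster -> nested-tuple subtree
    while len(size) > 1:
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        na, nb = size.pop(a), size.pop(b)
        new = f"({a}+{b})"
        for c in size:
            d[frozenset((new, c))] = (na * d[frozenset((a, c))]
                                      + nb * d[frozenset((b, c))]) / (na + nb)
        size[new] = na + nb
        tree[new] = (tree.pop(a), tree.pop(b))
    return tree.popitem()[1]

# invented "1 minus shared-cognate proportion" distances for the four languages
d = {
    frozenset(("Russian", "Belarusian")):   0.92,
    frozenset(("Russian", "Ukrainian")):    1.00,
    frozenset(("Russian", "Polish")):       0.83,
    frozenset(("Belarusian", "Ukrainian")): 0.50,
    frozenset(("Belarusian", "Polish")):    0.25,
    frozenset(("Ukrainian", "Polish")):     0.42,
}
print(upgma(["Russian", "Belarusian", "Ukrainian", "Polish"], d))
# -> ('Russian', ('Ukrainian', ('Belarusian', 'Polish'))):
# Polish ends up nested inside "East Slavic" and Russian is pushed out,
# the same kind of misleading tree discussed in the post.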

        I’ve described how the methods used by Bouckaert et al. differ elsewhere, but they differ fundamentally. They do not look at how many cognates a language shares. They actually reconstruct the history of cognates along a tree, similar to how the comparative method would reconstruct histories of changes.

        Also, I never said that Asya’s or your general critique was wrong. It’s spot on. Any method that seeks to trace an evolutionary history, be it in linguistics, biology, anthropology, etc. must take the difference between shared innovations and retentions into account.

        What I am saying is that this part of the critique–a critique of the specific methods employed by Bouckaert et al.–is wrong because it rests on a fundamental misunderstanding of their methods. And I said that was “unfortunate”, without any explanation. I’ll do so here. I think it is unfortunate because it undermines a lot of legitimate criticisms that both of you all bring up. I think it is fine to not understand or fully understand the methods employed by different authors. Nobody can know everything. The issue is when someone tries to overstep the limits of what they know.

        • Tom D

          Oops. Meant to include a link to my comments elsewhere on the site: they can be found here.

          • Asya Pereltsvaig

            Thanks!

        • Asya Pereltsvaig

          You are right in that both lexicostatistics (and the glorified version employed by Bouckaert et al.) and the comparative method may start with lists of basic vocabulary. But what the two approaches do with such lists and what else they examine is very different. The comparative method immediately taps into the systematic part of language, while lexicostatistics forever remains on the shaky ground of the lexicon. And why do you think lexicostatistics people go back and forth on the reliability of this or that vocabulary list? You refer to Blust 2000, but are you familiar with Starostin’s later work proving (or attempting to prove, depending on whether you buy the conclusions or not) that Swadesh lists are reliable after all, if certain additional factors are taken into consideration? Atkinson himself seems undecided on which list is best to use: he argued in an earlier paper that the Swadesh 100-word list is more reliable, but reverted to the 200-word list in this work.

          As for your point that some people may “overstep the limits of what they know”, I am afraid it’s misaddressed: in this case, it is Bouckaert et al. that seem to overstep the limits of what they know. They are either oblivious to the fact that they’ve essentially un-validated their own model (see my comment above), or they don’t seem to care about any EGREGIOUS errors their model may produce.

          You claim that “historical linguists… cannot tolerate any errors”. But as a scientist, what I cannot tolerate is a complete disregard for the scientific process when it comes to the study of language: one either has to adopt the established knowledge or to prove explicitly why this or that point is false. Before I can take seriously any aspect of Bouckaert et al.’s theorizing about IE origins, I’d like them to explain to me why everything we know about the development of (East) Slavic languages is wrong. Or why their model spits out such obvious errors. Or what facts they would admit as conclusively disproving their model. Because so far, the only people who are “throwing out anything” that inconveniences them are the Bouckaert team and their defenders.

          • Tom D

            I intend to get back to this, but I think we’re beginning to get into a fundamental difference in our approach to language change. You talked about it elsewhere as well, and I’m still mulling over a reply.

            As Labov (2007) put it, “transmission is the fundamental mechanism by which linguistic diversity is created and maintained”. If this is indeed the case, then we should expect vocabulary to be transmitted just as sound changes or anything else are. The lexicon is not just an idiosyncratic list of words that may or may not reflect the evolutionary history of a language; it must reflect that history. It may not be the same evolutionary history as the one traced by sound changes, but we shouldn’t expect it to be. It may tell us things sound changes can’t (the specifics of language contact, etc.), but so could grammatical innovations. It can be incomplete or faulty, but so can our understanding of the phonology of a language or its grammar. The lexicon is certainly not a fundamentally useless thing when tracing language evolution.

            And in the same vein, as I mentioned above in my one reply to Jaska, lexical items, sound changes, grammaticalizations, etc., all face the same issues in terms of tracing linguistic evolution. We must distinguish shared innovations from shared retentions. It may be easier to do with sound changes, but there are further complicating factors. For instance, things can evolve in parallel and end up looking like they really are shared innovations. Also, things can evolve and then revert back to a previous state. I have no examples off-hand, but I do know there is some extensive work disproving the claim that grammaticalizations are irreversible (Janda 2001, Campbell 2001, etc.).

            And as for the number of words used, in more general simulation studies of these sorts of methods, the authors found that the more data you had, the better the analysis. The issue with getting outside of so-called “basic” vocabulary is that the rate of borrowing goes way up, and as the study I linked to here pointed out, when you get above a rate of borrowing of around 20-30%, even the best-performing algorithms stop giving accurate results.

          • Asya Pereltsvaig

            That words reflect the history of the language does not mean that the lexicon is systematic or that it is a good tool for establishing divergence trees…

      • Asya Pereltsvaig

        Jaska, thank you for your comment! Indeed, Bouckaert et al.’s methods may be “glorified lexicostatistics” but by the sheer fact of relying on word lists without considering sound changes and other systematic aspects of language they inherit those problems that you’ve mentioned.

    • Asya Pereltsvaig

      Tom, I never said that Bouckaert et al. use the month name data — they don’t, and I purposefully selected it because it is exactly the kind of “simulated data”, as you call it, “where [I] know what is borrowed and what is not”. So in effect, I am doing in the post the sort of thing that you suggest I do. I don’t do it using their computational methods, because the conclusions from this data set are far too obvious — if their computational algorithm is fed these data (only!) and doesn’t spit out the closer link of B/U to Polish than to Russian, it’s not a good algorithm.

      You are right in that we don’t know for sure without rerunning their algorithms why they spit out such egregious errors as the one concerning Slavic languages (and as a Slavicist I find their “results” laughable!). I simply suggested two possible explanations: not distinguishing borrowings vs. true cognates, and not distinguishing retentions vs. innovations. Both times, you claimed that their methods are sophisticated enough to do this — then why don’t they?

      This brings us back to the issue of validation of the model. They attempt to do so with Proto-Romance. However, there are numerous aspects of their results that achieve exactly the opposite. I’ve identified several such egregious errors (Polish, Romani, several issues in the Romance branch, etc.). Thus, Bouckaert et al. fail to show that their model can replicate the well-established answers to already solved problems (and they do not even attempt to provide substantive evidence to show that these well-established answers are incorrect!). Without such replication/validation, the model is essentially worthless. This is something that we discussed in the comment section to the first post in this series, and it is an issue that keeps coming back.

      What I am curious about is why their model produces such results in the first place. Now, I might not understand the whole methodology in detail, but I see two possible reasons: if these are not the correct reasons, it is their job (not mine) to identify the glitches in their model that result in such gaffes.

      And as for your roundabout comment, I don’t feel like worrying about problems in the biological sciences, as there seems to be a surplus of biologists who could do that (instead of trying to make grand discoveries in a field they seem to know too little about). I have my hands full worrying about my discipline instead.

      • Tom D

        Another brief reply with intent to expand:

        I see now what you were doing with the month names, but as I said above, it would be thoroughly convincing to see this applied to the actual data they used, rather than just hypotheticals that even the authors themselves are well aware of.

        Also, you need not re-run their analysis to see why it gives us such an incorrect tree. All you need to do is go through the data and figure out where Bouckaert et al. were in error. They provide you with all of their data and all of their sources, so while this will be much more time-consuming than critically analyzing their map, it is “easy” to do. However, as I intend to expand upon, I really don’t think it will be easy at all, because we’re talking about borrowings that wouldn’t be obvious to “real” Indo-Europeanists.

        • Asya Pereltsvaig

          “rather than just hypotheticals that even the authors themselves are well aware of” — I don’t live in their minds, so I have no idea what they are or aren’t aware of. They do seem oblivious (or wilfully blind) to many problems inherent in their work…

          “you need not re-run their analysis to see why it gives us such an incorrect tree” — actually I’ve identified mistakes, and it’s their job, not mine, to fix them. However, I strongly suspect that many of the errors are not fixable within their methodology, as they are inherent to it because of the sort of data they require, i.e. Swadesh lists that they’ve used, which I have examined, by the way. As we’ve already pointed out several times throughout this discussion, Swadesh lists (1) involve words, not grammatical patterns, and (2) do not distinguish borrowings, including those among IE languages, from true cognates. Both of these problems lead to the errors which I’ve identified, such as the Romani problem or the Polish problem. I don’t need to go through the steps of replicating their errors in order to identify them or understand where they stem from.

          “we’re talking about borrowings that wouldn’t be obvious to “real” Indo-Europeanists” — actually some of these borrowings are obvious enough to my undergraduate students with no background in linguistics, but the key is taking sound changes into account. After the current cohort has submitted this homework, I might put up a post on this.

  • Jaska

    Tom:

    “Using Swadesh lists or other lists of “basic” vocabulary certainly does not make a method lexicostatistics. It would be perfectly fine to use a list of “basic” vocabulary to start off doing a reconstruction of two language families using the comparative method. In fact, this is what Lyle Campbell recommends in his book Historical Linguistics.

    I’ve described how the methods used by Bouckaert et al. differ elsewhere, but they differ fundamentally. They do not look at how many cognates a language shares. They actually reconstruct the history of cognates along a tree, similar to how the comparative method would reconstruct histories of changes.”

    Your link was very clarifying and accurate, thank you. But as I see it, it is only a different way to observe the very same data. The most parsimonious Bayesian tree probably very rarely differs from the tree based on the numeric value of cognate sharing. Or can you show any Bayesian trees which differ from a tree based on the sharing matrix of the very same lexical data? I see no essential difference between their method and “less computationally advanced” lexicostatistics: the data is the same, so the problems concerning the data are the same.

    Tom:

    “These methods belong to a more general family called distance-based methods. I mentioned another one above, UPGMA, which is used in biology and is very similar to glottochronology. All of them share these flaws. Biologists, it seems, can tolerate some errors in the final products and are willing to use these methods where appropriate. Historical linguists, or so it seems, cannot tolerate any errors and will throw out anything that produces them.”

    Maybe the difference is that biologists have no qualitative methods comparable to those of historical linguistics in their repertoire with which to verify or falsify the computational methods? Therefore they are more prone to accept the results.

    Tom:

    “Also, I never said that Asya’s or your general critique was wrong. It’s spot on. Any method that seeks to trace an evolutionary history, be it in linguistics, biology, anthropology, etc. must take the difference between shared innovations and retentions into account.

    What I am saying is that this part of the critique–a critique of the specific methods employed by Bouckaert et al.–is wrong because it rests on a fundamental misunderstanding of their methods. And I said that was “unfortunate”, without any explanation. I’ll do so here. I think it is unfortunate because it undermines a lot of legitimate criticisms that both of you all bring up. I think it is fine to not understand or fully understand the methods employed by different authors. Nobody can know everything. The issue is when someone tries to overstep the limits of what they know.”

    I cannot see how complete mathematical understanding of the method could even be relevant, when my critique is directed at the data itself. No matter what the method, the data itself is based on lexical retentions. No matter what the method, the conclusions and trees drawn from such data cannot avoid the possible sources of error (conservativeness vs. innovativeness; false divergence vs. invisible convergence).

    • Asya Pereltsvaig

      Let me just add that the problem is not only with the data, but with the fact that the method requires this sort of data.

    • Tom D

      Some brief replies with intent to expand later:

      Compare Dyen’s lexicostatistical tree with Greenhill’s Bayesian ML tree. The data is almost the same, but not quite, as Dyen passed away and most of his data is now lost; both, however, are based on a 200-word Swadesh list. Greenhill’s tree more or less lines up with the traditional linguistic analyses of Austronesian, while Dyen’s doesn’t.

      Biologists certainly do have qualitative methods, but they’ve taken a bit of a back seat after the discovery of DNA and the maturation of computational phylogenetics. Also, I would certainly hope biologists don’t accept results blindly. My point there was that biologists would not use these methods if they couldn’t distinguish shared innovations from shared retentions.

      Finally, the data itself cannot be “based on lexical retentions”. It’s just a list of words. If it were the case that the data itself were based on retentions, then we would be perfectly justified in saying that a list of sound changes used to subgroup languages would more or less be a list of phonological retentions. Any and all kinds of data used to infer trees suffer from the issues you mention–we still must determine which sound changes are innovative, which are simply parallel evolutions, etc. Sound changes aren’t borrowed frequently, but it is within the realm of possibility. Here I’m thinking of Fataluku, a so-called Papuan language I’ve done a bit of fieldwork on. It is overwhelmingly disyllabic, but there are hints that this is “contamination” from long term contact with Austronesian languages. Not a phonological change like *p > /f/, but getting there.

      • Jaska

        Thank you for the links. It is difficult to compare the trees, because one has so few and the other so many languages, and the names for the branches seem to be different…

        Tom:

        “Finally, the data itself cannot be “based on lexical retentions”. It’s just a list of words. If it were the case that the data itself were based on retentions, then we would be perfectly justified in saying that a list of sound changes used to subgroup languages would more or less be a list of phonological retentions. Any and all kinds of data used to infer trees suffer from the issues you mention–we still must determine which sound changes are innovative, which are simply parallel evolutions, etc. Sound changes aren’t borrowed frequently, but it is within the realm of possibility.”

        I think there is some misunderstanding concerning the concepts here, because I see no sense in your claim that the Swadesh lists are not based on lexical retentions. And sound changes are innovations, not retentions; we can confirm them with the help of loanwords.

        Lexical retentions = words inherited from the common protolanguage. The very nature of the Swadeshian word lists is that they aim to contain the most stable words: words which have been preserved in as many branches as possible. A word which has been preserved in all branches is a 100 % lexical retention; a word which has been replaced in 1/10 of the languages is a 90 % lexical retention. You can see that the basic words are retentions with a very high percentage: http://en.wikipedia.org/wiki/Indo-European_vocabulary

        So, are we now talking about the same topic with the same concepts?

        To find out the true divergence, these kinds of retention words are not very diagnostic, because they can have different interpretations. It MAY BE that the tree gained with such retentions is the true divergence tree, but it MAY equally well be that the tree gained with the retentions only tells us about the conservativeness of the branches (see pages 6–7: http://www.mv.helsinki.fi/home/jphakkin/Problems_of_phylogenetics.pdf).

        The only way to reliably find out the true divergence would be to count the innovations shared by different branches – the words which ARE NOT INHERITED from the protolanguage but have appeared later. And still, lexical innovations are not accurate enough, because we cannot distinguish loanwords from words inherited from an intermediary protolanguage (connecting only two or three branches). The phonological level is the only level which can reliably tell the true divergence – the phonological innovations (sound changes), to be precise.

        So, not all kinds of data suffer from these problems: only data based on lexical retentions. Read my link above, all the arguments are there (and in my forthcoming article).

        • Asya Pereltsvaig

          Thank you for your comment! You beat me to it…

        • Tom D

          A few more quick replies:

          I should have explained Dyen’s tree a bit more. His methods posit 40+ primary branches of Austronesian, in very strong conflict with basically all other linguistic reconstructions and the widely accepted Out-of-Taiwan model of Austronesian linguistic prehistory. So it would of course be the case that some of the names are not the same–his tree and his ideas about the diversification of Austronesian (an Out-of-Melanesia model) were so different that they had to be. That large difference really was the important bit there, as you asked for a lexicostatistical reconstruction that completely conflicted with a Bayesian ML reconstruction–what Bouckaert et al. use. I felt I provided one.

          Note how Greenhill’s tree built by Bayesian ML more or less lines up with Blust’s tree, though there are differences here and there, unlike Dyen’s tree which is completely different.

          In terms of data, no, I don’t think we are talking about the same thing any more. I certainly see what you’re saying now about the selection bias inherent in Swadesh lists. I think you’ve spent quite a bit of time telling me something I thought I was trying to tell you, in part, as well, though, as I was pretty sure I acknowledged several times that borrowing, parallel evolutions, etc. can screw up the analysis. Perhaps I’m overly optimistic when it comes to using all available sources of data. And, I guess, my defense of the methods in general can make it look like I think the conclusions of Bouckaert et al. are right, but this is not the case.

          But one thing I would like to point out is that we need to be careful now to separate the sorts of more general models of evolutionary change used (distance methods, Bayesian ML methods, etc.) from the data used to infer an evolutionary history (word lists, sound changes, etc.).

          Although I said elsewhere it can be difficult to use sound changes as the data for these newer, non-lexicostatistical computational analyses–and maybe others took “difficult” to mean “impossible”–you can certainly use them. One article I would point to, which uses a mixed set of lexical, phonological, and grammatical innovations carefully selected to weed out all known borrowings, is Pellard (2009), specifically chapter 9. For instance, the first character he uses is the irregular change of *b > g in the word ‘yawn’, a phonological innovation in all of the dialects of Miyako, but nowhere else in the rest of Japonic (cf. Oogami Miyako ɑfks, Hirara Miyako afukɿ, Tokyo Japanese akɯbi; the /f/ : /k/ correspondence here is regular).

          This sort of analysis seems to me to be much more on track with what you all have suggested would be appropriate data for these kinds of analyses.
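
          As a toy illustration of that kind of character coding (only the first character below mirrors the *b > g ‘yawn’ example; the other two are invented for the sake of the sketch, and none of this is Pellard’s actual dataset), one might tabulate innovations as binary characters and read off the group that each shared innovation supports:

# 1 = the variety shows the innovation, 0 = it does not
characters = [
    "irregular *b > g in 'yawn'",     # shared by the Miyako dialects only
    "invented lexical replacement",   # placeholder character
    "invented grammatical change",    # placeholder character
]
matrix = {
    "Oogami Miyako":  [1, 1, 0],
    "Hirara Miyako":  [1, 1, 1],
    "Tokyo Japanese": [0, 0, 0],
}

# each character supports the subgroup of varieties that share the innovation
for name, column in zip(characters, zip(*matrix.values())):
    group = sorted(lang for lang, state in zip(matrix, column) if state)
    print(f"{name}: {group}")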

          • Jaska

            OK, now you know what I was talking about, but I’m not yet sure what you were talking about. :) If you have time, I would be interested to understand the topic of your points, too.

            Yes, I see the trees are different, but it is difficult to assess the data and the methods without the original publications. I accept that there are contradicting results, but how much of that difference is produced by the data, the method, and the subjective interpretations of a scholar?

            I mean, producing 40 equal branches is not possible if one follows strictly the results of the lexicostatistic analysis: surely it is practically impossible that all the retention values in the Swadesh lists of these 40 branches are exactly the same (e.g. 85/100)! So Dyen must have decided on some kind of margin of error, thus refusing to stratify the branches which have retention rates close to each other (corresponding to a low posterior probability between branches).

  • Dragos

    Nice article. Indeed, a “word could equally well be a later loanword than an inherited word”. Bouckaert et al. have many badly coded cognate sets, inherited from Dyen et al.
    Here’s the database of Dyen et al. (1997): http://www.wordgumbo.com/ie/cmp/
    Here’s their database: http://ielex.mpi.nl/
    For example, they show the Romanian word “animal” in the same cognate class as the words from Latin, Catalan, Spanish, French, etc., but they are wrong: this is a late borrowing from French. As for “belly”, Romanian has inherited both “pîntece” and “vintre”, so it has two cognate sets in common with other Romance languages.

    • Asya Pereltsvaig

      Thanks, Dragos! When it comes to multiple cognate sets for the same meaning, I am not 100% sure, but I think they don’t allow that in their coding. If I understand correctly, they code words as either cognate (to the presumed IE form) or not cognate…

      Now, the last point you make that they use “cognate” as a synonym for “look alike” is something that has bothered me all along. They “define” cognates in the paper as “homologous words”, not stating that the common derivation is key. But they do state elsewhere that they exclude known cases of borrowing, like English mountain from French montaigne. But I don’t think they’ve excluded enough borrowings. I am making this point forcefully in our talk tomorrow. Stay tuned for the video recording (I am assuming that you are not local so can’t make it to the actual talk?)

      • Dragos

        They allow multiple cognate sets for the same meaning.

        I agree with you: they did not exclude all the known cases of borrowing. This stems from the indiscriminate use of wordlists, from the unwarranted assumptions that these words are resistant to borrowing and that similar words are cognates, but also from their inadequate bibliography. They keep working on their database; however, I think their progress is slow.

        I’m waiting for your video!

        • Asya Pereltsvaig

          It is my understanding that they do NOT allow multiple cognate sets for the same meaning, but the paper is far from being well-written, so if you can find me a quote where it says that they do, I would very much appreciate it!

          The video is coming soon, we hope!

          • Dragos

            I don’t know quotes, but here’s an example from their online cognacy database:
            http://ielex.mpi.nl/cognate/713/ http://ielex.mpi.nl/cognate/2727/

          • Asya Pereltsvaig

            Thanks for the links, Dragos. But how do you know that this is the database they’ve used? Does it state so in the paper somewhere? I am not seeing it…

          • Dragos
          • Asya Pereltsvaig

            Thanks for the link, Dragos! We’ll be crawling all over it in the next couple of weeks, to be sure!

      • Dragos

        Check this quote by Ernst Pulgram (The Tongues of Italy, 1958, p. 147): “words cognate with French bière, tabac, café are common Romanic, evoking a picture of Caesar’s soldiers guzzling beer and smoking cigars in sidewalk cafés”. Atkinson et al. are in good company!

        • Asya Pereltsvaig

          Well said!