Linguistic Phylogenies Are Not the Same as Biological Phylogenies

(Note: This post is jointly written by Martin Lewis and Asya Pereltsvaig)

A key assumption of Bouckaert et al. is that the diversification and spread of languages operates so similarly to the diversification and spread of biological organism that the two processes can successfully be modeled in the same manner. The parallels between organic and linguistic evolution are indeed pronounced. Both processes entail replicating codes that continually change, giving rise to novel varieties that increasingly differ from their progenitors over time. As a result, “phylogenetic trees,” showing descent from common ancestors, are a common feature of both evolutionary biology and linguistics.

But despite their similarities, organic evolution and linguistic evolution are in many ways highly dissimilar. Encoding information for communication is not the same as encoding information that generates life: language is vastly more fluid and complex than the genetic code; individual languages are much less clearly differentiated from each other than are species; and language is a social phenomenon, given to influences largely irrelevant for biological evolution. The key differences can be summarized as follows: biological evolution is unconstrained but governed by natural selection (any mutation can happen, but which mutations remain in the pool depends in large part on natural selection), whereas linguistic variation (seen in terms of deep grammatical properties) is constrained by a system of parameters but is not subject to natural selection. As a result, the branching trees of linguistic descent are merely analogous to the phylogenetic diagrams of biological evolution, and do not indicate the same kind of relationships.

Although organic evolution operates through a much more restricted set of message-carrying units than does human language, it nonetheless produces diversity at a much deeper level. Given the biological constraints of the human brain/mind (as of yet less than fully understood), there are only so many ways in which any given language can be structured. To be sure, the number of possible human languages, both extant and extinct, as well we those that may arise in the future, is vast, but all human languages appear to be “variation on a theme,” guided by the same parameters. Some languages have as few as two vowels (Ubykh, Northwest Caucasian) and others as few as six consonants (Rotokas, North Bougainville); other languages may have as many as 20 vowels (e.g. the Taa language, spoken in Botswana and Namibia, is reported by some sources to have as many as 20 or even 30 vowels, depending on analysis) and as many as 84 consonants (as in Ubykh; the Taa language is reported to have 87 consonants under one analysis, 164 under another). But crucially, all languages differentiate vowels from consonants and use both. Some languages put verbs before subjects and objects, while others place them at the ends of sentences, but all languages have verbs, subjects and objects.* Some languages can build sentence-long words packed with of numerous prefixes, infixes, or suffixes, while others use stand-alone, stripped-down words to do the grammatical work of expressing tense, number etc., but all languages make words from morphemes—and all construct sentences. As a result of this limited space of possibilities, completely unrelated languages evolving on their own often come to share major grammatical traits.

Linguistic evolution, unlike that of the biological realm, moves at a rapid clip. In non-literate societies, words change so quickly that after some five to eight thousand years not enough cognates can be traced back to establish linguistic relatedness. In the same time span, grammatical structures can undergo wholesale transformations, and sound inventories can change drastically as well. As a result, even clearly related languages can have next to nothing in common with each other, and can only be linked through investigations into their ancestors. Hindi and English, two of the three most widely spoken Indo-European languages, are dissimilar in almost every respect.** On casual inspection, Hindi would seem to have more in common with the non-Indo-European languages of the Indian sub-continent than it does with English.

Thus, relatedness at the family level and overall linguistic similarity often fail to correspond. Maps showing major language patterns typically bear little if any resemblance to maps depicting linguistic families. Even something as seemingly basic as word order correlates poorly with lines of descent. For example, Indo-European languages can be SVO (subject-verb-object; marked by red dots on the map to the left), such as English, Romance, and most Slavic languages (but Sorbian, a Slavic language, is SOV); SOV (marked by blue dots), such as the Indo-Iranian languages (yet Kashmiri is SVO); or VSO (marked by yellow dots), such as the Insular Celtic languages (yet Cornish is SVO). Some other families, such as Austronesian, have an even greater variability in the basic word order:  Niuean is VSO, Malagasy is VOS, Rotuman is SVO, and Tuvaluan is OVS.

Similarly, features of morphological typology (how words are formed from morphemes) often cross-cut connections established by common descent. Whereas Proto-Indo-European, like most of its daughters, was a synthetic language (building words from multiple non-root morphemes), English and Afrikaans are relatively analytical (with low ratios of morphemes to words), which gives them a certain affinity with Mandarin Chinese (a highly analytical language). As discussed in an earlier GeoCurrents post, isolating languages are found in Africa (Hausa, an Afroasiatic language), Asia (Vietnamese, Austroasiatic), Oceania (Rapanui, Austronesian), and the Americas (Kipea, Kiriri). In phonology as well, similar patterns obtain, as sound inventories often fail to show systematic correspondences with language families. The Indo-European languages of South Asia, for example, are in many respects more phonologically similar to the Dravidian languages of the same region than they are to most other IE language. One of the characteristic phonological markers of the region, the rich inventory of retroflex consonants, is also scattered across the rest of the world, found in about 20 percent of all languages belonging to a wide variety of families.

One of the best ways to appreciate the relative insignificance of language families in regard to the global distribution of such features is to explore the maps that can be generated on the WALS website, such as the one reproduced above. Few if any of these maps bear much resemblance to the familiar depiction of the world’s major language families.

Again, the contrast with biological evolution is stark. The farther removed organisms are from each other on the tree of life, the fewer genes they necessarily share. Even when convergent evolution results in similarities between distantly related organisms, the parallels are relatively superficial. As a result, modern genetic inquiry can establish precise levels of biological relatedness, a process that has revolutionized taxonomy over the past few decades. In the biological realm, moreover, the farther one moves up different branches of evolutionary descent, the more distinctive the organisms found along it generally become. Chordates (the phylum that includes vertebrates) share a distant common ancestor with echinoderms (sea stars and their relatives), and some tunicates, primitive members of phylum Chordata, might be mistaken by unschooled observers for sea lilies in phylum Echinodermata. (Tunicates more generally look like unrelated jellyfish and other cnidarians; a few could be mistaken for rocks, but such rocks disconcertingly bleed when cut open.) But no one would ever mistake any mammal with a sand dollar, a sea cucumber, or any other echinoderm, animals characterized by radial rather than bilateral symmetry. The two phyla have simply evolved in strikingly different directions. If linguistic evolution worked in the same manner, it is questionable whether translation between distant languages would even be possible. Moreover, the disparate patterns of spatial distribution of deep grammatical properties, such as the ones illustrated by the WALS maps, would not be found.

In language, deep grammatical properties can radically change, often taking on the same forms as those encountered in wholly unrelated tongues. As a result, linguistic relationships are often anything but obvious, and can only be discerned though intensive study; significantly, such hidden connections can hold true even for relatively recently emerged languages. A fluent speaker of the major Germanic languages, for example, might be nonplused to learn that Frisian is more closely related to English than it is to Dutch. Yet according to some specialists, even Low German is “phylogenetically” closer to English than it is to (High) German—even though Low German is generally regarded as a mere dialect (or group of dialects) of German!

Linguistic evolution is only vaguely analogous to organic evolution for a variety of reasons, but a crucial factor is the fact that vastly less sharing occurs across biological lineages. We now know that genes can jump from one species to another, but the process is relatively rare; in this realm, change generally occurs as a result of random mutations acted upon by natural selection, not from the borrowing of elements from other species. When it comes to languages, however, sharing is ubiquitous. Languages are almost always borrowing words, and sometimes they adopt grammatical properties of other languages as well. At times, two completely unrelated languages essentially merge to create a hybrid tongue. To be sure, linguists are almost always able to determine which language contributed more elements and more basic structures, and hence should count as the parent tongue. (It should be noted that the use of the terms “parent” and “daughter” in relation to languages is misleading since, unlike in the biological realm, where individual organisms are discrete, the transition from “parent” to “daughter” language is always gradual.) When it comes to creole languages, however, such determinations are not always easy. In regard to grammar, different creoles of completely different parentage are often more similar to each other than they are to any of their source languages. In some instances of mixed languages, admixtures of vocabulary, grammar, and phonology run so deep that linguists abandon the quest for unambiguous classification. Cappadocian Greek, for example, is slotted by the Wikipedia into the seemingly impossible “Greek-Turkish” language family. Does Indo-European therefore encompass this language? Other sources, such as the Ethnologue, place this language in the Greek branch of the Indo-European family, but Turkish influences on Cappadocian Greek are pronounced: it has certain sounds that have been borrowed from Turkish, as well as vowel harmony; it has developed agglutinative inflectional morphology and lost (some) grammatical gender distinctions; and its basic word order is SOV. And Cappadocian Greek is by no means the only example of such a thoroughly “mixed language.” In the biological realm, in contrast, such mixtures are so obviously impossible that they have generated their own nonsense genre, as exemplified by Sara Ball’s delightful flip-book, Crocguphant.

Linguistic family trees must therefore be taken as often showing lines of partial descent, unlike the phylogenetic diagrams of organic evolution. To gain a more complete understanding of linguistic relatedness, it is necessary to complement language families with other kinds of connections. The various languages of a Sprachbund, or a linguistic convergence area, for example, derive from different families, yet nonetheless come to share many features through long histories of mutual interaction. One must also consider linguistic strata, which take into account the influences imposed by one language on another. The role of a linguistic substratum, derived from a previously existing language that was later supplanted by another tongue, can be profound. In many cases, such linguistic substrates were instrumental in generating subfamilies; the Germanic languages, for example, are distinct from other Indo-European languages not merely because they drifted in their own particular direction, but also because that acquired a major substrate from another (unknown) language family. Sometimes, the ghostly presence of a long extinct language or language family can be detected through such substrates. Vedic Sanskrit, for example, was definitely an Indo-European language, but it was influenced not only by the preexisting Dravidian and Munda languages of the Indian subcontinent, but also by an unknown substrate deemed by Colin Masica “Language X.”

A useful alternative to the linguistic tree is the so-called wave model, or Wellentheorie, originally devised to explain some of the characteristics of the Germanic languages that seemed to defy the phylogenetic approach. In wave theory, fluid dialect continua replace the stable, geographically bounded languages required by models predicated on direct descent from ancestral tongues. Here, innovations can occur at any points within a dialect continuum; such changes then spread outward in a circular manner, eventually dissipating as the distance from the innovation center increases.*** If a bundle of innovations substantially overlap and become entrenched, a new dialect, or even language, can be said to have emerged. But according to wave theory, such a “language” is still best viewed as an “impermanent collection of features at the intersections of multiple circles.”

Wave theory does recognize, however, the fact that a single language/dialect can appropriate an entire dialect continuum, subordinating more localized speech forms and eventually driving them into extinction, as indeed was the case in regard to Standard German over most of Germany. Such a process, however, generally requires the power of the state or of some other overarching institution. Such geographically expansive and culturally potent organizations, however, are a feature of the relatively recent past; for most of humankind’s existence, the institutions necessary for producing linguistic standardization over broad areas were lacking. We are so used to the modern world of mass communication over vast distances and of language-standardizing governments and educational systems that we easily forget that in earlier times, and in many remote areas to this day, different linguistic environments prevailed. Overall, we suspect that for most of human history, the wave theory more accurately captures the process of language change than does the standard phylogenetic model. Yet in the most general terms, the two models complement each other relatively well.

*Debate does rage, however, about whether the so-called “non-configurational languages” such as the Australian language Warlpiri, have subjects and objects in the same sense as the more familiar, “configurational” languages like English or French. The reader is referred to Baker (2001) for evidence of subject-object asymmetries in such non-configurational languages.

**For example, Hindi makes a phonemic distinction between aspirated and unaspirated voiced stops, has fusional case/number morphology, subject-object-verb word order, postpositions, and uses the ergative-absolutive alignment in the preterite and perfect tenses; English, in contrast, has no aspirated voiced stops (and does not use aspiration phonemically at all), has largely abandoned fusional morphology, has lost the case system except with pronouns, employs a subject-verb-object word order, uses prepositions rather than postpositions, and is characterized by nominative-accusative alignment.

***Ironically, the diffusion analogy of Bouckaert et al. may be best suited to describing dialectal continua rather than divergence and expansion of languages and language families; we shall return to this point in a forthcoming post.



Baker, Mark C. (2001) The Natures of Nonconfigurationality. In Mark Baltin and Chris Collins (eds.) The Handbook of Contemporary Syntactic Theory. Oxford: Blackwell. Pp. 407-438.


Linguistic Phylogenies Are Not the Same as Biological Phylogenies Read More »