The Different Modes of Language Spread

In this second-to-last post on Indo-European origins and expansion, we turn once again to language diffusion, a cornerstone of the model employed by Bouckaert et al. A previous post asked whether languages actually spread by diffusion, arguing that the much more rapid process of advection is often more important. As was then pointed out, physical geographical factors, such as impassable mountains and fertile river corridors, guided such advectional movement. Today’s post considers language movement more generally, whether conceptualized as diffusion or advection, focusing more on the social than the natural environment.

A root error of Bouckaert et al. is regarding language expansion as a singular process. Actually, it can operate in two completely different modes: sometimes a language spreads with a group of people, and sometimes it does so among different groups of people. To put it in the most schematic terms, language movement occurs when a speaker moves from place A to neighboring place B, but it can also happen when a resident of A imparts his or her language to a resident of B. One process is basically demographic, the other conversional. In geohistorical terms, both forms of language expansion have been ubiquitous. They are generally meshed together in a complex manner, but sometimes one or the other process dominates. As they differ so fundamentally, they could not be realistically modeled in the same manner.

The clearest case of demographic expansion occurs when a single human group arrives on an uninhabited landmass and settles it. As the population expands in numbers and spreads geographically, its language will gradually differentiate into dialects and eventually into separate languages, as sub-populations pushing into new areas become socially separate and their forms of speech drift apart. Such linguistic differentiation could be arrested and reversed by state formation or the emergence of over-arching religious or other cultural institutions, but over the long span of the human past, divergence is usually the rule.

The settlement of Madagascar some 1,500 years ago is a prime example of such virgin-land expansion. Linguistic evidence confirms that the original Austronesian-speaking settlers arrived from Borneo in the Malay Archipelago. As their descendants spread over the mini-continent, their original language differentiated into dialects, some of which are regarded by linguistic splitters as separate languages (the Ethnologue lists ten). Later streams of migrants from the African mainland enhanced the island’s genetic diversity while introducing new linguistic elements, but the newcomers always adopted the language of the original settlers. As a result, all the indigenous forms of speech on Madagascar are very closely related, and are usually classified as variants of the single Malagasy macro-language.

Examples of the opposite process of conversional language expansion are common in today’s world. The process occurs whenever parents neglect to pass on their own mother tongue to their children, in favor of the language of one of their neighboring groups. Hundreds of languages have become endangered over the past generation alone by such changes in behavior. Most disappearing American Indian languages in the United States, for example, are in danger not because their populations are dying out or because their lands are being overrun by English speakers, but rather because decisions are made by parents to raise their children as English speakers.

Such processes of language abandonment and replacement are by no means limited to the modern world. A prime ancient example comes from the Philippine archipelago. Almost all Philippine languages belong to one branch of the Austronesian family, which is almost limited to the Philippines (see the map posted here). Such a pattern would seemingly indicate that the Philippines, like Madagascar, had been initially populated by a single group of settlers whose descendants subsequently spread over the archipelago as their language differentiated. But the actual demographic history of the Philippines was completely different. The original Austronesian settlers came to a land that had already been occupied for tens of thousands of years. Its indigenous* inhabitants were collectively called “Negritos” by Spanish authorities, a word meaning “small, dark-skinned people.” Their languages were undoubtedly unrelated to Austronesian, but we cannot say much beyond that. Although the Philippine indigenes have survived to this day, they abandoned their original tongues many centuries ago in favor of the Austronesian speech of the newcomers.

The social interactions between the Austronesian migrants and the indigenous inhabitants of the Philippines are poorly understood, but the key dynamics are evident. The newcomers were an agricultural people with much more highly developed technologies and forms of political integration than those held by the native foragers. The Austronesian migrants demographically overwhelmed most parts of the archipelago in short order, spreading their language(s) as well as their genes. Yet the indigenes held on in a number of rugged areas, particularly those characterized by heavy, year-round rainfall, such as the Sierra Madre Mountains of eastern Luzon** (in the winter dry season, the Sierra Madre catches rain from trade winds forced up-slope). From such redoubts, however, the indigenous foragers interacted extensively with their Austronesian neighbors, exchanging rain-forest products for agricultural and manufactured goods. Eventually, the languages of their trading partners fully “diffused” across their societies and then began to evolve in their own directions. Today, the several surviving “Negrito languages” are much more closely related to the languages of their neighbors than they are to each other. Strikingly similar processes have occurred elsewhere in the world. The most notable case is that of the “Pygmies” of central Africa, another group of diminutive, rainforest hunter-gatherers who long ago abandoned their own languages in favor of the tongues of their more numerous and powerful neighbors, in this case, languages in the Bantu sub-family of Niger-Congo.

The two cases explored above, Madagascar and northeastern Luzon, are best regarded as ends of a spectrum. Most examples of linguistic expansion involve both processes. When one language group expands it usually does so into the territory of a people speaking another language. As communication between natives and newcomers is essential, many individuals acquire a second language. Over time, such a process often leads to the linguistic conversion of the indigenous group, although the advancing group is sometimes converted instead, in which case the language frontier retreats. Such encounters are generally accompanied by some conflict, as the native inhabitants typically resent the incursions of the newcomers, who in turn often use force to advance into new lands. To the extent that the indigenes are able to resist the settlers, they will delay the linguistic expansion. The effectiveness of any such resistance in turn depends on the relative numbers of the two groups and on their levels of political and technological development. Any realistic modeling of linguistic spread must take such factors into consideration.

Patterns of physical geography play an important role here as well, as resistance by native inhabitants is usually more effective in areas of rough or otherwise difficult-to-traverse topography. In some cases, a particular climatic feature can stop language advance; the spreading Bantu-speakers, for example, encountered a firm barrier in the arid and Mediterranean climates of southwestern Africa, which precluded their farming practices and therefore created a refuge for peoples speaking Khoisan languages. Even the geometry of landmasses can play a role. As Anglo-Saxon speech spread across southern England, Celtic speakers were increasingly concentrated in the funnel-shaped peninsula of Cornwall, increasing their population density, shortening their defensive perimeter, and thereby enhancing their ability to resist the spread of English (further north, it was the rugged uplands of eastern Wales that afforded such protection). Yet again, all such features must also be taken into account by any effective attempt to model language spread.

The movement of one language group into the territory of another typically results in complex and variable linguistic interactions. Outcomes again depend heavily on relative numbers and different levels of technological and political development. When a large group of technically advanced people spreads over a landscape occupied by scant numbers of less technically advanced people, the linguistic impact can be minimal. As English advanced across Australia, for example, it picked up place names, animal designations, and words for unique landscape features (such as billabong) from Aboriginal languages, but not much more. But when two groups with more similar levels of development come into contact, much more intensive linguistic interactions typically result. Sometimes the linguistic substrates bequeathed by vanquished populations can be profound at both the grammatical and lexical levels, at other times they are of little significance, and occasionally they seem to be minor at first glance but turn out to be surprisingly important.***

When a language group moves into the lands of a different people, the initial linguistic development is often that of widespread bilingualism. If the newcomers are dominant, as they often are, the subjugated indigenes will find advantage in learning the new language, but even members of the dominant group sometimes acquire the native tongue. Gender relations typically play a crucial role here as well. Men from the more powerful group often take women from the subordinated people, insisting that their native wives learn their language. Such women do so imperfectly, often imposing upon it sounds, words, and grammatical patterns from their native tongue. When they pass down the transformed language of their husbands to their children, a certain degree of linguistic fusion results.

The preceding discussion only hints at the possible complexities involved in the linguistic interactions that occur when one language group pushes into the territory of another. Even so, it deeply challenges the diffusion model of Bouckaert et al. Rather than advancing by steady progression, an expanding language often moves forward in a spatially dispersed manner, as its speakers establish themselves as a dominant social stratum in a foreign land. Many members of the native population will learn the new language, but they will at first continue rearing their own children in their own tongue. After a number of generations of such bilingualism, most parents in the indigenous group may opt to acculturate their infants in their second languages rather than in their mother tongues. As a result, a language could “spread” almost instantaneously over fairly sizable areas. Over broader areas, however, such a process is likely to be patchy, with some areas “converting” much sooner than others.

A prime example of such uneven processes of language change comes from Anatolia. Most of the region was Greek-speaking in the 11th century when the Turkish influx began. By the 13th century most of Anatolia was firmly under Turkish rule, and by the middle of the 15th century Greek political power had vanished everywhere. Throughout this period, Turkish gradually supplanted Greek, but along both the Black Sea coast and that of the Aegean Sea, largely bilingual but primarily Greek-speaking communities persisted until the expulsions of the early 20th century. And as we saw in an earlier post, mixed “Turkish-Greek” forms of speech emerged in some areas.

A second major challenge to the diffusion model emerging from this analysis involves the unpredictability of language change when two (or more) linguistic communities come to occupy the same general territory. Although one might expect that the language of the dominant group would always prevail, that is obviously not the case; if it were, England would have switched to a Romance language after the Norman conquest, and Russia would have ended up with the North Germanic language of its Varangian rulers. Instead, England kept a Germanic tongue, and Russia a Slavic one.

Interesting examples of the uncertain nature of language change after a successful invasion come from the Danubian grasslands of central and southeastern Europe. From the fourth century to the ninth century CE, this area experienced four major incursions by non-Indo-European-speaking, militarily dominant, pastoral peoples from the steppe zone to the east: those of the Huns, the Eurasian Avars, the Bulgars, and the Magyars. All four groups built empires of a sort, and all subjugated the much more numerous local inhabitants. The Huns and the Avars, however, disappeared within a century or so with little trace, linguistic or otherwise. The Bulgars, on the other hand, built a kingdom so powerful that vestiges of it survive to this day in the form of Bulgaria, but their Turkic tongue vanished long ago, failing to maintain itself in the heavily Slavic environment over which the Bulgars ruled. The Magyars, by contrast, were able to firmly establish their language, which is spoken today by roughly 15 million people, even though the Magyars themselves were a relatively small group, substantially outnumbered by the peoples that they dominated.

Could one have predicted the fates of the Hunnic, Avar, Bulgar, and Magyar languages merely from the basic facts of their migrations, conquests, and state formations? I rather doubt it, as far too many contingencies were involved over long periods and broad territories. More to the point, could any such processes be successfully modeled as instances of linguistic diffusion? Here the answer must be a definitive “no.” Of course Bouckaert et al. would object here, as they rule out all episodes involving the “rapid” spread of a single language. Yet over the past several thousand years, the rapid spread of single languages has been the stuff of linguistic history over broad segments of the terrestrial globe. If such processes are ignored, nonsense necessarily results.


*The term “indigenous” becomes problematic wherever multiple waves of settlement have impacted a particular place. The term is used here in the relative sense, referring simply to groups that predated other groups with which they are compared.

**Intriguingly, the most rugged area of northern Luzon, the Cordillera Central, did not serve as a refuge for the indigenous hunter-gatherers, as all of its recorded ethno-linguistic groups are descended from the Austronesian migrants. The Cordillera, the site of my own doctoral research, is an unusual area in many respects, as it was historically characterized by higher population densities than those found in the adjacent lowlands to the east; dense populations, in turn, necessitated the construction of some of the world’s most elaborate agricultural terraces (see the photo to the left). In all likelihood, such high population density in the mountains resulted from Spanish pressure; residents of northern Luzon who did not want to submit to Spanish rule and forced Christianization fled to the uplands, where they had to build terraces in order to survive. Prior to this influx, small numbers of “Negritos” may have lived in parts of the Cordillera.

***Intriguingly, substrate influences that seem insignificant at first glance can actually turn out to be important. For decades, linguists looked for Celtic influences on English in the wrong places and thus could not find them; even such a recent, authoritative text as Baugh and Cable’s A History of the English Language (1993) states that, “Outside of place-names the influence of Celtic upon the English language is almost negligible” (p. 85). Currently, however, many of the linguistic peculiarities of English are being attributed to the Celts. These include the do-support construction (where do is required in questions and for negation), the diphthongization of long vowels (possibly, the first push that started the chain reaction of the Great Vowel Shift), expressing possession inside noun phrases, using the same –self items for reflexives (“John cut himself”) and intensifiers (“The president himself will visit”), using the same verb forms for both causative structures (“I broke the vase”) and inchoative ones (“The vase broke”), and the it-cleft (“It was a car that he bought”).



How Large Was the Area in Which Proto-Indo-European Was Spoken?

As the current series on the origin and expansion of the Indo-European languages nears its completion, only a few remaining issues need to be discussed. Today’s post examines once again the mapping by Bouckaert et al. of the area likely occupied by the speakers of Proto-Indo-European (PIE). The focus here, however, is not on the location of this ancestral linguistic homeland, which they situate in southern Anatolia, but rather on the size of the area over which the language was supposedly spoken. The area so depicted on their maps, it turns out, is almost certainly much too large to be credible. By mapping a Neolithic language as covering almost one hundred thousand square kilometers, Bouckaert et al. demonstrate, yet again, a fundamental failure to understand the basic patterns of linguistic geography.   

Bouckaert et al. give a surprisingly precise figure for the area that their model indicates as the probable homeland of Proto-Indo-European: 92,000 km², roughly equivalent to the extent of Hungary or of the American state of Indiana (see the yellow polygon in the map to the left). But given the characteristically opaque phrasing of the authors, it is not immediately clear if this zone is supposed to represent the actual (likely) spatial extent of the PIE-speaking community, or if it is merely supposed to show the broader area in which a much more spatially restricted language group was located. One can deduce, however, that the former argument is being advanced based on the authors’ framing of the spatial hypotheses supposedly advanced by two different proponents of the steppe theory:

The areas of the hypotheses are approximately 92,000 km² for the Anatolian hypothesis, 421,000 km² for the narrow Steppe hypothesis, and 1,760,000 km² for the wider Steppe hypothesis. So, these areas show a bias toward the Steppe hypothesis; the area covered by the narrow Steppe hypothesis is more than four times larger than that of the Anatolian hypothesis. Likewise, the area covered by the wider Steppe hypothesis is more then (sic) 19 times larger than that of the Anatolian hypothesis.
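The ratios in the quoted passage are easy to check. The short Python sketch below is only an illustration, not anything taken from Bouckaert et al.: it uses the three area figures exactly as quoted, and the function names are hypothetical. It also converts each area into the radius of a circle of equal size, a deliberate simplification (the mapped polygons are not circles) meant only to give a feel for the distances involved.

```python
import math

# Areas in square kilometers, exactly as quoted from Bouckaert et al. above.
AREAS_KM2 = {
    "Anatolian": 92_000,
    "narrow Steppe": 421_000,
    "wider Steppe": 1_760_000,
}


def ratio_to_anatolian(name):
    """How many times larger the named area is than the Anatolian one."""
    return AREAS_KM2[name] / AREAS_KM2["Anatolian"]


def equivalent_radius_km(area_km2):
    """Radius of a circle with the given area.

    Assumption for illustration only: the real polygons are irregular,
    so this is a rough yardstick rather than a measured distance.
    """
    return math.sqrt(area_km2 / math.pi)


if __name__ == "__main__":
    for name, area in AREAS_KM2.items():
        print(
            f"{name}: {ratio_to_anatolian(name):.1f} times the Anatolian area, "
            f"equivalent radius about {equivalent_radius_km(area):.0f} km"
        )
```

The quoted claims of "more than four times" and "more than 19 times" hold up (the ratios come to about 4.6 and 19.1). On the circular simplification, the 92,000 km² Anatolian zone corresponds to a radius of roughly 170 km.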

As can be seen in the map posted here, the area outlined by the “narrow Steppe hypothesis” fits precisely within the area demarcated by the “wider steppe hypothesis.” Such a depiction would not be logical if Bouckaert et al. were proposing that these “areas” were merely the proposed zones in which a more spatially restricted language had been located, as opposed to the probable zone that such a language actually covered. If the latter meaning had been intended, the “narrow Steppe hypothesis” would merely be a more precise version of the “wider Steppe hypothesis” rather than a different “hypothesis” altogether. One can thus conclude that the authors intend the yellow polygon to indicate the area over which Proto-Indo-European had been spoken, as posited by their model with the given parameters of uncertainty.


In the modern era, and to a significant extent across the past several thousand years, there is nothing unusual in a single language being spoken over a 92,000 square kilometer block of territory. But for such a situation to obtain, expansive spatial connectivity is necessary, which in turn depends on the power of the state or of some other form of social integration. In the world of Neolithic farmers, such regionally integrative institutions were almost certainly lacking, and as a result linguistic communities would have been much more spatially restricted. Such spatial limitations would have been even more pronounced in areas characterized by rough topography and formidable mountain ranges, as such barriers impede communication and thus enhance social and linguistic fragmentation. Yet as can be seen in the map posted here, Bouckaert et al. place the PIE homeland precisely in such a location. A single language spoken by tribal farmers over such a vast expanse of broken topography is all but impossible.

The situation in regard to the homeland identified by the steppe hypothesis would have been different. Under conditions of equestrian-oriented pastoral nomadism, linguistic communities could have occupied much larger territories than those found among agriculturalists living at the same time. The relatively flat topography of the steppe zone, moreover, would have allowed relatively easy communication among scattered groups. Sizable seasonal aggregations, often of a ceremonial nature, are also common under such circumstances, enhancing social solidarity over a broad expanse of land. But even given all of these considerations, the 421,000 km² and the 1,760,000 km² figures noted by Bouckaert et al. for the PIE homeland in two versions of the “steppe hypothesis” are still improbable. Geographically aware theorists thus tend to argue only that the original PIE homeland was situated in the western steppe zone, not over its full extent.

We cannot, of course, determine the areal extent of any prehistoric language, as the needed documentary evidence is lacking. It is tempting to associate specific languages with archeologically attested “cultures” that can be mapped, but it must be recalled that language often fails to correspond to groups defined on the basis of shared material culture; consider, for example, the “Pueblo Indians” and the Northwestern cultures of indigenous North America, both of which were highly multilingual, even at the language family level, yet substantially shared the same material cultures. Material culture, after all, is much more dependent on—and serves in part as an adaptation to—the physical environment, whereas languages seldom co-vary with physical geography; there is no way in which a certain word order pattern, or morphological type, or sound system would be more appropriate for any given landscape. All that we can do, therefore, is argue on the basis of contemporary analogues. Here we find that the areas covered by linguistic communities in those parts of the world that maintained “Neolithic” agricultural systems and forms of socio-political organization into modern times were of a restricted spatial scale. The archetypical location here is New Guinea, which is to this day characterized by pronounced linguistic fragmentation, as can be seen in the map posted here. One might object, however, on the basis that New Guinea is an extreme case and as such should not be used for comparative purposes. But in historically stateless areas elsewhere in the world, even where Neolithic technologies were superseded millennia ago, highly restricted linguistic territories remained the rule, as can be appreciated from the language map of central Nigeria posted here.* Maintaining a single language over an area as large as Hungary in such a context is highly unlikely, to say the least.

Similar objections apply to the mapping of the proto-languages of the major IE branches in Bouckaert et al. One must again consider the authors’ intentions in regard to their portrayal of these languages. It is not exactly clear, for example, what they mean by “the inferred location at the root of each subfamily is shown on the map” (see the map caption posted to the left). The “inferred location” of what? Presumably, they mean the inferred location “of the root,” and presumably “the root” refers to the proto-language that later generated each IE branch. It is still not clear, however, whether the colored areas are supposed to indicate the likely locations over which these proto-languages were spoken, or whether they merely show the probable zones in which much more spatially restricted languages were spoken. If the former scenario is indeed the case, the areas depicted are again much too large.

Of the “root languages” mapped on this figure, that of the Indo-Iranian languages is the most preposterous. The previous post specified most of the problems associated with this inferred location. The map posted here also shows the extraordinary disconnection between the existing archeological evidence and the spatial hypothesis advanced by Bouckaert et al. I would further note that the area they advance for the origin of the Indo-Iranian languages makes no sense from the standpoint of physical geography. Its western apex is located in the middle of the uninhabitable Dasht-e Kavir (Great Salt Desert), its central portion is situated in the heights of the Hindu Kush, and its eastern extremity lies in the fertile plains of Punjab. It is unthinkable that any sedentary Neolithic population would have occupied such a territory at any given point in time.

*One could, however, argue that New Guinea and central Nigeria are highly linguistically diverse in part as a function of time. Both areas have been inhabited by modern humans for a very long period. Most of Eurasia has been populated by Homo sapiens sapiens for considerably less time than West Africa, and to some extent even New Guinea (the presence of Neanderthals probably impeded the movement of modern humans into western Eurasia for millennia). As a result, one might expect somewhat greater linguistic differentiation in those places as compared to southern Anatolia. But it is also true that the Americas, which had been populated by modern humans for less time than western Eurasia, were also characterized by pronounced linguistic diversity. Significantly, agricultural areas in pre-Columbian North and South America that were not occupied by state-level societies were characterized by spatially restricted language groups.


The Consistently Incorrect Mapping of Language Differentiation in Bouckaert et al.

As mentioned in previous GeoCurrents posts, the animated map that accompanies the Science article of Bouckaert et al. depicts their model in action, showing the expansion and differentiation of the Indo-European languages in time and space. Earlier posts criticized the map’s contour shadings, which indicate high probabilities of IE languages being spoken in given areas at given times. Today’s post takes on a related issue, that of the branching lines that spread across the map as the presentation unfolds, indicating both linguistic relationship and the general directions of language-group expansion. Here we can clearly see that the model generates a nearly continuous stream of misleading information and outright error.

Analyzing the ramifying lines on the animated map is challenging. Nothing is labeled, colors are often hard to differentiate, and no key is provided. The companion website does promise a “legend for movie S1,” but provides only a brief caption: “Movie showing the expansion of the Indo-European languages through time. Contours on the map represent the 95% highest posterior density distribution of the range of Indo-European.” One must thus infer what the lines represent based on the supplementary text and on the manner in which different segments lengthen and divide in particular places as time proceeds.

Each line represents a branch of the Indo-European language family. Those that appear early in the animation indicate the deepest divisions, while those that emerge later represent the shallower splits of linguistic “sub-sub-families” and so on. In some cases, minor instances of linguistic differentiation are marked, extending down to the dialectal level. The North Germanic line, for example, begins to bifurcate on the Sweden-Norway border in the late 1700s, showing the divergence of Norwegian and Swedish, and then splits again in central Sweden in the mid 1800s, indicating differentiation that, according to the authors, produced three separate Swedish languages (see the maps below). Over most of the map, however, splits at the level of individual languages, let alone that of dialects, are not noted: if they were, the map would be so cluttered by the end as to be undecipherable. Yet again, consistency does not seem to be a priority.

The lines are not of uniform appearance. Older language stems are clearly depicted in a darker shade than more recent branches. As elsewhere, differentiating the hues employed is difficult, especially after the background color used to denote IE languages in general abruptly changes from yellowish-greens to shades of blue-green. (As a result of this problem, in some of the maps that follow I have changed the green lines under investigation to shades of red.) Interpreting differences in line shape and thickness is another challenge. Almost all lines are equally thick and even, extending uninterrupted across the map. In some instances, however, thin, irregularly shaped spurs emerge from the main stems, some of which eventually thicken and spread into new areas. Certain lines are interrupted, with unexplained gaps appearing on the map. Some of these gaps seem to indicate language divergence without diffusion, but others remain mysterious, as is the case with the differently shaped and colored line fragments that appear in what is now western Germany (see map detail to the left). By the end of the animation, Italy is covered by a jumble of oddly uneven and discontinuous lines that are almost impossible to parse out, as can also be seen on the map posted here.

The spatial extension of the lines over time seemingly indicates the pace of expansion of the various IE subgroups into new territories, while the shaded contours depict the expansion of Indo-European as a whole. The two methods of showing expansion, however, do not always correspond. While the 95 percent probability contour for IE as a whole never reaches Russia (except for a tiny zone near Pskov), the East Slavic line pushes well into what is now western Russia, although it does not do so until the early 1600s. Such a depiction is of course absurd on face value, as East Slavic languages had been spoken in this area and well beyond it for many hundreds of years; it must be recalled, however, that the animated map is designed to show only the latest possible time of expansion, not the actual period in which it occurred.

The major significance of the lines, however, is not their depiction of language group expansion but rather of linguistic divergence. The authors emphasize repeatedly that their animated map depicts the locations at which linguistic differentiation occurred, which in turn generated the branching patterns of the Indo-European tree. Although they formally model such divergence as occurring at precise points, they admit that it cannot be pinpointed in such a manner:

Our phylogeographic model allows us to infer the location of ancestral langauge (sic) divergence events corresponding to the root and internal nodes of the Indo-European family tree. Since we model internal node locations as points in space, our posterior estimate for the location of divergence events can be interpreted as a composite of the range over which the ancestral language was spoken and stochastic uncertainty inherent in the model.

Regardless of the uncertainty that the model encompasses, language divergence cannot realistically be modeled as occurring through discrete events that happen in restricted places. The differentiation of languages is rather a process that often occurs over an extended period through an expansive area of related dialects (see the earlier GeoCurrents post on the “wave model”). Leaving such objections aside, however, it must still be asked whether the model of Bouckaert et al. accurately depicts the generalized locations and timings of the divergence “events” that gave rise to the different branches of the Indo-European family, allowing that they did not occur at the precise points indicated on the map, but rather merely in the general vicinity of those places. Here the answer is—yet again—an emphatic “no.” As it turns out, virtually every depiction of linguistic differentiation that can be traced by historical sources is incorrect. Considering as well the erroneous mapping of linguistic expansion given by both the extending lines and the spreading contours, the animated map can only be regarded as a vast compendium of error. It is not that it fails to get everything right, but rather that it gets virtually nothing right.

To illustrate the level of error generated by the model, I will examine in detail the depictions of the expansion and differentiation of several branches of the Indo-European family. One could do the same for all IE sub-families, but such an exercise would be unnecessarily tedious. Before beginning the exercise, a few stipulations are necessary. To begin with, the following analysis is based strictly on the animated map, ignoring material found elsewhere in the article or website, which often runs against the cartographic depiction. While the authors note in their textual supplements, for example, that West Germanic speakers arrived in Britain around 400 CE, the map delays the event for several hundred years. Yet as we have previously seen, what such a cartographic portrayal actually means is that the diffusion of Germanic languages to Britain could have occurred no later than the date indicated by the map, within the general parameters of uncertainty allowed. My point, however, is that we know from historical sources that Germanic languages definitely arrived in Britain at a much earlier period, as the authors themselves acknowledge. If the cartographic depiction of the linguistic “Germanification” of Britain is thus not simply “wrong,” it is both misleading and exceptionally trite.

The Greek and Albanian subfamilies make a good starting point, as their cartographic depiction is particularly telling. Bouckaert et al. idiosyncratically regard Greek and Albanian as together constituting a distinct IE sub-family. (Most linguists regard Albanian as an IE isolate that shares certain affinities with Balto-Slavic, Germanic, and Greek; the Science authors classify it with Greek most likely on the basis of borrowed words, as the two languages have been in intimate contact for millennia). Their animated map depicts the ancestral Albano-Hellenic group as arriving on the eastern shores of the Greek Peninsula circa 3000 BCE, and then differentiating into the Greek and Albanian branches around 1500 BCE. Greek then pushes southward into Attica (the Athens area), while Albanian moves to the west into Thessaly in what is now central-eastern Greece. Subsequently, virtual stasis ensues for a few thousand years, with no significant movement of either branch and no further linguistic differentiation. Motion finally kicks in during the thirteenth century CE, when Albanian experiences a “divergence event” in central Greece and begins expanding to the west and north. By the 1500s, the northern Albanian branch finally reaches what is now Albania. At about the same time, the southern Albanian line begins a several-hundred-year maritime phase during which it diffuses across the waters of the Adriatic, finally reaching southern Italy in the 1800s.

The actual geo-histories of the Greek and Albanian languages are completely unlike the fantasy version advanced by the model. As it would again be wearisome to recount all of the many errors involved, I will focus instead on explaining why their depiction is so spectacularly wrong. As is generally true, the erroneous portrayals of these two language groups were predetermined by the error-pocked initial map of language distribution, ancient and modern, that informs the mathematical model. As was discussed in earlier posts, Illyrian, the likely progenitor of Albanian, is ignored, Ancient Greek is absurdly shown as limited to Attica, Albanian is unreasonably divided into four languages, and the areas occupied by Albanian-speaking communities in southern Greece are grotesquely exaggerated while those of Albania itself are absurdly reduced. As garbage is fed into the equations, garbage not surprisingly comes out.

The depiction of the Balto-Slavic languages is risible as well. This language sub-family is portrayed as branching off the main western IE stem circa 3000 BCE in the northern Danubian basin, and then as heading northward over the Carpathian Mountains into what is now central Poland. A small gap emerges on this line circa 950 BCE roughly along the Carpathian crest, which might indicate the Slavic languages differentiating from the Baltic ones. The Baltic line then continues to move northward, although it does not reach Lithuania until the fifth century of the Common Era. A Slavic spur, meanwhile, clearly emerges at roughly 300 BCE, again in the Carpathian Mountains, and begins to slowly creep southward in the early centuries of the Common Era. Diffusing back across the Danubian Basin, it reaches what is now Croatia in the 600s CE. By 900, it has extended as far south as Macedonia, at which point it breaks into several segments. East Slavic emerges out of the same Carpathian hub in the mid-900s CE, and then heads in a northeasterly direction; a hundred years later, West Slavic makes its appearance, branching off from roughly the same location. By the early 1600s, West Slavic has moved westward along the modern Czech-Polish border, approaching what is now eastern Germany. Over a hundred years later, it finally reaches the area now occupied by the Lusatian (Sorbian) speakers. Meanwhile, the East Slavic branch generates three smaller branches circa 1600 in the area where modern Poland, Ukraine, and Belarus converge; these twigs presumably represent Ukrainian, Polish, and Belarusian, which Bouckaert et al., and no one else, regard as forming a minor Slavic sub-family.

Everything that we know about the historical evolution and distribution of the Slavic languages directly contradicts the mapping of Bouckaert et al., as we should now come to expect. As it would again be tiresome to specify all of these errors, I will note only a few of the more glaring examples. First, it has long been established that the Slavic languages had expanded westward all the way to the Elbe River in what is now central northern Germany in the immediate post-Roman period, entering the lands that had been essentially abandoned by the Germanic tribes that invaded the dying Western Roman Empire. It is also understood that the process of Drang nach Osten in the high medieval period resulted in the re-Germanization of the far western Slavic lands, extending as far east as Silesia and Pomerania. The Lusatian-speaking areas, however, resisted this tide, and thus long remained as Slavic enclaves in a Germanic sea. Silesia and Pomerania, however, were in turn “re-Slavicized” after the post-WWII expulsions of German-speakers. The modeled spread of the South Slavic languages is equally off base. It is also well known that Slavic languages pushed southward into Greece beginning in the 500s and especially during the chaotic aftermath of the Byzantine coup of 602, reaching the central Peloponnesus by the end of the century. As Byzantine power collapsed through most of the peninsula, the Greek language retreated to coastal enclaves. The re-Hellenization of the Greek Peninsula did not begin until the reign of the Empress Irene in the late 700s, and was never fully completed. In regard to the East Slavic branch, numerous absurdities have been discussed in previous posts, and hence will not be recounted here.

Perhaps the most amusing depictions concern the expansion of Insular North Germanic, a minor branch that today includes only Icelandic and Faroese. Recall that Bouckaert et al. model the spread of languages over water the same way that they model it over land, only at a much slower pace (with the exception of their “sailor [sub-] model,” which postulates equal rates of expansion over water and land). But they always take expansion over any surface as a gradual, diffusional process; recall that instances of “rapid” expansion are purposively ignored, although the pace required for such a designation is never specified. The expansion of North Germanic languages to the islands of the North Atlantic is thus modeled as an example of conventional diffusion across isotropic space. The animated map thus shows the language group spreading out of northern Denmark in the 700s and heading into the North Sea. Some two hundred years later, these languages are portrayed as reaching the Faroe Islands, and by the mid-1000s they are shown as having finally landed on Iceland.

The only way to make sense out of such mapping is to imagine the speakers of these languages as living at sea on boats that remained relatively stationary over the course of many years, gradually diffusing to the north as the decades passed. The authors, I am almost certain, would object to this characterization, noting that their mapping of Insular North Germanic expansion is not actually meant to depict what it actually does depict (“the language could have arrived any time earlier than the date at which our model shows it as arriving”). The fact remains, however, that the ancestral language of Icelandic, Old Norse, arrived in Iceland by way of a few voyages that lasted weeks, not months or years, let alone centuries. This relatively well-attested process was intentional, can be dated relatively precisely to the late 800s, and is known to have been initiated largely by men from what is now Norway, although most of their wives/female-slaves were Irish (see, e.g., Bryan Sykes’ Saxons, Vikings, and Celts: The Genetic Roots of Britain and Ireland). By the explicit criteria specified by the authors, such a “rapid expansion of a single language” should have been ignored. But regardless of how such particular instances are handled, it is clear that if one insists on modeling the spread of languages to distant islands by a process of diffusion, nonsense necessarily results.

Finally, the portrayal of the Romance languages is equally ludicrous. The history of this group is particularly well known, as the spread and differentiation of the various Romance languages, all descended from Latin, occurred in relatively recent times and have been thoroughly documented in written sources, many of which Bouckaert et al. reference in their supplementary materials. Latin spread rapidly with the armies and administrative hierarchies of the Roman Empire, and its expansion is hence discounted by the model. As Latin expanded, it began to differentiate, a process that began well before the establishment of the Empire; as noted in a previous GeoCurrents post, a non-IE substrate on Sardinia evidently resulted in significant divergences from standard Latin on the island during the Republican period. Elsewhere, various vernacular forms of speech began to diverge under Roman rule, a process that accelerated after the fall of the Western Empire in the fifth century. The result was the establishment of a widespread Romance dialect continuum that eventually gave way, although never completely, to the standardized national languages of the modern era.

Now consider the manner in which Bouckaert et al. model the spread of the Romance languages. As they do not consider the initial expansion of Latin, they keep the Romance branch confined to central Italy until the fall of the Western Empire. As the empire weakens in the third century, new branches seem to emerge and begin to diffuse in this Italian heartland, although the color scheme leaves some doubt about this process (see the map call-outs). Romance languages clearly emerge in the following century, and by the early 600s one branch finally makes its way to what is now southern France, whereas another has extended to the middle of the Adriatic Sea. Three hundred years later, the western branch reaches the Pyrenees. In the twelfth century, another “divergence event” produces the group that encompasses French and Walloon; beginning along the Mediterranean coast, this division does not reach central France until the 1600s. The Iberian branch, however, is even more delayed, not reaching Portugal until the late 1800s. At about the same time, another Romance sub-family finally makes its landfall in Sardinia.

I anticipate that if the authors were to respond to such criticisms, they would charge me with engaging in a naively literal reading of their animated map. Language divergence “events” along a branching pattern of linguistic differentiation, they might insist, have to be mapped as if they took place at a single location, when in actuality the model supposes only that they took place somewhere within the much larger areas in which the given parent languages were spoken. Such an objection would be fair enough, but it still does not hold water if the actual differentiation processes took place hundreds of miles away from the areas indicated on their maps. In actuality, French emerged out of the Germanic-influenced “Vulgar” Latin dialect(s) of the Paris Basin, and subsequently spread outward, due in large part to the power and prestige of Paris and the French state. Significantly, it did not diffuse outward in an even manner, but rather spread to cities and towns well before it penetrated the countryside. French also expanded more slowly where it encountered markedly different dialects/languages, and where other Romance dialects had already established their own prestige registers. Yet again, the issue is not that Bouckaert et al. make a few mistakes and that we are unwilling to tolerate error, as has been charged. The issue is rather that their model gets just about everything wrong, often spectacularly so.



Sykes, Bryan (2007) Saxons, Vikings, and Celts: The Genetic Roots of Britain and Ireland. W. W. Norton & Company.

Linguistic Phylogenies Are Not the Same as Biological Phylogenies

(Note: This post is jointly written by Martin Lewis and Asya Pereltsvaig)

A key assumption of Bouckaert et al. is that the diversification and spread of languages operate so similarly to the diversification and spread of biological organisms that the two processes can successfully be modeled in the same manner. The parallels between organic and linguistic evolution are indeed pronounced. Both processes entail replicating codes that continually change, giving rise to novel varieties that increasingly differ from their progenitors over time. As a result, “phylogenetic trees,” showing descent from common ancestors, are a common feature of both evolutionary biology and linguistics.

But despite their similarities, organic evolution and linguistic evolution are in many ways highly dissimilar. Encoding information for communication is not the same as encoding information that generates life: language is vastly more fluid and complex than the genetic code; individual languages are much less clearly differentiated from each other than are species; and language is a social phenomenon, given to influences largely irrelevant for biological evolution. The key differences can be summarized as follows: biological evolution is unconstrained but governed by natural selection (any mutation can happen, but which mutations remain in the pool depends in large part on natural selection), whereas linguistic variation (seen in terms of deep grammatical properties) is constrained by a system of parameters but is not subject to natural selection. As a result, the branching trees of linguistic descent are merely analogous to the phylogenetic diagrams of biological evolution, and do not indicate the same kind of relationships.

Although organic evolution operates through a much more restricted set of message-carrying units than does human language, it nonetheless produces diversity at a much deeper level. Given the biological constraints of the human brain/mind (as of yet less than fully understood), there are only so many ways in which any given language can be structured. To be sure, the number of possible human languages, both extant and extinct, as well as those that may arise in the future, is vast, but all human languages appear to be “variations on a theme,” guided by the same parameters. Some languages have as few as two vowels (Ubykh, Northwest Caucasian) and others as few as six consonants (Rotokas, North Bougainville); other languages may have as many as 20 vowels (e.g. the Taa language, spoken in Botswana and Namibia, is reported by some sources to have as many as 20 or even 30 vowels, depending on analysis) and as many as 84 consonants (as in Ubykh; the Taa language is reported to have 87 consonants under one analysis, 164 under another). But crucially, all languages differentiate vowels from consonants and use both. Some languages put verbs before subjects and objects, while others place them at the ends of sentences, but all languages have verbs, subjects and objects.* Some languages can build sentence-long words packed with numerous prefixes, infixes, or suffixes, while others use stand-alone, stripped-down words to do the grammatical work of expressing tense, number etc., but all languages make words from morphemes, and all construct sentences. As a result of this limited space of possibilities, completely unrelated languages evolving on their own often come to share major grammatical traits.

Linguistic evolution, unlike that of the biological realm, moves at a rapid clip. In non-literate societies, words change so quickly that after some five to eight thousand years not enough cognates can be traced back to establish linguistic relatedness. In the same time span, grammatical structures can undergo wholesale transformations, and sound inventories can change drastically as well. As a result, even clearly related languages can have next to nothing in common with each other, and can only be linked through investigations into their ancestors. Hindi and English, two of the three most widely spoken Indo-European languages, are dissimilar in almost every respect.** On casual inspection, Hindi would seem to have more in common with the non-Indo-European languages of the Indian sub-continent than it does with English.

Thus, relatedness at the family level and overall linguistic similarity often fail to correspond. Maps showing major language patterns typically bear little if any resemblance to maps depicting linguistic families. Even something as seemingly basic as word order correlates poorly with lines of descent. For example, Indo-European languages can be SVO (subject-verb-object; marked by red dots on the map to the left), such as English, Romance, and most Slavic languages (but Sorbian, a Slavic language, is SOV); SOV (marked by blue dots), such as the Indo-Iranian languages (yet Kashmiri is SVO); or VSO (marked by yellow dots), such as the Insular Celtic languages (yet Cornish is SVO). Some other families, such as Austronesian, have an even greater variability in the basic word order:  Niuean is VSO, Malagasy is VOS, Rotuman is SVO, and Tuvaluan is OVS.

Similarly, features of morphological typology (how words are formed from morphemes) often cross-cut connections established by common descent. Whereas Proto-Indo-European, like most of its daughters, was a synthetic language (building words from multiple non-root morphemes), English and Afrikaans are relatively analytical (with low ratios of morphemes to words), which gives them a certain affinity with Mandarin Chinese (a highly analytical language). As discussed in an earlier GeoCurrents post, isolating languages are found in Africa (Hausa, an Afroasiatic language), Asia (Vietnamese, Austroasiatic), Oceania (Rapanui, Austronesian), and the Americas (Kipea, Kiriri). In phonology as well, similar patterns obtain, as sound inventories often fail to show systematic correspondences with language families. The Indo-European languages of South Asia, for example, are in many respects more phonologically similar to the Dravidian languages of the same region than they are to most other IE languages. One of the characteristic phonological markers of the region, the rich inventory of retroflex consonants, is also scattered across the rest of the world, found in about 20 percent of all languages belonging to a wide variety of families.

One of the best ways to appreciate the relative insignificance of language families in regard to the global distribution of such features is to explore the maps that can be generated on the WALS website, such as the one reproduced above. Few if any of these maps bear much resemblance to the familiar depiction of the world’s major language families.

Again, the contrast with biological evolution is stark. The farther removed organisms are from each other on the tree of life, the fewer genes they necessarily share. Even when convergent evolution results in similarities between distantly related organisms, the parallels are relatively superficial. As a result, modern genetic inquiry can establish precise levels of biological relatedness, a process that has revolutionized taxonomy over the past few decades. In the biological realm, moreover, the farther one moves up different branches of evolutionary descent, the more distinctive the organisms found along it generally become. Chordates (the phylum that includes vertebrates) share a distant common ancestor with echinoderms (sea stars and their relatives), and some tunicates, primitive members of phylum Chordata, might be mistaken by unschooled observers for sea lilies in phylum Echinodermata. (Tunicates more generally look like unrelated jellyfish and other cnidarians; a few could be mistaken for rocks, but such rocks disconcertingly bleed when cut open.) But no one would ever mistake any mammal for a sand dollar, a sea cucumber, or any other echinoderm, animals characterized by radial rather than bilateral symmetry. The two phyla have simply evolved in strikingly different directions. If linguistic evolution worked in the same manner, it is questionable whether translation between distant languages would even be possible. Moreover, the disparate patterns of spatial distribution of deep grammatical properties, such as the ones illustrated by the WALS maps, would not be found.

In language, deep grammatical properties can radically change, often taking on the same forms as those encountered in wholly unrelated tongues. As a result, linguistic relationships are often anything but obvious, and can only be discerned through intensive study; significantly, such hidden connections can hold true even for relatively recently emerged languages. A fluent speaker of the major Germanic languages, for example, might be nonplussed to learn that Frisian is more closely related to English than it is to Dutch. Yet according to some specialists, even Low German is “phylogenetically” closer to English than it is to (High) German, even though Low German is generally regarded as a mere dialect (or group of dialects) of German!

Linguistic evolution is only vaguely analogous to organic evolution for a variety of reasons, but a crucial factor is the fact that vastly less sharing occurs across biological lineages. We now know that genes can jump from one species to another, but the process is relatively rare; in this realm, change generally occurs as a result of random mutations acted upon by natural selection, not from the borrowing of elements from other species. When it comes to languages, however, sharing is ubiquitous. Languages are almost always borrowing words, and sometimes they adopt grammatical properties of other languages as well. At times, two completely unrelated languages essentially merge to create a hybrid tongue. To be sure, linguists are almost always able to determine which language contributed more elements and more basic structures, and hence should count as the parent tongue. (It should be noted that the use of the terms “parent” and “daughter” in relation to languages is misleading since, unlike in the biological realm, where individual organisms are discrete, the transition from “parent” to “daughter” language is always gradual.) When it comes to creole languages, however, such determinations are not always easy. In regard to grammar, different creoles of completely different parentage are often more similar to each other than they are to any of their source languages. In some instances of mixed languages, admixtures of vocabulary, grammar, and phonology run so deep that linguists abandon the quest for unambiguous classification. Cappadocian Greek, for example, is slotted by Wikipedia into the seemingly impossible “Greek-Turkish” language family. Does Indo-European therefore encompass this language? Other sources, such as the Ethnologue, place this language in the Greek branch of the Indo-European family, but Turkish influences on Cappadocian Greek are pronounced: it has certain sounds that have been borrowed from Turkish, as well as vowel harmony; it has developed agglutinative inflectional morphology and lost (some) grammatical gender distinctions; and its basic word order is SOV. And Cappadocian Greek is by no means the only example of such a thoroughly “mixed language.” In the biological realm, in contrast, such mixtures are so obviously impossible that they have generated their own nonsense genre, as exemplified by Sara Ball’s delightful flip-book, Crocguphant.

Linguistic family trees must therefore be taken as often showing lines of partial descent, unlike the phylogenetic diagrams of organic evolution. To gain a more complete understanding of linguistic relatedness, it is necessary to complement language families with other kinds of connections. The various languages of a Sprachbund, or a linguistic convergence area, for example, derive from different families, yet nonetheless come to share many features through long histories of mutual interaction. One must also consider linguistic strata, which take into account the influences imposed by one language on another. The role of a linguistic substratum, derived from a previously existing language that was later supplanted by another tongue, can be profound. In many cases, such linguistic substrates were instrumental in generating subfamilies; the Germanic languages, for example, are distinct from other Indo-European languages not merely because they drifted in their own particular direction, but also because they acquired a major substrate from another (unknown) language family. Sometimes, the ghostly presence of a long extinct language or language family can be detected through such substrates. Vedic Sanskrit, for example, was definitely an Indo-European language, but it was influenced not only by the preexisting Dravidian and Munda languages of the Indian subcontinent, but also by an unknown substrate dubbed “Language X” by Colin Masica.

A useful alternative to the linguistic tree is the so-called wave model, or Wellentheorie, originally devised to explain some of the characteristics of the Germanic languages that seemed to defy the phylogenetic approach. In wave theory, fluid dialect continua replace the stable, geographically bounded languages required by models predicated on direct descent from ancestral tongues. Here, innovations can occur at any point within a dialect continuum; such changes then spread outward in a circular manner, eventually dissipating as the distance from the innovation center increases.*** If a bundle of innovations substantially overlap and become entrenched, a new dialect, or even language, can be said to have emerged. But according to wave theory, such a “language” is still best viewed as an “impermanent collection of features at the intersections of multiple circles.”

Wave theory does recognize, however, the fact that a single language/dialect can appropriate an entire dialect continuum, subordinating more localized speech forms and eventually driving them into extinction, as indeed was the case in regard to Standard German over most of Germany. Such a process, however, generally requires the power of the state or of some other overarching institution. Such geographically expansive and culturally potent organizations, however, are a feature of the relatively recent past; for most of humankind’s existence, the institutions necessary for producing linguistic standardization over broad areas were lacking. We are so used to the modern world of mass communication over vast distances and of language-standardizing governments and educational systems that we easily forget that in earlier times, and in many remote areas to this day, different linguistic environments prevailed. Overall, we suspect that for most of human history, the wave theory more accurately captures the process of language change than does the standard phylogenetic model. Yet in the most general terms, the two models complement each other relatively well.

*Debate does rage, however, about whether the so-called “non-configurational languages,” such as the Australian language Warlpiri, have subjects and objects in the same sense as the more familiar “configurational” languages like English or French. The reader is referred to Baker (2001) for evidence of subject-object asymmetries in such non-configurational languages.

**For example, Hindi makes a phonemic distinction between aspirated and unaspirated voiced stops, has fusional case/number morphology, subject-object-verb word order, postpositions, and uses the ergative-absolutive alignment in the preterite and perfect tenses; English, in contrast, has no aspirated voiced stops (and does not use aspiration phonemically at all), has largely abandoned fusional morphology, has lost the case system except with pronouns, employs a subject-verb-object word order, uses prepositions rather than postpositions, and is characterized by nominative-accusative alignment.

***Ironically, the diffusion analogy of Bouckaert et al. may be best suited to describing dialectal continua rather than divergence and expansion of languages and language families; we shall return to this point in a forthcoming post.



Baker, Mark C. (2001) The Natures of Nonconfigurationality. In Mark Baltin and Chris Collins (eds.) The Handbook of Contemporary Syntactic Theory. Oxford: Blackwell. Pp. 407-438.


103 Errors in Mapping Indo-European Languages in Bouckaert et al. Concluded: Part V, Western Europe

By now, all of the cartographic failings of Bouckaert et al. have become familiar. On the map of France and neighboring areas, for example, we see the unreasonable elevation of minor dialects to the status of discrete languages (three forms of Breton make the list), the replacement of a non-Indo-European language with an Indo-European language (the Basque region is shown as French speaking), the improper use of political boundaries as linguistic boundaries (French is not shown as extending into Switzerland), the preferential classification of dialects as languages when they are associated with states (Walloon counts as a language, unlike the other equally distinctive langues d’oïl of northern France or the langues d’oc of southern France; Flemish counts as a language, unlike other equally distinctive forms of Dutch), and the simple geographical misplacement of languages (Romansh is placed in northwestern Italy rather than southern Switzerland). Of particular note in regard to the linguistic mapping of France is the fact that Corsica is completely obliterated by circle #48 (see the map of the Italian Peninsula in the previous post).


The mapping of the Iberian Peninsula is particularly simplistic. The authors have simply placed Portuguese in Portugal and Spanish, along with Catalan, in Spain. The fact that Galician in northwestern Spain is closer to Portuguese than to Spanish is ignored, and the Basque-speaking region is mapped as if it were Spanish speaking. The Balearic Islands are also neglected, as archipelagoes generally are in the authors’ land-biased approach.


The map of the British Isles severely misconstrues the Celtic tongues. Irish, for example, is shown as extending across all of the Republic of Ireland and as entirely absent from Northern Ireland. In actuality, Irish has long been largely limited to the western margin of the island, and as late as the early 20th century was still spoken in parts of what was to become the political unit of Northern Ireland. The mapping here, in other words, is yet again political rather than linguistic. By the same token, Welsh is placed in the coal-mining districts of southern Wales where it has been absent for generations, just as Cornish is depicted in areas where it has not been spoken for hundreds of years. The mapping of Scottish Gaelic is not bad, but the term used (“Scots Gaelic”) is off the mark. The proper term is “Scottish Gaelic,” as “Scots” refers to a different language altogether. Scots, or Lowland Scots, is usually regarded as a highly distinctive form of English, but some linguists regard it as a language in its own right (CNN has recently reported on the demise of one of its dialects).*




The mapping of extinct languages is also poorly executed. Old English is essentially restricted to the historical kingdom of Wessex, even though the language extended as far north as the Edinburgh region of what is now southeastern Scotland, and included dialects of Kent, Mercia, and Northumbria. Significantly, even the Wessex (West Saxon) dialect of Old English extended farther to the east than what Bouckaert et al. would allow for Old English in its entirety.

The language map of Bouckaert et al. that I have criticized over these past five posts is a cornerstone of their model, yet it is also wholly inadequate for the task. Many of the errors found here ramify through all of the maps that they have produced. But even if a serviceable map had been constructed, the model would still yield nonsense, as most of the assumptions upon which it is based are unwarranted, as we shall see in more detail in subsequent posts.

*Although I am no expert on this topic, I would argue that Lowland Scots is almost but not quite interintelligible with Standard English, especially in its spoken form, and thus deserves to be regarded as a separate language. Although I love the poetry of Robert Burns, I generally need translation. Take, for example, these verses from “Auld Lang Syne”:

In the Original Scots:

We twa hae run about the braes,

and pu’d the gowans fine;

But we’ve wander’d mony a weary fit,

sin auld lang syne.


We twa hae paidl’d i’ the burn,

frae morning sun till dine;

But seas between us braid hae roar’d

sin auld lang syne.


In Standard English:

We two have run about the slopes,

and picked the daisies fine;

But we’ve wandered many a weary foot,

Since long long ago.


We two have paddled in the stream,

from morning sun till dinner time;

But seas between us broad have roared

since long long ago.


Or listen to the delightful poem “To a Mouse” on the ScotsIndependent website:


Wee, sleekit, cow’rin, tim’rous beastie,

O, what a panic’s in thy breastie!

Thou need na start awa sae hasty

Wi bickering brattle!

I wad be laith to rin an’ chase thee,

Wi’ murdering pattle.


103 Errors in Mapping Indo-European Languages in Bouckaert et al., Part IV (Central Europe)

(Continued) The main problems with the language map of eastern Central Europe in Bouckaert et al. have already been discussed; to wit, the depiction of “national” languages as coterminous with state boundaries. The authors do occasionally deviate from this norm, showing, for example, a tiny non-Romanian area in northwestern Romania. Note also that they show Latvian as failing to reach Latvia’s northwestern coast. This view is indeed historically accurate, as northern Courland was the land of the Livonians, a Finnic-speaking people. The last native speaker of Livonian, however, died in 2009; for decades before that, Livonian was severely endangered and most speakers were bilingual in Latvian or Russian. If the map purports to depict the present situation, it is flatly wrong here. If it depicts the relatively recent past, as it does for some areas, it is more on target. Unfortunately, no time specification is provided.

Such unspecified chronology is a more intractable problem for the depiction of extinct languages. Major languages of the distant past often experienced major geographical changes, sometimes literally moving en masse when their speakers migrated. The Goths, for example, probably originated in what is now Sweden, later crossed the Baltic into northern Central Europe, subsequently moved into the steppes north and northwest of the Black Sea, and eventually spread with victorious warrior bands over much of the Roman Empire; the final redoubt of the language was the Crimean Peninsula, where it persisted until the ninth century and perhaps until early modern times. Any Gothic language polygon would thus fit a specific place only at a specific time. Bouckaert et al. have apparently selected the period just after the movement of Gothic out of Scandinavia, although the area specified does not seem to match what (little) is known about the early relocation of the language (see the map to the left).

As mentioned in the previous post, the placing of Byelorussian (Belarusian) in a small corner of the Czech Republic is a careless transcription error. But the intended depiction, that of Eastern Czech, is still off base. Czech is not heavily differentiated into dialects. The truly distinctive forms of the language are halfway to Polish. Cieszyn Silesian and other Lach dialects are regarded by most Czech linguists as Polish-influenced forms of Czech and by most Polish linguists as Czech-influenced forms of Polish (politics do tend to intrude into linguistic discussions). Such dialects, however, are not on the map. What is (supposed to be) shown is “Eastern Czech,” placed in a small corner in the southeastern part of the Czech Republic. It is unclear what this designation refers to. Across the entire eastern half of the republic, one finds the Moravian dialect (or dialects), which is not strikingly different from standard Czech.

The linguistic depiction of the Italian Peninsula in Bouckaert et al. contains some curious features. This portion of the map is difficult to decipher, as extinct languages overlay extant languages, and much of the area is covered by the circular labels. It is still clear, however, that the mapping here remains inconsistent. Italian is shown as extending neither into the Po Valley in the north nor to Sicily in the south. Fair enough: the local dialects spoken (or spoken until recently) in those areas are markedly different from Standard Italian, which is based on the Tuscan dialect. Yet the authors place other parts of the peninsula with equally distinctive dialects, such as Apulia in the southeast, in the Italian language category. In regard to the extinct Indo-European languages mapped here, the major issue is why only Umbrian and Oscan were selected to accompany Latin.




Most of the problems found on the map of Germany and environs have already been discussed. Note, for example, how Luxembourgish makes the cut on political grounds, whereas other distinctive German dialects are ignored. Of special note here is the demarcation of two Lusatian (or Sorbian) languages, although only one is labeled on this map segment. These Slavic tongues of eastern Germany are distinctive, and mapping them as separate languages makes linguistic sense. But it is difficult to understand why these relatively minor languages, with 40,000 and 10,000 speakers respectively, have been added to the tally, whereas Iranian and Indic I-E languages with hundreds of thousands to tens of millions of speakers have been ignored.

The language mapping of Scandinavia shows, yet again, striking geopolitical influence. Here we have Danish blanketing Denmark, Riksmal (or the Norwegian “national language”) everywhere in Norway except the islands and Finnmark, and three separate Swedish languages covering all of Sweden except the islands, which remain unmarked. The straight east-west line that separates two supposedly distinct Swedish languages is a curious and highly unlikely feature.

But as one would expect, the continental Scandinavian languages do not actually correspond so well to national territories. Overall, the region is characterized by a dialect continuum so pronounced that some scholars regard all of the mainland North Germanic tongues as a single, regionally differentiated language. Swedish and Danish are almost interintelligible, and Norwegian is often regarded as a kind of bridge: as a common saying puts it, “Norwegian is Danish spoken in Swedish.” (Norwegian vocabulary is similar to that of Danish, whereas its phonology is more like that of Swedish.) But it is more complicated than that, as there is no single Norwegian language at any level. Local dialects cross the border with Sweden, but even in terms of official state recognition, Bokmål (“book language”) competes with Nynorsk (“New Norwegian”), and neither of these two variants is exactly the same as the standardized but non-official Riksmål (“national language”) and Høgnorsk (“High Norwegian”) forms. The differences between Bokmål and Nynorsk are not purely lexical (e.g. Bokmål pike ‘girl’ vs. Nynorsk jente ‘girl’), but concern grammatical patterns too (e.g. Bokmål does not distinguish masculine and feminine genders, whereas Nynorsk does). In a sense, the differences between Bokmål and Nynorsk are more pronounced than those between Bokmål and Danish (e.g. the Danish word for ‘girl’ is pige, and most dialects of Danish and its standardized form do not distinguish masculine and feminine genders). The contention among these different language varieties is at once political, cultural, and historical, tied up with Norway’s former subordination to Denmark. Norwegian linguistic nationalists have often wanted to purge specifically Danish elements from the language, whereas linguistic traditionalists would like to preserve them.

Legacies of geopolitical change are also evident in the Scania region of southern Sweden. The dialects of Sweden’s far south are close to those of Denmark—so close, in fact, that some scholars place them within an “East Danish” category. Significantly, Scania was part of the Kingdom of Denmark until it was lost to the rising power of Sweden in 1658; it did not become an integral part of Sweden, however, until 1719, at which point a policy of linguistic “Swedenization” was initiated. “East Danish” is thus considered by some to be more a historical than a linguistic category.

One of the oddest features of the mapping strategies employed by Bouckaert et al. is their reluctance to include islands within the territories of any language. In some cases, island groups are appended to mainland polygons, as can be seen here in the depiction of Danish (in the same manner, the Hebrides are mapped as Scottish-Gaelic speaking). Most often, however, islands and archipelagos are simply ignored, as one can see here in the cases of Norway’s Lofoten and Sweden’s Gotland and Öland. Had Gotland been considered, I wonder whether it would have been mapped as Gutnish speaking. Gutnish, a disappearing dialect, is distinctive, and is sometimes said to be a direct descendant of ancient Gothic.

The mapping of Old Norse as coinciding with Iceland is also untenable. When Old Norse was spoken in Iceland it was also spoken in Norway, Sweden, Denmark, northern Scotland, and pockets of the western British Isles.


103 Errors in Mapping Indo-European Languages in Bouckaert et al., Part III: From Western Russia to the Balkan Peninsula

(Continued) The most glaring error in the linguistic map of western Russia and environs by Bouckaert et al. concerns the labeling of Belarus. The number “22,” placed in the center of the country, is listed as signifying the “Czech E,” which presumably means “eastern Czech.” As the authors have correspondingly appended the label “Byelorussian” to a small area in the eastern Czech Republic, the error is obviously one of transposition. Such mistakes can occur inadvertently, although the fact that it has gone undetected indicates a troubling failure to engage in routine proofreading.

A much deeper problem is indicated by the intentional mapping. Note how the polygons indicating the Belarusian and Ukrainian languages correspond precisely to the present-day territories of Belarus and Ukraine respectively. Such exact political-linguistic correspondence is rare, and when it is encountered it generally indicates a recent history of state-led linguistic repression or ethnic cleansing, which should be taken into account in any historical consideration of linguistic geography. In the case of Belarus and Ukraine, however, the current distribution of the national languages does not even come close to fitting precisely within the geographical bodies of the respective countries.

Belarusian is widely spoken in Belarus but it is not the country’s majority language and it is dominant only in the west and the south, as can be seen on the Wikipedia map posted here. Even in these areas, Belarusian is losing ground among the young, and is thus classified as a “threatened language.” The threat stems from Russian, which, according to the 2009 national census, is spoken at home by 72 percent of the people of Belarus. Identifying the Belarusian language with the national territory of Belarus is—yet again—a political rather than a linguistic statement.

Placing the Ukrainian language precisely within the territorial bounds of Ukraine is an even more egregious error. The fact that eastern Ukraine and the Crimean Peninsula are mostly Russian-speaking areas is well known, as it is mentioned almost every time that Ukrainian elections are discussed. According to the Constitution of the Autonomous Republic of Crimea, Russian rather than Ukrainian serves as the “language of interethnic communication.” Moreover, government duties in Crimea are fulfilled mainly in Russian, hence it is a de facto official language. The issue of whether Russian should be made co-official in other areas of Eastern and Southern Ukraine that are already de facto Russian-speaking is hotly debated at the parliamentary level. Before WWII, moreover, the linguistic map of the region was far more complex than it is now, an observation that holds true for most of eastern and central Europe. The southern Crimea, for example, was then dominated by people speaking Crimean Tatar, a language in the Turkic family.

The depiction of European Russia is little better. In this case, political boundaries are not slavishly followed, as large areas of northern Russia are correctly shown as non-Russian speaking. But many northern regions that are Russian-speaking, such as Saint Petersburg, are oddly excluded from the realm. Conversely, sizable areas in eastern European Russia are mapped as Russian-speaking when in actuality they are inhabited by peoples speaking Uralic and Turkic languages. It is admittedly difficult to map such languages as (Volga) Tatar, Mari, and Udmurt, as they are not spoken in geographically contiguous areas but rather form archipelagos in a Russian sea. But do such technical challenges warrant the exclusion of such languages? More than six million citizens of the Russian Federation speak Tatar as their first language, and mapping them as if they were Russian speakers fails to give them the recognition that they deserve. The Udmurt language, spoken by about half a million people, was recently propelled into the public spotlight in Russia and in the rest of Europe when a band of Udmurt-speaking (and -singing) grandmothers won second place at the Eurovision Song Contest.

Such mapping difficulties are by no means limited to western Russia. In many parts of the Indo-European realm, languages are interspersed, forming complex amalgams. As mentioned above, such mixtures were much more intricate before the horrors of the Second World War and its immediate aftermath. Depicting such areas as linguistically uniform, as Bouckaert et al. routinely do, thus results in intrinsic distortions. Such distortions, moreover, seem to be a necessary feature of their basic methodology, as they depict every language within a discrete and uniform polygon. Linking together languages whose speakers are scattered in separate communities over large areas into single bounded spaces results in such absurdities as the gerrymandered Kurdistan mentioned in the previous post.

Such procrustean tendencies reach a laughable extreme in the depiction of the Romani language (that of the so-called Gypsies), seen on the map of the Balkans posted to the left. Romani, labeled 74, is impossible to locate precisely, as the area indicated is covered by circle 16 in western Bulgaria. Presumably, a small, discrete Romani polygon lies below this numerical tag. To restrict the Romani language to this area is beyond absurd. Romani, like the Roma people who (sometimes) speak it, is dispersed over most of Europe. Bouckaert et al., however, do not even manage to adequately locate the language’s center of gravity, as far more people speak Romani in Romania than in Bulgaria. Mapping Romani is, of course, an extraordinarily difficult task, as the linguistic community is not only scattered widely, but its members often relocate. As a result, most cartographers simply indicate the numbers and percentages of Romani speakers (or Roma people more generally) found in different countries.

The rest of the map is not much better. Although the authors differentiate four separate Albanian languages, they depict the northern half of Albania as non-Albanian speaking. They also limit Serbo-Croatian to Serbia and Montenegro, excluding Croatia and Bosnia. Here the categories used and the map itself fail to correspond; what the map shows is the political-linguistic construct of Serbian (plus Montenegrin), used since the break-up of Yugoslavia, whereas the label turns back to the Yugoslavian idea of a single Serbo-Croatian language, which also encompasses Bosnian and Croatian. From a linguistic standpoint, Serbo-Croatian works best, as all of its politically standardized forms are mutually intelligible to some degree. But by the same token, Bulgarian and Macedonian, shown here as separate languages, are similarly interintelligible. The underlying problem here is the lack of uniformity in the treatment of different languages: if they have four Albanian languages as well as separate languages in Bulgaria and Macedonia, they should have separated Serbian, Croatian, Bosnian, and Montenegrin—or better still, they should have differentiated the non-political dialectal divisions of Serbo-Croatian: Chakavian, Kajkavian, Western Shtokavian, Eastern Shtokavian, and Torlakian.

Finally, the mapping of Greek, both ancient and modern, is bizarrely idiosyncratic.  On what possible basis could the authors limit ancient Greek to Athens and its vicinity? The implicit argument here is that only Attic Greek was Greek, with the other Hellenic polities speaking non-Greek languages, a nonsensical idea. And yet they don’t even manage to map Attic Greek properly, leaving out the islands on which it was spoken. One can only conclude that the authors are incompetent at mapping languages, a cornerstone of their approach.


103 Errors in Mapping Indo-European Languages in Bouckaert et al., Part II: from Afghanistan to Anatolia

(Continued) Moving westward, the linguistic mapping of Iran and environs by Bouckaert et al. contains roughly the same density of error as that of South Asia. As most of these mistakes are noted in map call-outs, and others have been discussed in previous posts, I will focus here on the authors’ misperceptions about the Persian language.

The authors have divided Persian into two languages, labeled “Persian List” and “Tadzik” (a non-standard spelling of “Tajik”). Linguists, however, generally agree that Persian is a single language, albeit one with ten or so dialects, three of which serve as standard literary forms. These three official varieties are labeled Western Persian (or Farsi), found primarily in Iran, Eastern Persian (or Dari), spoken mostly in central and northern Afghanistan, and Tajik Persian (or Tajiki), located in Tajikistan and Uzbekistan. One would have to take an extreme “splitting” position to regard Farsi and Tajik Persian as separate languages. As the Wikipedia notes, “Persian-speaking peoples of Iran, Afghanistan, and Tajikistan can understand one another with a relatively high degree of mutual intelligibility, give or take minor differences in vocabulary, pronunciation, and grammar—much in the same relationship as shared between British and American English.” (It is also significant that the Tajiks historically call their tongue Zabani Farsī). And if separating Farsi and Tajiki is problematic enough, ignoring Dari Persian, spoken by 15-18 million people, is absurd. Doing so sunders the geographically contiguous Persian zone into two widely separated language zones.*

The most glaring blunder on the map of Anatolia and environs concerns the delineation of Kurdish. Here the main problem is the opposite of the one encountered in regard to Persian: several clearly separate languages are lumped together. By strictly linguistic criteria, Kurdish is a subfamily of related tongues. As the Wikipedia puts it, “Kurdish is not a unified standard language but a discursive construct of languages spoken by ethnic Kurds, referring to a group of speech varieties that are not necessarily mutually intelligible …” Kurdish proper is itself divided into two (or three) languages: Kurmanji, Sorani, and, sometimes, Kermanshahi. Philip G. Kreyenbroek, cited in the Wikipedia article referred to above, claims that, “From a linguistic or at least a grammatical point of view … Kurmanji and Sorani differ as much from each other as English and German.” The idea of a single Kurdish language is once again a political construct, albeit one based not on an actual political unit, but rather on the aspirations of most Kurdish people for a state rooted in trans-linguistic ethnic solidarity.

Not only do Bouckaert et al. elide the distinction between these two Kurdish languages, but they also subsume another language into the same category. The language in question is Zazaki (1.5-2.5 million speakers), located in the central part of eastern Turkey. The Zaza people are usually considered by others, and often by themselves, as members of the wider Kurdish ethnic formation, but their language is quite distinctive. It is most closely related to Gorani, spoken in Iran to the south of the Kurdish zone, but it also bears affinity with Talysh, another Iranian language ignored by Bouckaert et al.

Not only are the Kurdish languages misclassified, but they are also inaccurately mapped. The Kurdish polygon of Bouckaert et al. is truly peculiar, as it excludes the southern part of the Kurdish region (most of the Sorani-speaking zone) while including a western extension into mostly non-Kurdish-speaking areas. Its longer eastern “panhandle” pushes far enough to take in the Kurdish areas in northeastern Iran, but in the process includes non-Kurdish areas along the Caspian Sea and in the Alborz (Elburz) Mountains. Such a fanciful depiction brings to mind the infamous “Gerry-Mander” of U.S. political history. If oriented conventionally, with north at the top, Bouckaert’s gerrymandered Kurdistan reminds me of a lounging rodent; if tilted on its side, it looks more like a galloping dinosaur.

I have also posted an excellent map of the ancient Anatolian languages, which makes a nice contrast to the simplistic depiction of these tongues in Bouckaert et al.

*As a final note on this map, western Afghanistan, a mixed Dari- and Pashto-speaking area, seems to contain an unlabeled polygon for a modern Indo-European language, which I have marked with a question mark.


The Misleading and Inconsistent Language Selection in Bouckaert et al.

To successfully model the spread and divergence of a language family, one must select languages for one’s data set in a comprehensive, balanced, and consistent manner. Results will be skewed if large numbers of languages are excluded from analysis, if some regions and linguistic branches are covered much more thoroughly than others, or if both dialects and languages are selected based on different criteria in different parts of the world. Bouckaert et al., unfortunately, do all of this and more. The authors favor certain areas and linguistic sub-families, minimizing others. Biases relating to preservation and examination seem to guide most such decisions. Most extinct Indo-European languages that are well documented, such as Old English and Old Norse, are included in the analysis, whereas those that are poorly known, such as all of the Scythian languages of the hypothesized proto-Indo-European homeland in the Pontic Steppes, are simply ignored. Likewise, living languages that have been intensively studied get preference over those that have not received similar scrutiny. Selecting and ignoring languages in such a manner may be convenient for formal modeling, but deep and systematic distortions result.

One of the more vexing issues in linguistics is the differentiation of languages from dialects. As in biological taxonomy, “lumpers” argue endlessly with “splitters.” Whether one accepts either position is immaterial for formal analysis, but one must maintain consistency. Bouckaert et al., however, shift wildly from fine splitting to gross lumping. Their treatment of Albanian exemplifies the former approach, as they divide it into four separate languages (listed as Albanian C, Albanian K, Albanian G, and Albanian Top). Albanian is indeed divided into Gheg and Tosk, which can easily count as separate languages, but no other dialects approach such status in most divisional schemes. The split-happy Ethnologue, however, does count two minor Albanian dialects in Italy and Greece—linguistically indistinct from Tosk in Albania—as separate languages, an approach that Bouckaert et al. chose to follow. In several other parts of Europe they adopt a similar method, classifying Breton as three separate languages, Sardinian as three, and the minor Slavic tongue of Lusatian (also known as Sorbian) as two. But elsewhere in Europe they reject such fine divisions. They take Serbo-Croatian, for example, as a single language, yet oddly give it the ISO code for its Bosnian dialect [BOS]. They also regard German as one tongue; if they had remained consistent and followed the Ethnologue here, they would have included such languages as Bavarian, Mainfränkisch (East Franconian), Pfalzisch, Upper Saxon, and Swabian. In South Asia and the Iranian zone, the authors’ “lumping” tendency reaches an extreme. They count Hindi as a single language despite its pronounced dialectal variation (even the Wikipedia discusses the “Hindi languages”). They do the same with Lahnda, a dialect continuum that encompasses, according to the Ethnologue, eight separate languages.

Bigger problems for Bouckaert et al. are encountered in their basic enumeration of the Indo-European languages of Asia. Whereas the comprehensive Wikipedia family tree for the Iranian branch of Indo-European includes more than fifty extant languages, the selective approach of Bouckaert et al. considers only nine. The authors are even more remiss when it comes to the Indo-Aryan languages of northern South Asia. Punjabi, widely regarded as the world’s tenth most widely spoken language with more than 100 million speakers*, is nowhere to be seen. Whereas the authors list only fifteen extant I-E languages in South Asia, the Ethnologue counts more than 200. A few of the major Indo-Aryan languages discounted by Bouckaert et al. include Rajasthani (20 million** speakers), Bhili (1.5 million), Sylheti (10 million), Garhwali (3 million), Kutchi (2 million), Awadhi (38 million), Kannauji (6 million), and Bhojpuri (38 million). Yet in one part of the region, they abruptly switch to an idiosyncratic splitting approach, differentiating the Waziri dialect from Pashto, which they oddly call “Afghan.” The major split in this language, the north/south divide between “Pashto” and “Pakhto,” however, remains invisible.

By including European I-E languages much more readily than non-European ones, the authors evince a form of Eurocentrism. The same tendency is encountered in their treatment of extinct languages. For western and central Europe, nine dead languages are listed, including Old Irish, Old High German, Old English, and Old Prussian. Fair enough. But for northern South Asia, an area of roughly similar territorial extent and historical population levels, only Vedic Sanskrit makes the list. The many extinct Prakrit languages are excluded without reason. Here preservation bias cannot be the culprit, as a number of these languages are relatively well known. Even Pali, a semi-living language owing to its liturgical position in the Theravada Buddhist community, is inexplicably left off the map.

The Bouckaert model stumbles even more sharply in regard to extinct Iranian languages. Only two are included: Old Persian and Avestan. Major Eastern Iranian languages that were once important literary vehicles, such as Sogdian, Bactrian, Khotanese and Khwarezmian, are simply disregarded. So too are the less well-known Scythian languages of the steppe zone.*** As noted in previous posts, had the Scythian languages been included in the model, the geographical patterns generated would likely have been quite different. Although one could argue that the Scythian languages are not known well enough to have been used, such an argument amounts to an admission that preservation bias compromises the approach. The failure to include well-known Sogdian, on the other hand, cannot be attributed to preservation bias, and is perhaps rooted instead in carelessness, ignorance, or the simple desire to mold the data in order to reach pre-established conclusions.

As the supplementary materials make clear, the authors of the study are fully aware that they have excluded a number of Indo-European languages, both living and dead. Yet in an interview with Isabelle Boni for the general public, co-author Quentin Atkinson maintains that “we compare these words across all Indo-European languages” (emphasis added). Such a statement is careless and misleading at best.

*Admittedly, Western Punjabi is sometimes counted as one of the Lahnda languages, but not Eastern Punjabi.

** The 20 million figure used here assumes that Marwari is counted as a separate language, as it is in Bouckaert et al.

***It is also notable that the Indo-European Thracian language(s), along with the other Paleo-Balkan languages, are likewise ignored.


On Mathematical Modeling and Inter-Disciplinary Work in Historical Linguistics: A Reply to Alexei Drummond—and a Friendly Critique of the Field

We would like to thank everyone who has posted comments on our recent posts on Indo-European linguistics, whether favorable or critical. As we have been highly critical ourselves, we can only expect the same in return; such is the give-and-take of the scholarly endeavor. We will post detailed replies to critical comments next week, after Asya Pereltsvaig returns from her travels. The present post responds only to the first comment posted on GeoCurrents by one of the co-authors of the Science article that we have taken to task. In that response, Alexei Drummond takes on some significant epistemological and methodological issues that demand a considered answer. As Drummond argues:

Personally I would love to include more direct evidence-based information into the computational analysis to correct the details (and see if that changes the main inference of the location of the origin), but that would require the linguists and archaeologists to actually embrace the value of computer models to synthesize large amounts of data. How can a human mind, however elegantly expressed its written conclusions, correctly balance the thousands of items of evidence to provide a probabilistic statement about history in a way that others can verify (i.e. The Horse, the Wheel and Language)? What is good about our approach is that the simplifying assumptions are clearly stated and can be improved upon in subsequent analyses. I just wish that the historical linguistics crowd would try *constructive* rather than destructive criticism for a change. We want what you want: to determine what happened. So as we are all scientists, we should work towards common ground, shouldn’t we?

Try as we might, we find little to disagree with in this eloquent appeal for the use of computational techniques and interdisciplinary research. As Asya Pereltsvaig has emphasized, we respect the work of linguists who use such methods in their own research. We advance no objections to computational methods per se, but rather to this specific application. Successful modeling cannot rest on unsubstantiated and most likely false assumptions about language spread and diversification, cannot disdain verification efforts, cannot be inherently unfalsifiable, and cannot be consistently contradicted by the empirical record. Drummond is surely right that well-crafted mathematical models can be continually adjusted to better fit the reality that they seek to represent—but only if they rest on a solid foundation. Certainly the model under consideration could be sharpened, as has been suggested by another co-author, by incorporating elements of physical geography beyond the water/land dichotomy; such an improvement could weed out such blunders as having the Tocharians advance along 20,000-foot ridges while bypassing their eventual home in the Tarim basin. But as long as the model rests on the untenable assumption that languages spread through a contagion-like process and diverge in speciation-like events, the result will still be of little value. Subsequent posts will examine how languages do spread and change. As we shall see, such linguistic processes are vastly more complex than the scenarios posited by the Science team. That does not mean that they cannot be mathematically modeled, only that any such efforts will have to be much more involved than what we have seen thus far.

We therefore hope that Alexei Drummond will continue to apply his formidable skills to the problems of language spread and diversification. We also hope that in the future he can collaborate not merely with other modelers, scholars whose skill sets overlap to a great extent, but also with experts with complementary skills and frameworks of knowledge. In particular, such work must be done with a bona fide Indo-Europeanist; collaborators with proficiency in world history, geography, and linguistics more generally would also prove highly beneficial.

Although it is easy for us to dish out such advice, it would probably prove much more difficult for anyone to take it. As Drummond notes, it seems likely that many if not most historical linguists would rebuff any such invitations for collaboration. Here it becomes necessary for us to reverse our critical attention and apply it to historical linguistics itself. Although this series of posts seeks to vindicate the field, we are convinced that a successful defense of any beleaguered intellectual enterprise demands a self-critical* eye.

Historical linguistics is currently in crisis not only because of unsubstantiated attacks or the failure of others to appreciate its intellectual achievements; it is also languishing because its practitioners have failed to meet the challenges that they face. All told, they have remained too insular and too comfortable with their own research paradigms. Emphasizing, like good scientists, the narrow acquisition of knowledge along established research fronts, few members of the guild have been willing to stand back and address the larger implications of their own work for the study of human pre-history (and history), let alone offer edification for a general audience. By the same token, few historical linguists have collaborated extensively with scholars in other disciplines. It is no accident that the three best-known scholars in the debate on Indo-European origins are (or were) all archeologists: Marija Gimbutas, Colin Renfrew, and David Anthony.

Historical linguists might reply that progress in linguistic research demands tightly focused inquiry and highly specialized disciplinary techniques, and that they would thereby gain little through interdisciplinary collaboration. Such arguments make sense when applied to specific issues, but collapse when it comes to broader matters, such as the origin of the Indo-European family, which is as much a matter of history and geography as it is of linguistics. And regardless of whatever intellectual arguments can be made for highly focused specialization, pragmatic considerations call for a different approach; it is a fact that historical linguistics is a diminishing field that has been unable to fend off mass-media celebrations of encroachments on its own terrain. If their field is to survive, historical linguists must realize that they can no longer be satisfied merely by communicating with each other. They not only must engage more with other scholars, but they must also reach out to the educated public.

Our charge is perhaps not as difficult as it might seem. The public is deeply interested in such issues, as attested by the articles in the popular press on the Bouckaert et al. paper. Asya and I have discovered the same interest while teaching on the intersection of linguistics, history, and geography in Stanford University’s Continuing Studies (adult education) Program, where our classes are consistently among the most popular offerings. Although we would like to think that our teaching skills have something to do with our enrollment numbers, we realize that they stem largely from demand for instruction on a topic that many people find intrinsically fascinating. Next winter, we will be teaching a class specifically on the geo-history of the world’s major language families. But in looking for a text that draws together the major issues within a single, comprehensible framework, we find ourselves frustrated. The best work that we have located thus far is a 1994 Scientific American article entitled, “World Linguistic Diversity,” by none other than archeologist Colin Renfrew. It is unfortunately short and somewhat dated, and it is almost certainly wrong on such major issues as the origin of Indo-European and the existence of Altaic. We do find it odd, and rather sad, that no comparable work has, to our knowledge, been produced by a historical linguist.

* “Self-criticism” is not the best term here, as neither of us is a historical linguist. I am a historical geographer and Asya Pereltsvaig is a linguist who specializes in syntax. What we thus offer is perhaps best described as “friendly criticism.”


Why the Indo-European Debate Matters—And Matters Deeply

As expected, we have received a few complaints from friends, acquaintances, and Facebook-followers in regard to the current Indo-European series. “Why get so exercised over a single article,” some ask, reminding us that science is a self-correcting endeavor that will eventually winnow away the chaff. Others question the entire enterprise, wondering why we would care so much about such an obscure topic.

We agree that science is, in the long run, a self-correcting undertaking, which gives it vast power. But self-correction does not come automatically; it takes work, which we are happy to provide. And in the short term, counterfeit research can do great harm, as the Lysenko Affair in the Soviet Union so well demonstrated. We also find it deeply troubling that a nonsensical article would not only be accepted for publication in one of the world’s premier scientific journals, but would immediately be trumpeted in the mass media for “solving” one of the key mysteries of human pre-history. The episode uncovers a whiff of corruption in the scientific-journalistic establishment that needs a blast of fresh air.

In regard to the second set of complaints, we must reject them outright. The Indo-European issue is not obscure, trivial, or unrelated to pressing issues of our day. In fact, it is difficult to locate a single topic of historical debate that has been more ideologically fraught and politically laden over the past 150 years than that of Indo-European origin and expansion.

Indo-European studies took on a heavy ideological burden in the late 1800s, a development that would indirectly lead to the most hideous examples of genocide and mass-murder that the world has ever witnessed. The supposedly superior “Aryans” of Nazi mythology were none other than the speakers of Proto-Indo-European (PIE). Nazi propagandists conjured their own wildly off-base theories about I-E origins, but their fantasies had roots in the scholarly endeavors of German philologists. And while Nazism was militarily crushed and its ideological foundations pulverized, the movement refuses to die. Indeed, it seems to be experiencing something of a revival in eastern Germany, Hungary, and—of all places—Russia. On numerous occasions, I have found myself directed by Google to the odious “Stormfront” website while searching for images and ethnographic descriptions of various Eurasian ethnic groups. The Aryan myth also continues to feed racially troubling ideologies outside of Europe, particularly in Iran and northern India.

Even scholars who have sought to undermine the noxious notion of the Aryan Herrenvolk have occasionally generated their own benign but still fantasy-laden counter-narratives. The key figure here is the late Lithuanian-American archeologist Marija Gimbutas, noted for placing the I-E homeland in the Pontic Steppes. Gimbutas’s scientific research was solid, and we suspect that she was largely correct in locating the PIE homeland. But in seeking to turn the Nazi view on its head, she went too far—and some of her lay followers went much too far. In the feminist retelling of the tale that she inspired, the Aryans become the Kurgans, a uniquely violent, male-dominated people who destroyed the peaceful, gender-equitable if not matriarchal civilization of “Old Europe.” In Riane Eisler’s 1988 treatise, The Chalice and the Blade: Our History, Our Future, the Kurgan conquests are seen as ushering in a global age of male domination and mass violence. The work was a bestseller, blurbed by noted anthropologist Ashley Montagu as the “most important book since Darwin’s Origin of Species.”

Eisler’s global vision failed from the outset: as male domination characterized almost all historically known human societies, it cannot be attributed to a single ancient people located in one particular part of the Earth. Recent research has also tended to undermine many of her more specific claims. The Old Europeans were probably not as peaceful and female-centered as they had been portrayed, and the PIE speakers and their immediate descendants were probably not so insistently androcentric. Certainly the early Indo-European speakers were no strangers to violence and domination, but how do we account for the female Scythian skeletons from the Kurgan homeland tricked out in military gear? Perhaps Herodotus was on to something when he wrote of Amazon tribes in the area. More to the point, we now understand that the early Indo-European-speakers could not have simply invaded Old Europe and subjugated its inhabitants, as they lacked the state-level forms of military organization necessary for wide conquests. As Anthony shows so well in The Horse, the Wheel and Language, the process was almost certainly one of gradual incursions, marked by both social predation and mutualism, that allowed the militarily advantaged, semi-pastoral, equestrian I-E speakers to slowly spread their forms of speech. And while their languages did indeed expand over vast areas, they did not simply replace pre-existing tongues. Almost everywhere, older linguistic elements survived. Major non-I-E substrates characterize such I-E subfamilies as Germanic and Greek. A huge problem for both Nazi ideology and the Gimbutas/Eisler thesis is the fact that most of the Germanic root words pertaining to war are non-Indo-European. The mysteries here remain deep.

Considering the misuses to which the issue of I-E origins has been put, it is understandable that some people would want to reject the idea that the original speakers were war-like horse-riders from some remote, northern homeland. All such troublesome interpretations would vanish if I-E expansion could instead be linked to the gradual movement of simple farmers from the Near Eastern agricultural heartland into the sparsely settled lands of Mesolithic Europe. But if the evidence indicates otherwise, as it most assuredly does, the result is merely another myth. Scientific responsibility demands the search for truth, even if the truth leads into uncomfortable areas.

Regardless of the complications introduced by ideological distortions, investigations of I-E origins and expansion have a huge bearing on the study of human prehistory. Indo-European, after all, is by far the world’s largest language family when counted by the number of speakers. Linguistic evidence about the family’s spread tells us much of significance about the historical development of a vast section of the Earth’s surface over many centuries, even millennia. Studies of human prehistory depend crucially on three lines of evidence: those derived from archeological digs; from genetic studies; and from linguistics. Over the past decade, much progress has been made in bridging linguistic and archeological evidence, as demonstrated by David Anthony’s The Horse, the Wheel, and Language. To the extent that the burgeoning genetic investigations of Y- and mitochondrial DNA lineages can be incorporated into this linguistic-archeological nexus, a much richer understanding of the prehistoric human past awaits. For a path-breaking interdisciplinary foray into this territory, see Andrew Shryock and Daniel Lord Smail, Deep History: The Architecture of Past and Present.

Such developments, however, risk being cut short if the field of historical linguistics continues to languish. Further progress will depend not only on linguists carrying out their own research, but also on their passing down their knowledge and techniques to future generations of students. Such lines of intellectual transmission, however, are threatened by cutbacks in linguistics departments, as well as by the assaults on the field mounted by interlopers who have somehow managed to convince many scientists that linguistic evidence is of little account when it comes to studying the history of languages. To the extent that the Anatolian hypothesis gains ground among archeologists and geneticists on the basis of the recent Science article, our collective knowledge of the past will take a sharp step backwards.

The most troubling aspect of the affair, however, is not the threats that it poses but rather the revelations that it makes about the integrity of the scientific and journalistic establishments. A scholarly journal such as Science is duty-bound to vet any potential contribution through established experts. Yet I have a difficult time imagining that the article in question was subjected to proper peer-review through any qualified specialist in the field in which it sits: Indo-European historical linguistics. Either the article was never sent to a competent linguistics reviewer, or the resulting review was irresponsibly ignored. And yet this is not the first time that a preposterous article on historical linguistics has appeared in Science (and also in Nature), as we shall see in future posts. Have the editors of this august journal decided that the discipline of linguistics has somehow failed, and that its field of historical inquiry should therefore be handed over to epidemiologists and computational modelers? If so, on what possible grounds was this decision reached? Unless such questions can be answered, I have a difficult time avoiding the conclusion that the editors of Science have betrayed the basic canons of academic responsibility.

While contemplating these issues, I am continually reminded of the Sokal Hoax, an episode that revealed the vacuity of postmodernist literary theory and “science studies” in the mid-1990s. This affair came to my attention when I was participating in the conference on “The Flight from Science and Reason” organized by the New York Academy of Sciences. A rumor began to circulate among the attendees that a noted physicist and mathematician with solid leftist political credentials was perpetrating a prank that would debunk Social Text, perhaps the leading journal of poststructuralist theory, and in so doing deflate the pretension of those who sought to undermine science in the name of human liberation. Sokal’s article, entitled “Transgressing the Boundaries: Towards a Transformative Hermeneutics of Quantum Gravity,” argues that since science is merely a social construct, quantum gravity, especially as interpreted through the new-age lens of “morphogenetic fields,” can have progressive implications for political action. The paper was accepted and duly published, despite the fact that it was, as its author soon admitted, “a pastiche of Left-wing cant, fawning references, grandiose quotations, and outright nonsense . . . structured around the silliest quotations [by postmodernist academics] he could find about mathematics and physics.” Sokal designed the hoax as a kind of test of the allegations made by Paul Gross and Norman Levitt in their book Higher Superstition: The Academic Left and Its Quarrels With Science. As he discovered, even the most palpable nonsense imaginable could be published in Social Text so long as it sounded good and flattered the editors’ ideological preconceptions.

While the Sokal Affair was a purposive hoax, the members of the Bouckaert team evidently believe that their article constitutes a contribution to knowledge. But what the authors think about their own work is of no significance, as the arguments they make must stand on their own. Had Alan Sokal actually believed that the “construction” of quantum gravity could be a politically progressive act, would his article have been any less nonsensical? The current authors have thus perpetrated an unwitting hoax, but the end results should be no less embarrassing for the editors of Science than the Sokal Affair was for those of Social Text. Bouckaert et al. begin by improperly framing the problem, and then go on to err at every turn. It is not so much that the article’s conclusions are incorrect, but rather that every assumption it makes, every technique it employs, and virtually every “fact” that it marshals is either incorrect, inappropriate, or misleading. Yet this work was published in one of the world’s most prestigious scientific journals. Something here smells rather fishy.

But if the mere publication of the article in Science raises questions about intellectual integrity, its immediate celebration in the pages of the New York Times points to a deeper mire. Science publishes hundreds of articles each year, a tiny fraction of which are ever mentioned in the New York Times, let alone showcased in the newspaper’s main section. Yet the Times has gone out of its way on more than one occasion to trumpet “contributions” to linguistic history from members of the Bouckaert team, specifically Quentin Atkinson. Evidently, the editors of the supposed newspaper-of-record in the United States have concluded that the work of these scholars constitutes one of the most important scientific stories of the past decade. On what possible basis could such an assessment have been rationally made?

Journalists, like academics, are expected to adhere to certain standards of professional behavior. Unless they are writing for the editorial pages or are explicitly employed in “advocacy journalism,” reporters are expected to remain as objective as possible, not letting their own interests, political predilections, or friendship and kin networks direct their work. Such guidelines are impossible to follow to the letter, and as a result complete objectivity is a mere ideal. But such an ideal is still supposed to influence behavior in self-respecting media outlets, eliminating the excesses of partisanship. In the present case, however, all such ethical fetters seem to have been removed. Nicholas Wade’s reporting on this issue has been non-objective in the extreme. One can only speculate as to why Wade has been determined to act as Quentin Atkinson’s pocket journalist, ever ready to proclaim his latest clumsy foray into linguistics as a scientific breakthrough on par with plate tectonics.

To appreciate the level of corruption revealed by the Bouckaert Affair, imagine that a parallel series of events occurred in a different walk of life, such as business. Imagine, for example, that an established financial firm with a reasonably good reputation decided to apply its mathematical models to an unrelated business, one in which both the leaders and employees of the company had no experience. Being ignorant of their new field, they made a number of naïve and ultimately untenable assumptions about how it operates, and thus when they applied their favored methods, unexpected breakdowns occurred. Soon the firm began to hemorrhage money. But rather than admit to their failure, the managers instead crowed about their success, hiding their mounting losses in misleading accounting sheets and obscurely written reports. But even as the company began to collapse, its reputation strengthened and its stock-market valuation rose. Such gains, it turns out, stemmed from glowing reports on its new venture in the business media, most notably the New York Times. The most substantive Times piece on the venture appeared not in the paper’s business pages, but in its main news section, gaining it a particularly wide readership. The fact that it was written by the former editor of its business section, a person widely regarded as one of the country’s leading economic journalists, helped propel the story. For a while, it appeared as if the firm could do no wrong. And then …

In the world of commerce, such a story would end with the quick death of the firm, as well as that of its business model. To the extent that any company making consistent losses will eventually fail, business, like science, is a self-correcting enterprise. Failure in business, however, is generally more pressing than it is in science, as rather more money and power are typically at stake. Intrinsic error can linger in science for decades, as demonstrated by the prolonged resistance of geologists to the ever-mounting evidence for continental drift. In a field as marginal as Indo-European studies, well-funded pseudo-scientific works could withstand invalidation by under-funded scholars for many years. In the popular imagination, moreover, erroneous ideas can escape correction altogether, lodging so firmly as to be all but irremovable by evidence. Examples include the widely known non-facts that the Eskimo languages have a multitude of words for snow, and that Europeans before Columbus thought that the world was flat. The Indo-European Affair, in short, matters, and matters deeply. I find it cause for deep concern, and as a result I will continue to write about it.

But after one more post, the current series on Indo-European origins will go on hiatus for a few weeks. Both Asya and I must travel for a short period, so blogging in general will be light for the next week or so.


Quentin Atkinson’s Nonsensical Maps of Indo-European Expansion

The website that accompanies “Mapping the Origins and Expansion of the Indo-European Language Family” (August 24 Science), maintained by co-author Quentin D. Atkinson, proudly features several maps that allow the easy visualization of the patterns generated by the model. One is a conventional map that purports to show “language expansion in time and space,” depicting and dating the spread of Indo-European languages through a red-to-blue color scheme. The other cartographic product is a sequence of numerous map-frames that ostensibly shows Indo-European (I-E) expansion from the seventh millennium BCE to 1974 CE. This Google-Earth-based animated map, or “movie,” as Atkinson calls it, is explained in terms that are at once simplistic and cryptic:

Watch the Indo-European expansion unfold. This movie shows how our model reconstructs the expansion of the Indo-European languages through time. Contours on the map represent the 95% highest posterior density distribution for the range of Indo-European.

The analysis that I provide below takes these maps on their own terms, as advertised: as if, in other words, they indicate what Atkinson and his colleagues believe to be the “unfolding” of the Indo-European language family in “time and space” as substantiated by their mathematical model. But if one reads the fine print found elsewhere, one discovers that the maps are not actually what they purport to be. The authors admit up front that these figures deliver incorrect information, owing to the fact that crucial pieces of data were excluded from the model:

This figure needs to be interpreted with the caveat that we can only represent the geographic extent corresponding to language divergence events, and only between those languages that are in our sample. The rapid expansion of a single language and nodes associated with branches not represented in our sample will not be reflected in this figure. For example, the lack of Continental Celtic variants in our sample means we miss the Celtic incursion into Iberia and instead infer a later arrival into the Iberian Peninsula associated with the break-up of the Romance languages (and not the initial rapid expansion of Latin). The timing represented here therefore offers a minimum age for expansion into a given area.

This admission is extraordinary, as it amounts to saying that “even though our data set is too incomplete to produce accurate results, our model should nonetheless be regarded as powerful enough to settle the most highly debated topic in historical linguistics,” and that “even though we make no claims as to the earliest dates in which Indo-European languages were established in any given area, our approach still shows that the language family originated in Anatolia.” I do not think that I have ever encountered a more flagrant example of “having one’s cake and eating it too” in an academic work. In fact, as is demonstrated in a previous discussion thread that is reproduced below, the “caveat” itself errs at virtually every turn.*

In a comment on the previous post, co-author Alexei Drummond framed the study’s limitations in more direct language:

Our geographical reconstructions are only for the language lineages that are direct ancestors of the particular sample of IE languages we analyzed. Our inferred geographic distributions don’t say anything about the full extent of IE languages at any time past or present.

If the geographic patterns depicted on the maps say nothing about the “full extent” of I-E languages “at any time,” why are viewers of the animation invited to “watch the Indo-European expansion unfold”? The claim is evidently inherently misleading. But as we shall see below, the problems run much deeper, as in numerous instances the maps fail to accurately show the partial extent of I-E languages. But before delving into such specificities, a few words about the mapping project in general are in order.

Many problems plague the authors’ cartographic depictions. The two maps, static and animated, fail to correspond in their details, often in a glaring manner. The animated map, moreover, lacks anything approaching a key, and hence is difficult to interpret. The temporal framing of the two maps is oddly displaced, as the “movie” purports to take the story up to 1974 CE, whereas the static map terminates at roughly 1800 CE. Potentially confusing is the fact that the static map gives dates in “BP,” or “before present” (which by convention means prior to 1950 CE), whereas the animated map uses the historically Christian calendar. Both maps, it is essential to note, show only the expansion and not the contraction of Indo-European, although this essential feature also goes unmentioned. Areas that ceased to be Indo-European speaking centuries ago, such as the supposed Anatolian heartland, continue to be shaded as I-E throughout the animation.

Although the contours mentioned in the “explanation” of the animated map are visible in the greenish shading, the overall coloration scheme remains vague. As the animation unfolds, the hypothesized I-E homeland circa 6500 BCE (Anatolia, the Caucasus, the northern Middle East, and the greater Aegean) is washed in yellow, whereas later geographical additions to the realm appear in shades of green. Yet at approximately 2225 BCE, most of the heartland abruptly turns green as well, with the exception of a swath extending from Cyprus through what is now Lebanon to central Iraq and two areas on either side of the Black Sea. Another such abrupt color switch occurs later in the animation.

Also unspecified are the thick green lines, which begin as a several-pixel splotch at roughly 6200 BCE that gyrates in place for about 1,500 years before spreading across the map to form a web. An unwary reader might assume that such lines indicate pathways of migration, but he or she would be mistaken, as movement along specific corridors defies the underlying diffusional model, which postulates gradual expansion along broad fronts with scattered outliers pushing into new territories. The lines actually indicate supposed examples of family-level linguistic divergence. Such relational links often extend into areas that are not shaded as I-E; note, for example, the green lines pushing into unmarked western Russia and northern Sweden on the first map. A naïve reader might wrongly assume that such extensions signal relatively recent movement, with little actual settlement to date.

As mentioned above, the static map and its animated companion do not correspond well. Unlike the animated version, the conventional map shows Corsica, the Balearic Islands, Crete, and Cyprus, for example, as never having been occupied by Indo-European speakers. The animation, to the contrary, puts Cyprus in the initial I-E homeland in the seventh millennium BCE. (Both depictions of the island are incorrect; the first known language of Cyprus, non-I-E Eteocypriot, was supplanted by the Greek (I-E) dialect of Arcadocypriot in the late Bronze Age.) Also notable is the static map’s depiction of Indo-European occupation in areas unmarked on the animated map, including western Norway and western Russia. (Neither map manages to show northern Norway as ever having been occupied by Indo-European-speakers.)

Although the discrepancies between the two maps are never explained, a few of them might be deduced. Consider, for example, the different treatments of western Russia in the maps posted here. In the animated depiction of 1974, only a small portion of this region is shaded as ever having been I-E speaking, yet the static map shows a sizable area as having become largely Indo-European over the past 500 to 1,000 years. This map depicts the distribution of I-E languages in western Russia with discontinuous blotches, seemingly placed at random, which would apparently indicate that the language family spread into this area in a spatially sporadic manner and never managed to fill in the gaps. On the basis of this particular disparity, one might assume that only areas of (supposedly) continuous I-E occupation receive shading on the animated map frames. But if this is indeed the case, the guideline is apparently reversed elsewhere. Note that sizable portions of Central Asia are similarly splotched on the static map, yet are shaded on the animated map. The area that now constitutes Kyrgyzstan is fully shaded on one map, yet remains almost entirely blank on the other. A swath across what are now Syria and Iraq is blobbed red on the static map, apparently indicating partial I-E expansion in the Neolithic, yet is blanketed with yellow on the animated map from the earliest frames. Cartographic consistency is evidently not high on the authors’ agenda.

Far more troubling than disparities between the two maps, however, are inconsistencies between both of them and the historical record. Overall, the fit between the modeled spread of I-E languages and what we know of its actual expansion is poor. In pointing out some of the more flagrant errors, I will begin at the end of the “movie,” which shows the accumulated spread of I-E languages to 1974 CE, contrasting it with the depictions on the static map. I will subsequently work backward in time on the “historically unfolding” movie, pointing out crucial errors for several particular periods. To reiterate, I will consider what the maps literally show, ignoring for the most part their hidden meanings.

As mentioned in the previous post, the most obvious blunder in the 1974 depiction is the omission of Russia and Eastern Ukraine from the Indo-European-speaking realm. On the final map frame, the only parts of Russia that are shaded are the Pskov district, the far southern Crimea, and the largely non-I-E-speaking northern Caucasus. The same map also fails to mark other areas long characterized by I-E speech, such as southern Iberia, Balochistan, southern Sri Lanka, and Orissa in eastern India. The static map, however, does successfully mark most of these places as I-E speaking, yet conversely errs in placing several non- (and never-) I-E-speaking areas in the Indo-European zone, such as northeastern Sri Lanka as well as Manipur and environs in northeastern India. Unlike the animation, this map does show I-E in Western Russia, but only in the past 1,000 to 1,500 years, as discontinuous as late as 1800 CE, and as disappearing entirely in far western Siberia. Such depictions, needless to say, are erroneous; although pockets of Uralic languages persist to the present in eastern European Russia and Western Siberia, the bulk of the region was solidly Russian speaking well before the termination date of 1974. Compounding such errors is the sprinkling of bluish dots in southern Tibet, northern Nepal, and northwestern Burma. Some of the most inhospitable parts of the central Sahara are also vaguely marked with blue to show I-E expansion over the past millennium.

The static map is, in a word, preposterous. What possible Indo-European language could ever have been spoken in the Kachin uplands of Burma over the past 1,000 years, much less in essentially uninhabited areas of the Tibetan Plateau and the Sahara Desert? Note as well that northern Tunisia and northeastern Algeria are clearly marked as having been substantially I-E speaking in recent centuries. At first glance, I wondered whether the authors were trying to show the spread of Latin in this region under the Roman Empire; if so, the coloration is wrong, as blue indicates I-E expansion in the past 1,000 years. But as we have seen, Latin does not count in Atkinson’s scheme, as it supposedly spread too quickly as an individual language (it actually spread quite slowly here; non-I-E Punic continued to be spoken in the region as a minority language up to Augustine’s time). As it happens, however, the blue splotches around Tunis do not indicate anything nearly so specific. Rather, like the light red blobs in central Arabia, they merely show that the model occasionally spits out randomly (and incorrectly) placed outliers at some remove from main areas of Indo-European speech.

Other inaccuracies abound on the static map, including incomplete I-E occupation at the termination date (1974) in western France, Andalucía (but not in Spain’s Basque Country!), and northeastern Scotland, as well as a complete absence of the language family from Gotland in the Baltic along with the previously mentioned Mediterranean islands. The map seems to show that Indo-European languages have never quite reached the Atlantic, although of course the authors would likely counter that the map does not actually depict what it claims to depict. Or compare the model’s portrayal of non-I-E-speaking areas in Fennoscandia with an actual language map of the region, as can be seen to the left. The fit is poor.

The Fennoscandia map detail also presents evidence that contemporary geopolitical boundaries anachronistically mold the hypothesized language-family distribution in the Science model. As can be seen on the actual language map, linguistic and political boundaries do not correspond particularly well in this area; Estonia and Finland may be non-I-E-speaking countries, but not over their entire expanses. On Atkinson’s map, however, I-E coloring abruptly and transhistorically ends exactly at the modern Estonian border, a most suspicious situation. The general lack of I-E shading for Moldova also makes me wary—and is completely bizarre. A clear example of contemporary geopolitical contamination is found in the portrayal of Central Asia. Note the salient of solid I-E coloration extending northward into Tajikistan’s portion of the Fergana Valley, avoiding the core of the valley held by Uzbekistan. Such a portrayal would be understandable if the map depicted merely present-day conditions, as Tajikistan is mostly I-E-speaking whereas Uzbekistan is not. But the sorting of “Sarts” into Uzbeks and Tajiks, along with the forced “Uzbekization” of many previously Persian speakers, in this historically heavily bilingual area is largely the product of Soviet geo-ethnic machinations. If one delves back to the first millennium CE and earlier, the entire region was heavily I-E-speaking (Sogdian and other Iranian languages).


As one dials back the animated map to earlier periods, the mire only deepens. As it would be too tedious to recount all of the map’s many miscues, I will focus on a few particular time slices.

Consider, for example, the depiction of western Europe circa 1000 CE. At this time, western France, Sicily, and the entire Iberian Peninsula are shown as non-I-E-speaking, although a line of I-E linguistic relationship has been etched across southern France roughly to the Spanish border at the crest of the Pyrenees. The false implications conveyed here—which are fully admitted as erroneous by the authors—are that Roman Hispania and Aquitania were never Latinized, and that the preexisting Celtiberian and Gaulish tongues were not I-E. The same 1000 CE map frame also incorrectly excludes from the I-E realm the South Asian areas that now constitute southern Gujarat, southern Balochistan, most of Maharashtra, and southern Sri Lanka. Note as well that most Norse areas are not given an I-E shading, nor is northern Scotland. Yet at the same time, southern Tibet is placed within the I-E zone! Even the essentially uninhabited and uninhabitable region of Aksai Chin is depicted as Indo-European-speaking at this time; I can’t help but imagine proto-Dardic speaking yetis.

Turn back to the portrayal of the year 18 BCE, and the errors compound. The most conspicuous I-E omission here is the Scythian/Sarmatian realm, which by itself is enough to discredit the model; it almost seems as if the authors intentionally manipulated their data to exclude the linguistically hypothesized steppe homeland of the I-E family. The northeastern salient of I-E languages depicted for the time, which denotes the Tocharian languages, oddly excludes a significant portion of the Tocharian homeland in the Tarim Basin to focus instead on the lofty Tien Shan Mountains. Tellingly, the diffusional front hypothesized here has the ancestors of the Tocharians advancing along ridges well in excess of 20,000 feet in elevation.

Several nice examples of demonstrably false information are found on the depiction of the Mediterranean Basin circa 700 BCE. Here we see the greater Aegean along with the Italian Peninsula clearly colored as I-E, but with little else falling in the same category; Sicily, most of Sardinia, and most of the littoral zone of southern France and eastern Iberia are excluded. Yet we have incontrovertible knowledge that Greek-speaking colonies had been firmly planted in eastern Sicily, Cyrenaica in North Africa, and over a large expanse of the northwestern Mediterranean coastlands. The spread of the Greek language to Crete, moreover, occurred much earlier, as attested by the Bronze Age Linear B script. The model fails here in part because it does not count the “rapid” spread of individual languages; Greek colonization, however, took place over hundreds of years, and some of the dialects of ancient Greek were differentiated enough to be classifiable as separate languages.


While the 700 BCE map frame unduly restricts the spread of I-E over much of the Mediterranean, it also improperly extends it in other parts of the basin. Several relatively well-known non-I-E languages persisted in the map’s “green zone” well beyond 700 BCE. On the island of Lemnos, the non-I-E Lemnian language vanished only with the Athenian conquest in the fifth century BCE, while Etruscan and Raetic survived into the first millennium CE. Together, Lemnian, Etruscan, and Raetic seem to have constituted the extinct Tyrsenian language family, which might have included Minoan (Eteocretan) and Eteocypriot as well. The scattered distribution of this family in antiquity probably signals that Tyrsenian languages had blanketed a much broader area before the incursion of I-E speakers. In the Science model, however, the entire Aegean region is mapped as I-E speaking as early as 6500 BCE.  Are we to imagine a post-I-E migration of Tyrsenian speakers into the Aegean from Etruscan- or Raetic-speaking areas further to the west? Yet historians who have viewed the Tyrsenian Etruscans as non-indigenous have instead tended to locate their homeland in Anatolia, the hearth of I-E in the Science model! Today, however, a near consensus has emerged that the Tyrsenian languages represent a pre-I-E substrate that likely extended across much of the northeastern Mediterranean in the fifth millennium BCE, if not significantly later as well.

Finally, consider the depiction of supposedly I-E-speaking “greater Anatolia” (including what is now Syria and northern Iraq as well as the Caucasus) in the Bronze Age, circa 1500 BCE. Yet we have unassailable historical evidence of widely spread non-I-E languages over much of the region at this time, including Hurrian, Hattic, and, for a somewhat later period, Urartian. Much evidence suggests, moreover, that the three (or perhaps four) extant Caucasian language families covered much broader swaths of land in ancient times than they do today; modern Azerbaijan, for example, was a largely NE-Caucasian-speaking area, as attested by both historical sources and the extant language of Udi. For the Science model to make sense, later migrations of several different non-I-E groups would have had to push through long-inhabited I-E lowlands to settle in inhospitable areas of mountainous terrain. Such a scenario, to say the least, strains credulity.

*. Let us consider here the various elements of the authors’ “caveat”:

1. “we can only represent the geographic extent corresponding to language divergence events.” Do languages really diverge in discrete events? Does not language divergence happen continually? Whenever one segment of a language community adopts a new word, a new sound, or a new grammatical feature, some degree of divergence has occurred. It is always an open question as to when diverging dialects become separate languages; in the modern world, the issue is more political than linguistic (cf. Serbo-Croatian, Serbian, Croatian, Bosnian, and Montenegrin).

2. “only between those languages that are in our sample.” That is interesting, seeing as Atkinson claims in an interview (to be cited later) that “all” I-E languages were included (an impossibility, as there are no hard and fast divisions between languages and dialects). But more to the point, if one can simply exclude languages at will from the sample, then one can mold the results. Drop a few more languages, and the maps will differ. In such a manner, one can get the results that one wants.

3. “nodes associated with branches not represented in our sample will not be reflected in this figure.” Yes indeed, which is one reason why the figures are so spectacularly wrong.

4. “the lack of Continental Celtic variants in our sample means we miss the Celtic incursion into Iberia and instead infer a later arrival into the Iberian peninsular…” I am glad that the authors begin to acknowledge their own errors here, but they still do not go far enough; they do make an inference, and that inference is simply incorrect. They also miss not just Celtiberian and Latin, but also Mozarabic, Ladino, and several other I-E languages of the Iberian Peninsula (the map frame for 1000 CE still shows only partial I-E coverage).

5. “associated with the break-up of the Romance languages.”  The model assumes that Latin began to “break-up” with the fall of the Western Roman Empire.  That is incorrect, as divergence began much earlier. The “vulgar” Latin of the distant provinces was not the language of Cicero.

6. “not the initial rapid expansion of Latin.” Latin did indeed expand rapidly as a language of administration, but not necessarily as a language of everyday use. Basque remained in use throughout, although the maps produced by the study indicate otherwise.

7. “The timing represented here therefore offers a minimum age for expansion into a given area.” This proviso is particularly rich, as it alone undermines the approach. In other words, I-E languages could have been found in any part of the study area at much earlier times than indicated? If so, how can one pinpoint Anatolia as the place of origin? If one claims to “find” a location of origin, then one is automatically making an argument for “maximum ages” in areas that fall outside that supposed birthplace.

Mismodeling Indo-European Origin and Expansion: Bouckaert, Atkinson, Wade and the Assault on Historical Linguistics

Dear Readers,

As GeoCurrents passed through its August slowdown, plans were made for a series on the Summer Olympics. Thanks to the efforts of Chris Kremer, we have gathered statistics—and made maps—relating Olympic medal count by country to population and GDP, both overall and in regard to specific categories of competition. The series, however, has been put on hold by the recent publication of two heralded articles on the history and geography of the Indo-European language family. On August 24, a short piece in Science—“Mapping the Origins and Expansion of the Indo-European Language Family”—made extravagant claims, purporting to overturn the most influential historical-linguistic account of the world’s most widespread language family. On the same day, Nicholas Wade, noted New York Times science reporter, wrote a half-page spread in the news section of the Times on the Science report, entitled “Family Tree of Languages Has Roots in Anatolia, Biologists Say.” Over the next few days, the story was picked up—and often twisted in the process—by assorted journalists. Within a few days, headlines appeared as preposterous as “English Language Originated in Turkey.”

As Wade’s title indicates, the Science article, written by Remco Bouckaert and eight others (most notably Quentin D. Atkinson), seeks to overturn the thesis that the Indo-European (I-E) family originated north of the Black and Caspian seas. It instead locates the I-E heartland in what is now Turkey, supporting the “Anatolian” thesis advanced a generation ago by archeologist Colin Renfrew. The Science team bases its claims on mathematical grounds, using techniques derived from evolutionary biology and epidemiology to draw linguistic family trees and model the geographical spread of language groups. According to Wade, the authors claim that their study does nothing less than “solve” a “long-standing problem in archaeology: the origin of the Indo-European family of languages.” (Strictly speaking, however, the problem is not an archaeological one, as excavations by themselves tell us nothing about the languages of non-literate peoples; it is rather a linguistic problem with major bearing on prehistory more generally.)

As GeoCurrents is deeply interested in the intersection of language, geography, and history, the two articles immediately grabbed our attention. Our initial response was one of profound skepticism, as it hardly seemed likely that a single mathematical study could “solve” one of the most carefully examined conundrums of the distant human past. Recent work in both linguistics and archeology, moreover, has tended against the Anatolian hypothesis, placing Indo-European origins in the steppe and parkland zone of what is now Ukraine, southwest Russia, and environs. The massive literature on the subject was exhaustively weighed as recently as 2007 by David W. Anthony in his magisterial study, The Horse, the Wheel, and Language: How Bronze-Age Riders from the Eurasian Steppes Shaped the Modern World. Could such a brief article as that of Bouckaert et al. really overturn Anthony’s profound syntheses so easily?

The more we examined the articles in question, the more our reservations deepened. In the Science piece, the painstaking work of generations of historical linguists who have rigorously examined Indo-European origins and expansion is shrugged off as if it were of no account, even though the study itself rests entirely on the taken-for-granted work of linguists in establishing relations among languages based on words of common descent (cognates). In Wade’s New York Times article, contending accounts and lines of evidence are mentioned, but in a casual and slipshod manner. More problematic are the graphics offered by Bouckaert and company. The linguistic family trees generated by their model are clearly wrong, as we shall see in forthcoming posts. And on the website that accompanies the article, an animated map (“movie,” according to its creators) of Indo-European expansion is so error-riddled as to be amusing, and the conventional map on the same site is almost as bad. Mathematically intricate though it may be, the model employed by the authors nonetheless churns out demonstrably false information.

Failing the most basic tests of verification, the Bouckaert article typifies the kind of undue reductionism that sometimes gives scientific excursions into human history and behavior a bad name, based on the belief that a few key concepts linked to clever techniques can allow one to side-step complexity, promising mathematically elegant short-cuts to knowledge. While purporting to offer a truly scientific* approach, Bouckaert et al. actually forward an example of scientism, or the inappropriate and overweening application of specific scientific techniques to problems that lie beyond their own purview.

The Science article stakes its claim to scientific standing in a straightforward but unconvincing manner. The authors contend that as two theories of Indo-European (I-E) origin vie for acceptance, a geo-mathematical analysis based on established linguistic and historical data can show which one is correct. Actually, many theories of I-E origin have been proposed over the years, most of which (including the Anatolian hypothesis) have been rejected by most specialists on empirical grounds. Establishing the firm numerical base necessary for an all-encompassing mathematical analysis of splitting and spreading languages is, moreover, all but impossible. The list of basic cognates found among Indo-European languages is not settled, nor is the actual enumeration of separate I-E languages, and the timing of the branching of the linguistic tree remains controversial as well. As a result of such uncertainties, errors can easily accumulate and compound, undermining the approach.

The scientific failings of the Bouckaert et al. article, however, go much deeper than mere data uncertainty. The study rests on unexamined postulates about language spread, assuming that the process works through simple spatial diffusion in much the same way as a virus spreads from organism to organism. Such a hypothesis is intriguing, but must be regarded as a proposition rather than a given, as it does not rest on a foundation of evidence. The scientific method calls for all such assumptions to be put to the test. One can easily do so in this instance. One could, for example, mathematically model the hypothesized diffusion of Indo-European languages for historical periods in which we have firm linguistic-geographical information to see if the predicted patterns conform to those of the real world. If they do not, one could only conclude that the approach fails. Such failure could stem either from the fact that the data used are too incomplete and compromised to be of value (garbage in/garbage out), or from a more general collapse of the diffusional model. Either possibility would invalidate the Science article.
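To make the logic of such a test concrete, here is a minimal sketch in Python of what a back-check against the historical record might look like. It is only an illustration of the idea, not the authors' method: the function names are hypothetical, and the region lists are toy stand-ins assembled for the example rather than data drawn from the Science study.

```python
# Toy back-check: compare the regions a model marks as I-E-speaking
# with the regions attested as I-E-speaking at the same date.
# All names and region lists are illustrative assumptions,
# not data taken from the Science study.

def hindcast_errors(predicted, attested):
    """Return (omissions, false_inclusions) as sorted lists of regions."""
    omissions = sorted(attested - predicted)         # attested but unmarked
    false_inclusions = sorted(predicted - attested)  # marked but unattested
    return omissions, false_inclusions

def agreement(predicted, attested):
    """Jaccard index: regions in both sets divided by regions in either."""
    union = predicted | attested
    return len(predicted & attested) / len(union) if union else 1.0

# Hypothetical 1974 snapshot.
attested_1974 = {"Russia", "eastern Ukraine", "Italy", "Iran"}
predicted_1974 = {"Italy", "Iran", "southern Tibet"}

omitted, spurious = hindcast_errors(predicted_1974, attested_1974)
print("Omitted:", omitted)
print("Spurious:", spurious)
print("Agreement:", agreement(predicted_1974, attested_1974))
```

A low agreement score, or a long list of omissions and spurious inclusions for a period in which the linguistic geography is well documented, would count as evidence against the model under this kind of check.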

Such a study, it turns out, has been conducted, and by none other than Bouckaert et al. in the Science article in question. Their model not only looks back 8,500 years into the past, when the locations and relations of language families are only conjectured, but also comes up to the near present (1974), when such matters are well known. Here a single glance at their maps reveals the failure of their entire project, as they depict eastern Ukraine and almost all of Russia as never having been occupied by Indo-European speakers. Are we to believe that Russian and Ukrainian are not I-E languages? Or perhaps that Russian and Ukrainian speakers do not actually live in Russia and Ukraine? By the same token, are we to conclude that the Scythian languages of antiquity were not I-E? Or perhaps that the Scythians did not actually live in Scythia? And these are by no means the only instances of the study invalidating itself, as we shall soon demonstrate. An honest scientific report would have admitted as much, yet that of Bouckaert et al. instead trumpets its own success. How could that possibly be?

One can only speculate as to why the authors proved incapable of noting the failure of their model to mirror reality. Did they neglect to look at their own maps, trusting that the underlying equations were so powerful that they would automatically deliver? Could their faith in their model trump their concern for empirical evidence? Or could it be that their knowledge of linguistic geography is so scanty that they do not grasp the distribution of the Russian language, much less that of Scythian? If so, they are not operating at an acceptable undergraduate level of geo-historical knowledge. Alternatively, the authors might be aware that their model generates nonsense, but prefer to pretend otherwise, hoping to buffalo the broader scholarly community. They seem, after all, to conceal their approach as much as possible, couching their “findings” in jargon-ridden prose that proves a challenge not just for lay readers but also for specialists in neighboring subfields. (Translations of such passages as “Contours on the map represent the 95% highest posterior density distribution for the range of Indo-European” will be forthcoming.)

Regardless of whether the authors are intentionally trying to mislead the public or have simply succeeded in fooling themselves, their work approaches scientific malpractice. Science ultimately demands empirical verification, and here the project fails miserably. If generating scads of false information does not falsify the model, what possibly could? Non-falsifiable claims are, of course, non-scientific claims. The end result is a grotesquely rationalistic and hence ultimately irrational approach to the human past. As such, examining the claims made by the Science team becomes an example of what my colleagues Robert Proctor and Londa Schiebinger have aptly deemed “agnotology,” or “the study of culturally induced ignorance or doubt, particularly the publication of inaccurate or misleading scientific data.”

As the critique we offer is harsh and encompassing, GeoCurrents will devote a number of posts to examining in detail the claims made and techniques employed by Bouckaert, Atkinson, and their colleagues. But before delving into the nitty-gritty, a few words are in order about what ultimately lies at stake. We are exercised about the Science article not merely because of our passion for the seemingly esoteric issue of Indo-European origins, but also because we fear for the future of historical linguistics—and history more generally. The Bouckaert study, coupled with the mass-media celebration of the misinformation that it presents, constitutes an assault on a field that has generated an extraordinary body of rigorously derived information about the human past. Such an attack occurs at an unfortunate moment, as historical linguistics is already in crisis. Linguistics departments have been cutting positions in historical inquiry for some time, creating an environment in which even the best young scholars in the field are often unable to obtain academic positions.

The devaluation of historical linguistics is merely one aspect of a much larger shift away from the study of the past. Subdisciplines such as historical geography and historical sociology have been diminishing for decades, and even the discipline of history faces declining enrollments and reduced faculty slots. Academic history itself, moreover, has been progressively shying away from the deeper reaches of the human past to focus on modern if not recent historical processes. Such developments do not bode well for the maintenance of an educated public. At the risk of descending into hyperbole, we do worry about the emergence of something approaching institutionally produced societal dementia. The past matters, and we care deeply for the preservation of its study.

*Make no mistake: we at GeoCurrents are strong supporters of the scientific method. Linguistics is itself a logically constituted, rigorous endeavor that counts as a science in the larger sense of the word, and I have myself co-edited a work defending science and reason against eco-radical and other far-left attacks (The Flight from Science and Reason, edited by Paul R. Gross, Norman Levitt, and Martin W. Lewis. 1997. New York Academy of Sciences).


Geographical Illiteracy in Civilization V

Since 1991, the Civilization series of computer games has been the best product on offer for the historically or geographically inclined gamer. The latest incarnation of the game, Civilization V, features dozens of unique playable “civilizations” that include broad linguistic or ethnic groups like the Celts and Polynesians, long-gone empires like Babylonia and Carthage, and modern states like the Netherlands. Each civilization has unique elements such as a leader (e.g., Boudicca or Nebuchadnezzar II) and a distinct play style that help it to achieve one of several victory conditions. The game also features innumerable scenarios, both official and fan-made, that allow players to immerse themselves in—and attempt to alter—historical events like the fall of the Western Roman Empire or the Japanese invasion of Korea. While the game is generally excellent, there are a number of historical and geographic mistakes in its ubiquitous loading-screen maps that are shown to well over one million unwitting pairs of eyes, and are thus worthy of correction.

When loading a game, players are presented with a map of their chosen civilization’s territory at the time the leader chosen for the game held power. Sometimes the game’s artists simply get carried away, showing Attila’s Hunnic Empire (if one can call it that) controlling all of Denmark. Most likely Attila, a horse-riding nomad who never went further north than modern Cologne, didn’t even know such a place existed. A similar problem concerns the map of the Maya, which implies that Pacal the Great ruled all of the Mayan city-states in the 7th Century C.E. rather than just Palenque and its immediate hinterland. In contrast, the realm of Harald Bluetooth, a 10th Century Danish King and the namesake of the wireless technology, actually appears to be understated. Denmark at the time controlled much of Scandinavia, a fact not represented by the map.

A rather humorous error concerns the distinction between the ancient Egyptian city of Thebes and the ancient Greek city of Thebes. In the map of Ramses II’s Egypt in the 13th Century B.C.E., Thebes, Egypt is not included. Rather, Thebes appears in Greece, a place that would not see an actual city named Thebes for several hundred years. On the map of Greece during the time of Alexander the Great, the artists have another chance to get the Thebes question right, but alas they fail once again. This time, Thebes, Egypt is shown while Thebes, Greece (arguably the most important Greek city at that time) disappears.

One of the most elegant features of Civilization V is the experience of negotiating with other leaders who speak in their native languages. For example, the game’s Hiawatha simulation speaks to the player in Mohawk, and the Theodora simulation speaks to the player in Medieval Greek. Designers even gave long-dead languages a shot, having Nebuchadnezzar II speak Akkadian. Nevertheless, it is somewhat disappointing to see Ramses II speaking modern Arabic when Middle and Late Egyptian are relatively well known. Languages also help highlight the incongruous nature of some “civilizations,” such as that of the Celts. The game’s Celtic leader, Boudicca, ruled an ancient tribe known as the Iceni in what is now Norfolk in Eastern England. The game’s Boudicca speaks modern Welsh, and then goes ahead and builds a capital city named Edinburgh.

Despite its many small mistakes and a one-size-fits-all definition of “civilization” that forces pretty much every kind of human grouping into the nation-state framework, Civilization remains a fantastic diversion with this author’s highest recommendation. Here’s hoping that the artists for Civilization VI spend a few more minutes on Google before drawing their maps.




Visualizing California’s Soggy Past

A previous GeoNote highlighted a collaborative effort to map historical changes in California’s Sacramento-San Joaquin River Delta. In a similar spirit, the fantasy satellite map shown at left, created by Central Valley geographer Mark Clark and noted by Frank Jacobs, imagines what the entire state might have looked like in 1851. Perhaps the map’s most salient feature is massive Tulare Lake, which dominates the southern San Joaquin Valley. Tulare Lake, now completely dry in all but the wettest years, once boasted a surface area of 1,780 square kilometers (690 square miles), making it the largest freshwater lake west of the Mississippi River. The rain and melt water that fed the lake in times past now forms a vital input for California’s $36 billion agriculture industry.

Tulare Lake, along with the other extensive river and wetland systems depicted in the map, was drained in the late 19th and early 20th centuries near the end of a wetland-drainage movement that is as old as the country itself. In fact, much of America’s prime agricultural land in the Midwest was once wetland. As shown on the maps below, which were taken from a USGS report, the states of Iowa, Illinois, Missouri, Indiana, and Ohio, as well as California, have lost over 95 percent of their wetlands since European colonization, primarily to agriculture. Most of the changes in the East occurred during the 19th century.


