Language mapping

Punjabi and the Problems of Mapping Dialect Continua

The Wikipedia list of the world’s most widely spoken languages, by mother tongue, puts Punjabi in tenth place, with its roughly 100 million native speakers exceeding the figures given for German, French, Italian, Turkish, Persian and many other well-known languages. The Wikipedia article on the Punjabi language stresses its growing appeal, noting that, “The influence of Punjabi as a cultural language in Indian Subcontinent is increasing day by day mainly due to Bollywood. Most Bollywood movies now have Punjabi vocabulary mixed in, along a few songs fully sung in Punjabi.”

But despite Punjabi’s obvious importance, it is extremely difficult to find a map of the language on the internet. Partly this is because Punjabi spans the India-Pakistan border, and most maps of individual languages are country-based. One can thus find many language maps of India that depict Punjabi, and virtually all language maps of Pakistan do so as well. But on Pakistani language maps, the area covered by Punjabi has been diminishing in recent years. Maps made in earlier decades typically showed virtually all of the northeastern quadrant of the country as Punjabi-speaking, whereas many recent maps retain the Punjabi label only for the core zone of this region. On these maps, what used to be the southern Punjabi area is now typically mapped as Saraiki-speaking, whereas the north is depicted as Hindko-speaking. Saraiki and Hindko, moreover, are sometimes merged together as the Lahnda language, sometimes called “Western Punjabi.” This linguistic reclassification scheme, however, is quite controversial, especially in Pakistan. Here Punjabi partisans are often irritated by the diminution of their language, whereas locally based scholars are happy to see their own speech-forms elevated to the status of separate languages.

Such controversies stem from the fact that Punjabi forms a dialect continuum, which means that adjacent dialects may be virtually identical, but the farther one travels, the more distinctive they become. As a result, dialects on the opposite sides of such a continuum may be non-mutually intelligible, and hence separate languages by standard linguistic criteria, yet no clear language boundaries can actually be located. The Punjabi dialect continuum is further complicated by the fact that it merges with the Hindi dialect continuum in northern India and with the Sindhi dialect continuum in southern Pakistan. To a certain extent, one can thus imagine a much larger dialect continuum stretching across most of northern South Asia. The standardized form of Hindi is a completely different language from standardized Punjabi, but on the margins the situation is not always so clear-cut. The presence of Urdu adds yet another layer of complexity.
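
The logic of a dialect continuum can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the eight dialects strung along a line, the linear decay of similarity with distance, and the intelligibility threshold are invented numbers, not measurements of Punjabi or any other real speech-form. The point is only that neighbors can all understand each other while the two ends cannot, so that intelligibility offers no natural place to draw a language boundary.

```python
# Toy model of a dialect continuum. Dialects are numbered points on a line;
# the decay rate and threshold below are illustrative assumptions only.

def similarity(i: int, j: int, decay: float = 0.15) -> float:
    """Hypothetical similarity score that falls off linearly with distance."""
    return max(0.0, 1.0 - decay * abs(i - j))

def mutually_intelligible(i: int, j: int, threshold: float = 0.6) -> bool:
    """Treat two dialects as mutually intelligible above an assumed threshold."""
    return similarity(i, j) >= threshold

N_DIALECTS = 8

# Every dialect understands its immediate neighbor...
adjacent_ok = all(
    mutually_intelligible(i, i + 1) for i in range(N_DIALECTS - 1)
)

# ...but the two ends of the chain do not understand each other.
ends_ok = mutually_intelligible(0, N_DIALECTS - 1)

# Triples (a, b, c) where a~b and b~c hold but a~c fails: intelligibility
# is not transitive, so it cannot partition the chain into clean "languages".
broken_chains = [
    (a, b, c)
    for a in range(N_DIALECTS)
    for b in range(a + 1, N_DIALECTS)
    for c in range(b + 1, N_DIALECTS)
    if mutually_intelligible(a, b)
    and mutually_intelligible(b, c)
    and not mutually_intelligible(a, c)
]

print(adjacent_ok, ends_ok, len(broken_chains) > 0)
```

Wherever one chooses to cut such a chain, the dialects on either side of the cut are still mutually intelligible, which is why boundary lines on maps of this kind are bound to be contested.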

A relatively new Wikipedia language map (dated January 31, 2013) deals with these issues by mapping local dialects in the Punjabi-speaking area in both Pakistan and India. The caption of this map found on the “Punjabi Language” Wikipedia article (but not on other Wiki articles that use it) is delightfully honest: “Dialects Sometimes called Punjabi.” Note that on this map “Hindko” is highly restricted, whereas “Saraiki” does not appear at all. One must wonder how much sub-dialectal variation is found in some of these mapped dialect areas, particularly in the elongated Derawali zone (colored red on the map).

The Wikipedia article on Derawali indicates that a certain degree of linguistic convergence is now occurring: “Today like all other dialects in Punjab, a process of unification and getting closer to Standard Pakistani Punjabi (Urdu influenced Majhi written in Shahmukhi) has made it [Derawali] quite similar morphologically, syntactically and mutually intangible with Standard Punjabi.” The lexical table provided in the same article, however, makes Derawali seem quite different from standard Punjabi. Whereas in the latter, the English words “boy, girl, woman, and man” are rendered “Munda, Kuri, Znaani, Aadmi,” in Derawali they are given as “Chohr, Chohir, Aurat, Mard.”


Linguistic Phylogenies Are Not the Same as Biological Phylogenies

(Note: This post is jointly written by Martin Lewis and Asya Pereltsvaig)

A key assumption of Bouckaert et al. is that the diversification and spread of languages operates so similarly to the diversification and spread of biological organisms that the two processes can successfully be modeled in the same manner. The parallels between organic and linguistic evolution are indeed pronounced. Both processes entail replicating codes that continually change, giving rise to novel varieties that increasingly differ from their progenitors over time. As a result, “phylogenetic trees,” showing descent from common ancestors, are a common feature of both evolutionary biology and linguistics.

But despite their similarities, organic evolution and linguistic evolution are in many ways highly dissimilar. Encoding information for communication is not the same as encoding information that generates life: language is vastly more fluid and complex than the genetic code; individual languages are much less clearly differentiated from each other than are species; and language is a social phenomenon, given to influences largely irrelevant for biological evolution. The key differences can be summarized as follows: biological evolution is unconstrained but governed by natural selection (any mutation can happen, but which mutations remain in the pool depends in large part on natural selection), whereas linguistic variation (seen in terms of deep grammatical properties) is constrained by a system of parameters but is not subject to natural selection. As a result, the branching trees of linguistic descent are merely analogous to the phylogenetic diagrams of biological evolution, and do not indicate the same kind of relationships.

Although organic evolution operates through a much more restricted set of message-carrying units than does human language, it nonetheless produces diversity at a much deeper level. Given the biological constraints of the human brain/mind (as of yet less than fully understood), there are only so many ways in which any given language can be structured. To be sure, the number of possible human languages, both extant and extinct, as well as those that may arise in the future, is vast, but all human languages appear to be “variations on a theme,” guided by the same parameters. Some languages have as few as two vowels (Ubykh, Northwest Caucasian) and others as few as six consonants (Rotokas, North Bougainville); other languages may have as many as 20 vowels (e.g. the Taa language, spoken in Botswana and Namibia, is reported by some sources to have as many as 20 or even 30 vowels, depending on analysis) and as many as 84 consonants (as in Ubykh; the Taa language is reported to have 87 consonants under one analysis, 164 under another). But crucially, all languages differentiate vowels from consonants and use both. Some languages put verbs before subjects and objects, while others place them at the ends of sentences, but all languages have verbs, subjects and objects.* Some languages can build sentence-long words packed with numerous prefixes, infixes, or suffixes, while others use stand-alone, stripped-down words to do the grammatical work of expressing tense, number etc., but all languages make words from morphemes, and all construct sentences. As a result of this limited space of possibilities, completely unrelated languages evolving on their own often come to share major grammatical traits.

Linguistic evolution, unlike that of the biological realm, moves at a rapid clip. In non-literate societies, words change so quickly that after some five to eight thousand years not enough cognates can be traced back to establish linguistic relatedness. In the same time span, grammatical structures can undergo wholesale transformations, and sound inventories can change drastically as well. As a result, even clearly related languages can have next to nothing in common with each other, and can only be linked through investigations into their ancestors. Hindi and English, two of the three most widely spoken Indo-European languages, are dissimilar in almost every respect.** On casual inspection, Hindi would seem to have more in common with the non-Indo-European languages of the Indian sub-continent than it does with English.

Thus, relatedness at the family level and overall linguistic similarity often fail to correspond. Maps showing major language patterns typically bear little if any resemblance to maps depicting linguistic families. Even something as seemingly basic as word order correlates poorly with lines of descent. For example, Indo-European languages can be SVO (subject-verb-object; marked by red dots on the map to the left), such as English, Romance, and most Slavic languages (but Sorbian, a Slavic language, is SOV); SOV (marked by blue dots), such as the Indo-Iranian languages (yet Kashmiri is SVO); or VSO (marked by yellow dots), such as the Insular Celtic languages (yet Cornish is SVO). Some other families, such as Austronesian, show even greater variability in basic word order: Niuean is VSO, Malagasy is VOS, Rotuman is SVO, and Tuvaluan is OVS.

Similarly, features of morphological typology (how words are formed from morphemes) often cross-cut connections established by common descent. Whereas Proto-Indo-European, like most of its daughters, was a synthetic language (building words from multiple non-root morphemes), English and Afrikaans are relatively analytical (with low ratios of morphemes to words), which gives them a certain affinity with Mandarin Chinese (a highly analytical language). As discussed in an earlier GeoCurrents post, isolating languages are found in Africa (Hausa, an Afroasiatic language), Asia (Vietnamese, Austroasiatic), Oceania (Rapanui, Austronesian), and the Americas (Kipea, Kiriri). In phonology as well, similar patterns obtain, as sound inventories often fail to show systematic correspondences with language families. The Indo-European languages of South Asia, for example, are in many respects more phonologically similar to the Dravidian languages of the same region than they are to most other IE languages. One of the characteristic phonological markers of the region, the rich inventory of retroflex consonants, is also scattered across the rest of the world, found in about 20 percent of all languages belonging to a wide variety of families.

One of the best ways to appreciate the relative insignificance of language families in regard to the global distribution of such features is to explore the maps that can be generated on the WALS website, such as the one reproduced above. Few if any of these maps bear much resemblance to the familiar depiction of the world’s major language families.

Again, the contrast with biological evolution is stark. The farther removed organisms are from each other on the tree of life, the fewer genes they necessarily share. Even when convergent evolution results in similarities between distantly related organisms, the parallels are relatively superficial. As a result, modern genetic inquiry can establish precise levels of biological relatedness, a process that has revolutionized taxonomy over the past few decades. In the biological realm, moreover, the farther one moves up different branches of evolutionary descent, the more distinctive the organisms found along it generally become. Chordates (the phylum that includes vertebrates) share a distant common ancestor with echinoderms (sea stars and their relatives), and some tunicates, primitive members of phylum Chordata, might be mistaken by unschooled observers for sea lilies in phylum Echinodermata. (Tunicates more generally look like unrelated jellyfish and other cnidarians; a few could be mistaken for rocks, but such rocks disconcertingly bleed when cut open.) But no one would ever mistake any mammal for a sand dollar, a sea cucumber, or any other echinoderm, animals characterized by radial rather than bilateral symmetry. The two phyla have simply evolved in strikingly different directions. If linguistic evolution worked in the same manner, it is questionable whether translation between distant languages would even be possible. Moreover, the disparate patterns of spatial distribution of deep grammatical properties, such as the ones illustrated by the WALS maps, would not be found.

In language, deep grammatical properties can radically change, often taking on the same forms as those encountered in wholly unrelated tongues. As a result, linguistic relationships are often anything but obvious, and can only be discerned through intensive study; significantly, such hidden connections can hold true even for relatively recently emerged languages. A fluent speaker of the major Germanic languages, for example, might be nonplused to learn that Frisian is more closely related to English than it is to Dutch. Yet according to some specialists, even Low German is “phylogenetically” closer to English than it is to (High) German, even though Low German is generally regarded as a mere dialect (or group of dialects) of German!

Linguistic evolution is only vaguely analogous to organic evolution for a variety of reasons, but a crucial factor is the fact that vastly less sharing occurs across biological lineages. We now know that genes can jump from one species to another, but the process is relatively rare; in this realm, change generally occurs as a result of random mutations acted upon by natural selection, not from the borrowing of elements from other species. When it comes to languages, however, sharing is ubiquitous. Languages are almost always borrowing words, and sometimes they adopt grammatical properties of other languages as well. At times, two completely unrelated languages essentially merge to create a hybrid tongue. To be sure, linguists are almost always able to determine which language contributed more elements and more basic structures, and hence should count as the parent tongue. (It should be noted that the use of the terms “parent” and “daughter” in relation to languages is misleading since, unlike in the biological realm, where individual organisms are discrete, the transition from “parent” to “daughter” language is always gradual.) When it comes to creole languages, however, such determinations are not always easy. In regard to grammar, different creoles of completely different parentage are often more similar to each other than they are to any of their source languages. In some instances of mixed languages, admixtures of vocabulary, grammar, and phonology run so deep that linguists abandon the quest for unambiguous classification. Cappadocian Greek, for example, is slotted by Wikipedia into the seemingly impossible “Greek-Turkish” language family. Does Indo-European therefore encompass this language? Other sources, such as the Ethnologue, place this language in the Greek branch of the Indo-European family, but Turkish influences on Cappadocian Greek are pronounced: it has certain sounds that have been borrowed from Turkish, as well as vowel harmony; it has developed agglutinative inflectional morphology and lost (some) grammatical gender distinctions; and its basic word order is SOV. And Cappadocian Greek is by no means the only example of such a thoroughly “mixed language.” In the biological realm, in contrast, such mixtures are so obviously impossible that they have generated their own nonsense genre, as exemplified by Sara Ball’s delightful flip-book, Crocguphant.

Linguistic family trees must therefore be taken as often showing lines of partial descent, unlike the phylogenetic diagrams of organic evolution. To gain a more complete understanding of linguistic relatedness, it is necessary to complement language families with other kinds of connections. The various languages of a Sprachbund, or a linguistic convergence area, for example, derive from different families, yet nonetheless come to share many features through long histories of mutual interaction. One must also consider linguistic strata, which take into account the influences imposed by one language on another. The role of a linguistic substratum, derived from a previously existing language that was later supplanted by another tongue, can be profound. In many cases, such linguistic substrates were instrumental in generating subfamilies; the Germanic languages, for example, are distinct from other Indo-European languages not merely because they drifted in their own particular direction, but also because they acquired a major substrate from another (unknown) language family. Sometimes, the ghostly presence of a long extinct language or language family can be detected through such substrates. Vedic Sanskrit, for example, was definitely an Indo-European language, but it was influenced not only by the preexisting Dravidian and Munda languages of the Indian subcontinent, but also by an unknown substrate dubbed “Language X” by Colin Masica.

A useful alternative to the linguistic tree is the so-called wave model, or Wellentheorie, originally devised to explain some of the characteristics of the Germanic languages that seemed to defy the phylogenetic approach. In wave theory, fluid dialect continua replace the stable, geographically bounded languages required by models predicated on direct descent from ancestral tongues. Here, innovations can occur at any point within a dialect continuum; such changes then spread outward in a circular manner, eventually dissipating as the distance from the innovation center increases.*** If a bundle of innovations substantially overlap and become entrenched, a new dialect, or even language, can be said to have emerged. But according to wave theory, such a “language” is still best viewed as an “impermanent collection of features at the intersections of multiple circles.”
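
The wave picture lends itself to a very small sketch. What follows is our own toy illustration under stated assumptions, not a model taken from the wave-theory literature or from Bouckaert et al.: the villages sit on a one-dimensional line, the three innovations (`A`, `B`, `C`), their centers, and their radii are all hypothetical, and each wave is assumed to stop dead at its radius rather than fade gradually.

```python
# Toy wave-model sketch. Locations are integers on a line; each innovation
# spreads a fixed distance from its center. All names and numbers here are
# hypothetical, chosen only to illustrate overlapping "circles".
from dataclasses import dataclass

@dataclass(frozen=True)
class Innovation:
    name: str     # label for a linguistic change (hypothetical)
    center: int   # location where the change arose
    radius: int   # how far the wave travels before dissipating

INNOVATIONS = [
    Innovation("A", center=2, radius=3),
    Innovation("B", center=5, radius=2),
    Innovation("C", center=8, radius=3),
]

def features_at(position: int, innovations=INNOVATIONS) -> frozenset:
    """Return the set of innovations whose wave has reached this position."""
    return frozenset(
        inn.name
        for inn in innovations
        if abs(position - inn.center) <= inn.radius
    )

# A "dialect" at each village is just the bundle of features found there.
for village in range(0, 11):
    print(village, sorted(features_at(village)))
```

In this sketch a village at position 0 has only feature `A`, one at position 10 has only `C`, and one at position 5 sits at the intersection of all three waves. No village has a single "parent"; each is defined by whichever circles happen to overlap at its location, which is the sense in which a "language" is a collection of features rather than a branch on a tree.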

Wave theory does recognize, however, the fact that a single language/dialect can appropriate an entire dialect continuum, subordinating more localized speech forms and eventually driving them into extinction, as indeed was the case in regard to Standard German over most of Germany. Such a process, however, generally requires the power of the state or of some other overarching institution. Such geographically expansive and culturally potent organizations, however, are a feature of the relatively recent past; for most of humankind’s existence, the institutions necessary for producing linguistic standardization over broad areas were lacking. We are so used to the modern world of mass communication over vast distances and of language-standardizing governments and educational systems that we easily forget that in earlier times, and in many remote areas to this day, different linguistic environments prevailed. Overall, we suspect that for most of human history, the wave theory more accurately captures the process of language change than does the standard phylogenetic model. Yet in the most general terms, the two models complement each other relatively well.

*Debate does rage, however, about whether the so-called “non-configurational languages,” such as the Australian language Warlpiri, have subjects and objects in the same sense as the more familiar “configurational” languages like English or French. The reader is referred to Baker (2001) for evidence of subject-object asymmetries in such non-configurational languages.

**For example, Hindi makes a phonemic distinction between aspirated and unaspirated voiced stops, has fusional case/number morphology, subject-object-verb word order, postpositions, and uses the ergative-absolutive alignment in the preterite and perfect tenses; English, in contrast, has no aspirated voiced stops (and does not use aspiration phonemically at all), has largely abandoned fusional morphology, has lost the case system except with pronouns, employs a subject-verb-object word order, uses prepositions rather than postpositions, and is characterized by nominative-accusative alignment.

***Ironically, the diffusion analogy of Bouckaert et al. may be best suited to describing dialectal continua rather than divergence and expansion of languages and language families; we shall return to this point in a forthcoming post.



Baker, Mark C. (2001) The Natures of Nonconfigurationality. In Mark Baltin and Chris Collins (eds.) The Handbook of Contemporary Syntactic Theory. Oxford: Blackwell. Pp. 407-438.



103 Errors in Mapping Indo-European Languages in Bouckaert et al. Concluded: Part V, Western Europe

By now, all of the cartographic failings of Bouckaert et al. have become familiar. On the map of France and neighboring areas, for example, we see the unreasonable elevation of minor dialects to the status of discrete languages (three forms of Breton make the list), the replacement of a non-Indo-European language with an Indo-European language (the Basque region is shown as French-speaking), the improper use of political boundaries as linguistic boundaries (French is not shown as extending into Switzerland), the preferential classification of dialects as languages when they are associated with states (Walloon counts as a language, unlike the other equally distinctive langues d’oïl of northern France or the langues d’oc of southern France; Flemish counts as a language, unlike other equally distinctive forms of Dutch), and the simple geographical misplacement of languages (Romansh is placed in northwestern Italy rather than southern Switzerland). Of particular note in regard to the linguistic mapping of France is the fact that Corsica is completely obliterated by circle #48 (see the map of the Italian Peninsula in the previous post).


The mapping of the Iberian Peninsula is particularly simplistic. The authors have simply placed Portuguese in Portugal and Spanish, along with Catalan, in Spain. The fact that Galician in northwestern Spain is closer to Portuguese than to Spanish is ignored, and the Basque-speaking region is mapped as if it were Spanish speaking. The Balearic Islands are also neglected, as archipelagoes generally are in the authors’ land-biased approach.


The map of the British Isles severely misconstrues the Celtic tongues. Irish, for example, is shown as extending across all of the Republic of Ireland and as entirely absent from Northern Ireland. In actuality, Irish has long been largely limited to the western margin of the island, and as late as the early 20th century was still spoken in parts of what was to become the political unit of Northern Ireland. The mapping here, in other words, is yet again political rather than linguistic. By the same token, Welsh is placed in the coal-mining districts of southern Wales where it has been absent for generations, just as Cornish is depicted in areas where it has not been spoken for hundreds of years. The mapping of Scottish Gaelic is not bad, but the term used, “Scots Gaelic,” is off the mark. The proper term is “Scottish Gaelic,” as “Scots” refers to a different language altogether. Scots, or Lowland Scots, is usually regarded as a highly distinctive form of English, but some linguists regard it as a language in its own right (CNN has recently reported on the demise of one of its dialects).*




The mapping of extinct languages is also poorly executed. Old English is essentially restricted to the historical kingdom of Wessex, even though the language extended as far north as the Edinburgh region of what is now southeastern Scotland, and included dialects of Kent, Mercia, and Northumbria. Significantly, even the Wessex (West Saxon) dialect of Old English extended farther to the east than what Bouckaert et al. would allow for Old English in its entirety.

The language map of Bouckaert et al. that I have criticized over these past five posts is a cornerstone of their model, yet it is also wholly inadequate for the task. Many of the errors found here ramify through all of the maps that they have produced. But even if a serviceable map had been constructed, the model would still yield nonsense, as most of the assumptions upon which it is based are unwarranted, as we shall see in more detail in subsequent posts.

*Although I am no expert on this topic, I would argue that Lowland Scots is almost but not quite interintelligible with Standard English, especially in its spoken form, and thus deserves to be regarded as a separate language. Although I love the poetry of Robert Burns, I generally need translation. Take, for example, these verses from “Auld Lang Syne”:

In the Original Scots:

We twa hae run about the braes,

and pu’d the gowans fine;

But we’ve wander’d mony a weary fit,

sin auld lang syne.


We twa hae paidl’d i’ the burn,

frae morning sun till dine;

But seas between us braid hae roar’d

sin auld lang syne.


In Standard English:

We two have run about the slopes,

and picked the daisies fine;

But we’ve wandered many a weary foot,

since long long ago.


We two have paddled in the stream,

from morning sun till dinner time;

But seas between us broad have roared

since long long ago.


Or listen to the delightful poem “To a Mouse” on the ScotsIndependent website:


Wee, sleekit, cow’rin, tim’rous beastie,

O, what a panic’s in thy breastie!

Thou need na start awa sae hasty

Wi bickering brattle!

I wad be laith to rin an’ chase thee,

Wi’ murdering pattle.



103 Errors in Mapping Indo-European Languages in Bouckaert et al., Part IV (Central Europe)

(Continued) The main problems with the language map of eastern Central Europe in Bouckaert et al. have already been discussed; to wit, the depiction of “national” languages as coterminous with state boundaries. The authors do occasionally deviate from this norm, showing, for example, a tiny non-Romanian area in northwestern Romania. Note also that they show Latvian as failing to reach Latvia’s northwestern coast. This view is indeed historically accurate, as northern Courland was the land of the Livonians, a Finnic-speaking people. The last native speaker of Livonian, however, died in 2009; for decades before that, Livonian was severely endangered and most speakers were bilingual in Latvian or Russian. If the map purports to depict the present situation, it is flatly wrong here. If it depicts the relatively recent past, as it does for some areas, it is more on target. Unfortunately, no time specification is provided.

Such unspecified chronology is a more intractable problem for the depiction of extinct languages. Major languages of the distant past often experienced major geographical changes, sometimes literally moving en masse when their speakers migrated. The Goths, for example, probably originated in what is now Sweden, later crossed the Baltic into northern Central Europe, subsequently moved into the steppes north and northwest of the Black Sea, and eventually spread with victorious warrior bands over much of the Roman Empire; the final redoubt of the language was the Crimean Peninsula, where it persisted until the ninth century and perhaps until early modern times. Any Gothic language polygon would thus fit a specific place only at a specific time. Bouckaert et al. have apparently selected the period just after the movement of Gothic out of Scandinavia, although the area specified does not seem to match what (little) is known about the early relocation of the language (see the map to the left).

As mentioned in the previous post, the placing of Byelorussian (Belarusian) in a small corner of the Czech Republic is a careless transcription error. But the intended depiction, that of Eastern Czech, is still off base. Czech is not heavily differentiated into dialects. The truly distinctive forms of the language are halfway to Polish. Cieszyn Silesian and other Lach dialects are regarded by most Czech linguists as a Polish-influenced form of Czech and by most Polish linguists as a Czech-influenced form of Polish (politics do tend to intrude into linguistic discussions). Such dialects, however, are not on the map. What is (supposed to be) shown is “Eastern Czech,” placed in a small corner in the southeastern part of the Czech Republic. It is unclear what this designation refers to. Across the entire eastern half of the republic, one finds the Moravian dialect (or dialects), which is not strikingly different from standard Czech.

The linguistic depiction of the Italian Peninsula in Bouckaert et al. contains some curious features. This portion of the map is difficult to decipher, as extinct languages overlay extant languages, and much of the area is covered by the circular labels. It is still clear, however, that the mapping here remains inconsistent. Italian is shown as extending neither into the Po Valley in the north nor to Sicily in the south. Fair enough: the local dialects spoken (or spoken until recently) in those areas are markedly different from Standard Italian, based on the Tuscan dialect. Yet the authors place other parts of the peninsula with equally distinctive dialects, such as Apulia in the southeast, in the Italian language category. In regard to the extinct Indo-European languages mapped here, the major issue is why only Umbrian and Oscan were selected to accompany Latin.




Most of the problems found on the map of Germany and environs have already been discussed. Note, for example, how Luxembourgish makes the cut on political grounds, whereas other distinctive German dialects are ignored. Of special note here is the demarcation of two Lusatian (or Sorbian) languages, although only one is labeled on this map segment. These Slavic tongues of eastern Germany are distinctive, and mapping them as separate languages makes linguistic sense. But it is difficult to understand why these relatively minor languages, with 40,000 and 10,000 speakers respectively, have been added to the tally, whereas Iranian and Indic I-E languages with hundreds of thousands to tens of millions of speakers have been ignored.

The language mapping of Scandinavia shows, yet again, striking geopolitical influence. Here we have Danish blanketing Denmark, Riksmål (or the Norwegian “national language”) everywhere in Norway except the islands and Finnmark, and three separate Swedish languages covering all of Sweden except the islands, which remain unmarked. The straight east-west line that separates two supposedly distinct Swedish languages is a curious and highly unlikely feature.

But as one would expect, the continental Scandinavian languages do not actually correspond so well to national territories. Overall, the region is characterized by a dialect continuum so pronounced that some scholars regard all of the mainland North Germanic tongues as a single, regionally differentiated language. Swedish and Danish are almost interintelligible, and Norwegian is often regarded as a kind of a bridge: as a common saying puts it, “Norwegian is Danish spoken in Swedish.” (Norwegian vocabulary is similar to that of Danish, whereas its phonology is more like that of Swedish). But it is more complicated than that, as there is no single Norwegian language at any level. Local dialects cross the border with Sweden, but even in terms of official state recognition, Bokmål (“book language”) competes with Nynorsk (“New Norwegian”), and neither of these two variants is exactly the same as the standardized but non-official Riksmål (“national language”) and Høgnorsk (“High Norwegian”) forms. The differences between Bokmål and Nynorsk are not purely lexical (e.g. Bokmål pike ‘girl’ vs. Nynorsk jente ‘girl’), but concern grammatical patterns too (e.g. Bokmål does not distinguish masculine and feminine genders, whereas Nynorsk does). In a sense, the differences between Bokmål and Nynorsk are more pronounced than those between Bokmål and Danish (e.g. the Danish word for ‘girl’ is pige, and most dialects of Danish and its standardized form do not distinguish masculine and feminine genders). The contention among these different language varieties is at once political, cultural, and historical, tied up with Norway’s former subordination to Denmark. Norwegian linguistic nationalists have often wanted to purge specifically Danish elements from the language, whereas linguistic traditionalists would like to preserve them.

Legacies of geopolitical change are also evident in the Scania region of southern Sweden. The dialects of Sweden’s far south are close to those of Denmark—so close, in fact, that some scholars place them within an “East Danish” category. Significantly, Scania was part of the Kingdom of Denmark until it was lost to the rising power of Sweden in 1658; it did not become an integral part of Sweden, however, until 1719, at which point a policy of linguistic “Swedenization” was initiated. “East Danish” is thus considered by some to be more a historical than a linguistic category.

One of the oddest features of the mapping strategies employed by Bouckaert et al. is their reluctance to include islands within the territories of any language. In some cases, island groups are appended to mainland polygons, as can be seen here in the depiction of Danish (in the same manner, the Hebrides are mapped as Scottish-Gaelic speaking). Most often, however, islands and archipelagos are simply ignored, as one can see here in the cases of Norway’s Lofoten and Sweden’s Gotland and Öland. Had Gotland been considered, I wonder whether it would have been mapped as Gutnish speaking. Gutnish, a disappearing dialect, is distinctive, and is sometimes said to be a direct descendant of ancient Gothic.

The mapping of Old Norse as coinciding with Iceland is also untenable. When Old Norse was spoken in Iceland, it was also spoken in Norway, Sweden, Denmark, northern Scotland, and pockets of the western British Isles.



103 Errors in Mapping Indo-European Languages in Bouckaert et al., Part III: From Western Russia to the Balkan Peninsula

(Continued) The most glaring error in the linguistic map of western Russia and environs by Bouckaert et al. concerns the labeling of Belarus. The number “22,” placed in the center of the country, is listed as signifying the “Czech E,” which presumably means “eastern Czech.” As the authors have correspondingly appended the label “Byelorussian” to a small area in the eastern Czech Republic, the error is obviously one of transposition. Such mistakes can occur inadvertently, although the fact that it has gone undetected indicates a troubling failure to engage in routine proofreading.

A much deeper problem is indicated by the intentional mapping. Note how the polygons indicating the Belarusian and Ukrainian languages correspond precisely to the present-day territories of Belarus and Ukraine respectively. Such exact political-linguistic correspondence is rare, and when it is encountered it generally indicates a recent history of state-led linguistic repression or ethnic cleansing, which should be taken into account in any historical consideration of linguistic geography. In the case of Belarus and Ukraine, however, the current distribution of the national languages does not even come close to fitting precisely within the geographical bodies of the respective countries.

Belarusian is widely spoken in Belarus but it is not the country’s majority language and it is dominant only in the west and the south, as can be seen on the Wikipedia map posted here. Even in these areas, Belarusian is losing ground among the young, and is thus classified as a “threatened language.” The threat stems from Russian, which, according to the 2009 national census, is spoken at home by 72 percent of the people of Belarus. Identifying the Belarusian language with the national territory of Belarus is—yet again—a political rather than a linguistic statement.

Placing the Ukrainian language precisely within the territorial bounds of Ukraine is an even more egregious error. The fact that eastern Ukraine and the Crimean Peninsula are mostly Russian-speaking areas is well known, as it is mentioned almost every time that Ukrainian elections are discussed. According to the Constitution of the Autonomous Republic of Crimea, Russian rather than Ukrainian serves as the “language of interethnic communication.” Moreover, government duties in Crimea are fulfilled mainly in Russian, hence it is a de facto official language. The issue of whether Russian should be made co-official in other areas of eastern and southern Ukraine that are already de facto Russian-speaking is hotly debated at the parliamentary level. Before WWII, moreover, the linguistic map of the region was far more complex than it is now, an observation that holds true for most of eastern and central Europe. The southern Crimea, for example, was then dominated by people speaking Crimean Tatar, a language in the Turkic family.

The depiction of European Russia is little better. In this case, political boundaries are not slavishly followed, as large areas of northern Russia are correctly shown as non-Russian speaking. But many northern regions that are Russian-speaking, such as Saint Petersburg, are oddly excluded from the realm. Conversely, sizable areas in eastern European Russia are mapped as Russian-speaking when in actuality they are inhabited by peoples speaking Uralic and Turkic languages. It is admittedly difficult to map such languages as (Volga) Tatar, Mari, and Udmurt, as they are not spoken in geographically contiguous areas but rather form archipelagos in a Russian sea. But do such technical challenges warrant the exclusion of such languages? More than six million citizens of the Russian Federation speak Tatar as their first language, and mapping them as if they were Russian speakers fails to give them the recognition that they deserve. The Udmurt language, spoken by about half a million people, was recently propelled into the focus of public attention in Russia and in the rest of Europe when a band of Udmurt-speaking (and -singing) grandmothers won second place at the Eurovision Song Contest.

Such mapping difficulties are by no means limited to western Russia. In many parts of the Indo-European realm, languages are interspersed, forming complex amalgams. As mentioned above, such mixtures were much more intricate before the horrors of the Second World War and its immediate aftermath. Depicting such areas as linguistically uniform, as Bouckaert et al. routinely do, thus results in intrinsic distortions. Such distortions, moreover, seem to be a necessary feature of their basic methodology, as they depict every language within a discrete and uniform polygon. Linking together languages whose speakers are scattered in separate communities over large areas into single bounded spaces results in such absurdities as the gerrymandered Kurdistan mentioned in the previous post.

Such procrustean tendencies reach a laughable extreme in the depiction of the Romani language (that of the so-called Gypsies), seen on the map of the Balkans posted to the left. Romani, labeled 74, is impossible to locate precisely, as the area indicated is covered by the circle for number 16 in western Bulgaria. Presumably, a small, discrete Romani polygon lies below this numerical tag. To restrict the Romani language to this area is beyond absurd. Romani, like the Roma people who (sometimes) speak it, is dispersed over most of Europe. Bouckaert et al., however, do not even manage to adequately locate the language’s center of gravity, as far more people speak Romani in Romania than in Bulgaria. Mapping Romani is, of course, an extraordinarily difficult task, as the linguistic community is not only scattered widely, but its members often relocate. As a result, most cartographers simply indicate the numbers and percentages of Romani speakers (or Roma people more generally) found in different countries.

The rest of the map is not much better. Although the authors differentiate four separate Albanian languages, they depict the northern half of Albania as non-Albanian speaking. They also limit Serbo-Croatian to Serbia and Montenegro, excluding Croatia and Bosnia. Here the categories used and the map itself fail to correspond; what the map shows is the political-linguistic construct of Serbian (plus Montenegrin), used since the break-up of Yugoslavia, whereas the label turns back to the Yugoslavian idea of a single Serbo-Croatian language, which also encompasses Bosnian and Croatian. From a linguistic standpoint, Serbo-Croatian works best, as all of its politically standardized forms are mutually intelligible to some degree. But by the same token, Bulgarian and Macedonian, shown here as separate languages, are similarly interintelligible. The underlying problem here is the lack of uniformity in the treatment of different languages: if they have four Albanian languages as well as separate languages in Bulgaria and Macedonia, they should have separated Serbian, Croatian, Bosnian, and Montenegrin—or better still, they should have differentiated the non-political dialectal divisions of Serbo-Croatian: Chakavian, Kajkavian, Western Shtokavian, Eastern Shtokavian, and Torlakian.

Finally, the mapping of Greek, both ancient and modern, is bizarrely idiosyncratic. On what possible basis could the authors limit ancient Greek to Athens and its vicinity? The implicit argument here is that only Attic Greek was Greek, with the other Hellenic polities speaking non-Greek languages, a nonsensical idea. And yet they don’t even manage to map Attic Greek properly, leaving out the islands on which it was spoken. One can only conclude that the authors are incompetent at mapping languages, a cornerstone of their approach.



103 Errors in Mapping Indo-European Languages in Bouckaert et al., Part I

As our criticisms of Bouckaert et al. have been extremely harsh, we must justify them in some detail. I have accused the authors of erring “at every turn,” a charge that reeks of hyperbole. But even if that claim is exaggerated, it is still not too far from the mark. To demonstrate the extraordinary density of error in the Science article, the next few posts will dissect the authors’ base map of Indo-European languages (Figure S6 in their Supplementary Materials). This map, depicting the distribution of both modern and ancient Indo-European languages, forms a key input for their “explicit geographic model of language expansion” (Bouckaert et al., p. 957), as the locations of the sampled languages shown on this map are fed into the model in order to calculate the location of the PIE homeland. Many of the errors and inconsistencies found on their other maps stem from mistakes made in this initial figure.

The map in question shows the location of the 103 Indo-European languages analyzed. The brief caption notes that “colored polygons represent the geographic area assigned to each language based on Ethnologue.” This assertion is misleading at best. The Ethnologue does not consistently map modern languages, and it pays little attention to long-extinct ones such as Hittite. And where the Ethnologue does map, it typically does so in vastly greater detail than Bouckaert et al. Compare, for example, how the two sources depict the languages of what is now southern and central Pakistan in the paired figures to the left.

Regardless of the source (or sources) used, the map is highly inaccurate. To illustrate the cavalcade of error found in Bouckaert et al., I have isolated 103 miscues, some admittedly rather minor, but others highly significant. As recounting all of them would be tedious, I will simply note them in call-outs on expanded details from their “master map.” I have prepared twelve such enlarged maps, each focusing on a different part of the historically Indo-European-speaking world. I will post these maps sequentially over the next few days, discussing in the accompanying posts some of their more egregious errors. Today’s post will conclude with a consideration of South Asia; subsequent ones will move in a westward direction, terminating in the British Isles.

Before examining the portrayal of the Indian Subcontinent in Bouckaert et al., a few words are in order about their general approach to mapping. Analyzing their base-map is no easy matter, as they do not follow conventional cartographic procedures. Their all-important polygons are often impossible to trace, obscured by the large, numbered circles used to label the 103 languages. Another perceptual problem stems from their use of overlays, with multiple extinct languages (in red) layered upon extant languages (in blue). The resulting color blends yield confusing intermediate shades. Note on the detail posted to the left the depictions of Luvian, Hittite, Classical Armenian, Kurdish, and modern Armenian. Determining which language is indicated in which places takes some patience.

A more intractable problem concerns the map’s temporal framing. The short explanation provided in the caption makes the issue seem simple: “Red areas indicate ancient languages and blue areas indicate modern languages.” Left unanswered is the time frame of “linguistic modernity.” In some places, the term is defined broadly, extending back hundreds of years. Cornwall, for example, is shown as inhabited by speakers of modern Cornish. Such a view is anachronistic, as Cornish had disappeared from most of the peninsula by 1700, and was essentially extinct before the modern revival movement began in the 20th century. (Today Cornish is estimated to have only “a few” native speakers.) Elsewhere, the mapping of “modern languages” refers to the late 20th century. The German zone, for example, fits only the post-WWII period, after millions of German speakers had been expelled from Pomerania, Silesia, and Sudetenland. The map, to put it simply, plays fast and loose with time and space.

Even more problematic is the mapping of many languages on the basis of political rather than linguistic features. As was noted in an earlier post, all of the maps used in the study show signs of what I called “geopolitical contamination,” in which the boundaries of modern-day states incorrectly determine those of language groups, following Max Weinreich’s dictum that “a language is a dialect with an army and navy.” I was puzzled, for example, by the fact that Moldova was placed outside of the Indo-European realm in Figure S4, showcased on Quentin Atkinson’s website. The reason is readily apparent when one considers the map of the 103 language polygons (Figure S6). Here Romanian is depicted as almost exactly coincident with Romania. Moldova is fully excluded from this realm, even though the official “Moldovan Language” is differentiated from Romanian solely on political grounds. One can indeed identify a Moldovan subdialect of Romanian, but it spans the Romanian-Moldovan border. Moldova should thus have been placed within the Romanian polygon, yet it is instead depicted in the same manner as Hungary, giving the impression that it lies outside the Indo-European realm. The consequences of such a strategy are troubling for the contemporary world, but become positively pernicious when retroactively extended into the past, which is precisely what the Bouckaert model does. As a result, almost all of Moldova is ludicrously mapped as most likely never having been occupied by Indo-European speakers in Figure S4.

Such geopolitical contamination is clearly evident in the depiction of the languages of South Asia, posted here. Note that Bengali, often regarded as the world’s sixth most widely spoken language, is essentially limited to Bangladesh, its 80+ million speakers in the Indian state of West Bengal written out of the linguistic community. Even more unreasonably, Vedic Sanskrit is given the polygon of a modern political unit. The supposed territory of this ancient language is outlined and shaded in red in the map posted here. This area, it turns out, precisely fits the territorial extent of Punjab before it was partitioned by the British. That colonial-era Punjab would have no bearing on the distribution of Vedic Sanskrit, spoken some 3,000 years ago, should go without saying. It is also worth noting that the former Punjab included what is now the Indian Himalayan state of Himachal Pradesh, which features peaks 22,000 feet above sea level. It is safe to assume that such areas were never part of the Vedic Sanskrit realm.


Mapping Vedic Sanskrit is no easy task, but that is no excuse for using a modern geopolitical proxy. Careful studies show that the world of the Rig Veda was largely limited to what are now the Indian and Pakistani states of Punjab along with the Vale of Peshawar and Swat Valley. “Vedic India” in the larger sense extended from this region down the Ganges Valley through Bihar and southward to encompass Gujarat, as can be seen in the second map posted here. Either of these two areas could easily have been used for the Vedic Sanskrit polygon.


I will not comment further on the remaining errors and infelicities in the Bouckaert et al. portrayal of South Asia, as a number of them are noted on the map itself. I have also posted a fine Wikipedia map of the current distribution of the Indo-European languages of South Asia for comparative purposes. (Note that this Wikipedia map lumps a number of disparate dialects into single languages, such as Bihari.)

As we shall see in forthcoming posts, similar errors litter all other portions of the original language map employed by Bouckaert et al. As a result, it is difficult to avoid the conclusion that the authors simply do not have the level of geo-linguistic comprehension necessary for carrying out their task. I have taught the geography of modern languages at leading universities for twenty-five years, and I can peg the level of understanding demonstrated by students fairly accurately. That of Bouckaert et al. would clearly fall into the “B” range. Given the unfortunate realities of grade inflation, that means that more than half of my undergraduate students finish their terms with a better understanding of the distribution of languages than the authors of a supposedly path-breaking article on the origin and spread of the world’s largest language family published in one of the world’s leading scientific journals.




Misleading Language Maps on the Internet

Although the internet allows easy access to manifold cartographic treasures, it provides even more rapid access to misleading, poorly constructed, and laughably inaccurate maps. Consider, for example, language maps at the global scale. A simple Google image search of “world language map” yields over 600 million results, although only the top hits, and by no means all of them, actually show linguistic maps of the world. Those that do can in general be divided into two categories: maps that depict language families, and maps focused on the most widely spoken individual languages. Today’s post considers the latter category, analyzing Google’s eight most highly ranked “world language maps” that portray the distribution of specific languages.

All of these maps are actually best described as “political linguistic maps,” as they organize their depiction of language distribution in accordance with the territories of internationally recognized states. As a result, multilingual states—which constitute the majority of the world’s countries—tend to be mapped as monolingual. Canada is divided into English- and French-speaking zones in roughly half of these maps, but few other countries are treated in such a manner. In almost all cases, official languages are highlighted regardless of whether they are actually spoken by the majority of the population; as a result, Mali appears to be as much a French-speaking country as France. The criteria for language selection generally go unmentioned and in most cases seem inconsistent, but it does appear in general that the “number of native speakers” outweighs the “total number of speakers.” As a result, one of the world’s most widely spoken and politically significant languages, Indonesian/Malaysian (“Malay”), is usually ignored, often in favor of much less widely used European languages.

Such problems, however, are difficult to avoid. Multilingualism alone—at both the individual and the communal level—makes language mapping a frustrating exercise. But in all of the maps analyzed, the flaws run much deeper. Most are riddled with errors, many at a quite elementary level. As a result, the use of such readily accessible maps risks undermining knowledge of the world by delivering misinformation. To substantiate such harsh allegations, the remainder of the post will examine in some detail each of the top eight world language maps that depict individual languages.

The first map has relatively few obvious blunders, although portraying Namibia, Lesotho, and Swaziland as French-speaking is a howler. Mapping Djibouti as an Arabic-speaking country is also problematic; although Arabic has official status—as does French—relatively few Djiboutians speak it, as Somali and Afar are the main vehicles of communication. Map 1 does divide a few countries into separate languages, and does so with a degree of accuracy. Not only is Canada split at the provincial level into Anglophone and Francophone areas, but so too is Cameroon, while Chad is divided into Arabic- and French-speaking zones. Nowhere else, however, do language boundaries deviate from those of states. Only Kenya and Tanzania are portrayed (through diagonal stippling) as containing multiple languages in the same locations, but the effort is flubbed; presumably the intention was to show English intermixed with Swahili, but Swahili does not appear in the key. As is true of almost all maps of this kind, official languages of European origin in sub-Saharan Africa are exaggerated; curiously, however, Botswana, Malawi, and South Africa—where English has official status and is widely used—are not included in the English-speaking set. The portrayal of India as uniformly Hindi-speaking is also problematic, as is the mapping of China as completely Chinese-speaking—especially considering the fact that “Chinese” is not exactly a spoken language, but rather a group of related languages that are, with the exception of Mandarin, locally conceptualized as mere dialects.

Map 2 is a far more comprehensive effort, depicting 23 separate languages. Most are limited to a single country, sometimes incorrectly so (Austria, for example, is not depicted as German-speaking). Outrageous errors here include the depiction of Sakhalin as Japanese speaking, Mali, Cyprus, and Azerbaijan as Arabic speaking, and Belgium as speaking some uncertain language (the color used for Belgium does not appear in the key). The criteria for inclusion in this map seem particularly odd; why, for example, are relatively major languages such as Vietnamese, Bahasa Indonesia, Italian, and Polish ignored while Finnish and Norwegian are mapped? India is depicted accurately here as “multilingual,” but it is the only country so classified! The text-box labeled “other major languages spoken in the world” is confusing; how can “French and English” be classified here as “other languages” when both are extensively mapped? In actuality, it appears that the numbers in the box were meant to be placed on specific countries: “1: French and Sango,” for example, pertains to the Central African Republic. Unfortunately, that step was neglected.

Map 3 at least attempts to show areas of language overlap and multilingualism, although it does so in a crude manner. Note, for example, the extension of the North American French-speaking zone well out of Quebec into Newfoundland, Nova Scotia, and Maine. The map’s reduction of the vast majority of languages in the key to “local dialects, misc.” is risible. One also finds the Inuit language (actually, “languages”), spoken by fewer than 100,000 people, depicted as more significant than languages spoken by more than 100 million, such as Hindi, Bengali, and Indonesian. Yet the map simultaneously puts one of the main Inuit-speaking areas, the coastal strip of southern Greenland, in the “local dialects and miscellaneous” category.

Map 4—probably the worst of the lot—strictly depicts all sovereign states as linguistically uniform—except Canada. Switzerland and Belgium are simply mapped as French speaking—as, absurdly, are Romania, Vietnam, and even Albania. Equally egregious is the depiction of Thailand, Laos, Burma, Cambodia, and Malaysia as Mandarin speaking. The portrayal of the entire former Soviet Union as Russian speaking is also misleading, as is the mapping of Somalia and Eritrea as Arabic speaking (although Arabic is a co-official language of both countries). Note that Israel is also mapped as Arabic speaking. The Portuguese language is oddly ignored, and Portuguese-using Guinea-Bissau has been colored as a Francophone state. Francophone Burundi and Rwanda* have in contrast been depicted as Anglophone, whereas Anglophone Malawi and Swaziland have been excluded from the same category.

At first glance, Map 5 appears to be more comprehensive and sophisticated than the others—but it is not. This map violates basic protocols by placing individual languages and language families at the same level of analysis. Here, for example, one finds “German” rather than “Germanic” but at the same time “Turkic” rather than “Turkish.” Yet the Turkic language family is severely misconstrued, as Turkic-speaking Azerbaijan is placed in the non-existent “Caucasian” language family, whereas non-Turkic-speaking Mongolia is included. Dividing the Slavic family on the basis of script rather than linguistic relationship is inexcusable,** as is the use of imaginary language categories, such as “West African.” The depiction of Ethiopia as entirely Amharic speaking is problematic enough, but the placing of Somalia in the same category is indefensible. Belize, Haiti, Guyana, Suriname, and French Guiana are all incorrectly mapped as Spanish speaking. For India, Bangladesh, Bhutan, and Sri Lanka—as well as Mainland Southeast Asia—cop-out categories of geographical rather than linguistic reference are employed. Bizarrely, the “languages of [Mainland] SE Asia” class is extended to Madagascar and Melanesia. And in Europe, while the relatively minor language of Albanian is mapped, Albanian-speaking Kosovo is incorrectly depicted as Slavic speaking.

One might think that the Wikipedia map (#6) of world languages would be reasonably accurate—but one would again be mistaken. Although it is difficult to see, the map severely misconstrues the Caribbean, where neither Guadeloupe nor Martinique is depicted as French speaking, but Dominica—where the official language is English—is. (In the Pacific, Fiji and Samoa are also mapped as French speaking.) Dutch- and Papiamento-speaking Curaçao and Aruba, however, are portrayed as English speaking. All of Timor is depicted as Portuguese speaking, even the Indonesian half of the island. For both India and Pakistan, the archaic term “Hindustani” is employed, which is depicted as uniformly extending across India. Kyrgyzstan is shown as Russian speaking; although Russian is an official language, it is by no means the country’s major tongue. As with many other maps of this type, the extent of Arabic is exaggerated by including Eritrea, Somalia, and South Sudan. By the same token, the extent of French in Africa is overplayed, yet that of English in the same region is ignored altogether. And while Swahili is indeed an official language of Uganda, the country can hardly be regarded as Swahili speaking; English also has official status, and is more widely used. The mapping of Afghanistan as Persian speaking is justifiable, but the exclusion of Tajikistan from the same category is not. One might also ask why Italian merits depiction, but not Japanese, Turkish, Korean, Vietnamese, and Indonesian/Malaysian.

Map 7 does a better job than the others in depicting multilingualism. Yet it oddly depicts Guinea, Gabon, and Senegal as entirely French speaking, unlike the other Francophone countries of sub-Saharan Africa; compare also the divergent mapping of Mozambique and Angola in Lusophone (Portuguese speaking) Africa, and note the depiction of southern Africa as entirely English speaking. This map also mixes individual languages with language families (Turkic and Slavic), yet it manages to misconstrue its own categories. Note that Azerbaijan is incorrectly mapped as non-Turkic, just as Bosnia and Macedonia are incorrectly mapped as non-Slavic. Finnic-speaking Estonia, however, is put in the Slavic category.

Map 8 is the odd one out in this series, as it does not actually map languages, but rather merely provides information on the percentage of people who speak certain languages over five continent-like divisions of the world. I cannot imagine how this information could be useful to anyone in any circumstance. Note that it also makes errors in categorization, as it lists individual languages along with a language family of uncertain coherence (“Kwa”) and a certain type of language (“Creole”).

It is of course one thing to harshly criticize such maps and another to produce something better. Stay tuned for a GeoCurrents map of the world’s main languages later this summer.


*Rwanda is admittedly tending in an Anglophone direction—much to the consternation of France—but French is still more important than English.

**This distinction is also incorrectly applied, as Montenegro actually favors the Latin rather than the Cyrillic alphabet.

