Indo-European Origins
The Linguistic Geography of the Wikipedia

One of the highlights of the Association of American Geographers meeting last week in Seattle was the annual Geography Bowl. Student teams competed to answer all manner of geographical questions, including a few that were devilishly difficult. The most impressive answer may have come in the final round, when the two remaining teams were asked to list the five top languages, after English, used in Wikipedia articles. The Middle Atlantic team buzzed in almost immediately, and one of its members confidently and correctly recited, “German, French, Polish, Italian, and Spanish.”

Both the lack of Chinese and the presence of Polish seemed extraordinary, prompting me to query the team after the contest. The response referenced the well-known cultural pride of the Poles, as well as the fact that Polish has roughly 40 million speakers, a considerable number.

An article in the “meta-wiki” provides detailed information on the use of the 281 languages in which Wikipedia articles have been written. The posted table lists the top fourteen of these languages, with their respective numbers of articles (in rounded figures). As one can see, Chinese is represented here, coming in twelfth place, between Swedish and Catalan. Such a showing is hardly impressive, however, given that more than a billion people speak Mandarin Chinese, whereas only around 10 million speak Swedish and 11.5 million Catalan. But neither is the showing of the top Wikipedia language, English. To demonstrate relative Wiki language standings, I calculated the number of articles per 1,000 total* speakers for each of the top fourteen Wikipedia languages. Here English is far surpassed by a number of other languages. Given that most Swedish, Dutch, and Norwegian Wikipedia users are fully fluent in English, the quantity of articles appearing in their native languages is impressive indeed. (Admittedly, articles in languages other than English are often translated from an English original.)
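
For readers who want to reproduce the articles-per-1,000-speakers figure, here is a minimal Python sketch of the arithmetic. It is an illustration, not the spreadsheet used for the table. The function name and the `quoted_figures` dictionary are hypothetical, and the only inputs are two pairs of approximate numbers quoted later in this post (Waray-Waray and Hausa). Those speaker counts may not be "total speakers" in the strict sense used for the top-fourteen table.

```python
def articles_per_thousand_speakers(articles: int, speakers: int) -> float:
    """Return the number of Wikipedia articles per 1,000 speakers."""
    if speakers <= 0:
        raise ValueError("speakers must be a positive number")
    return articles / speakers * 1000


# Approximate (articles, speakers) pairs quoted later in this post.
# These are illustrative inputs, not the top-fourteen table itself.
quoted_figures = {
    "Waray-Waray": (102_000, 3_500_000),
    "Hausa": (263, 43_000_000),
}

ratios = {
    name: articles_per_thousand_speakers(articles, speakers)
    for name, (articles, speakers) in quoted_figures.items()
}

# Print the languages from the highest ratio to the lowest.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:.3f} articles per 1,000 speakers")
```

On these inputs Waray-Waray comes out at roughly 29 articles per 1,000 speakers, while Hausa falls well below 0.01, which is the kind of gap the per-speaker measure is meant to expose.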

Overall, European languages dominate the Wikipedia list. A number of major non-European languages rank relatively high (Vietnamese coming in 17th place, Korean 21st, Indonesian 22nd, and Arabic 25th), but they are still surpassed by European languages with far fewer speakers. Several important Asian languages, moreover, rank very low: Bengali, for example, with more than 230 million speakers, is outranked by Luxembourgish, Welsh, and Icelandic, none of which even approaches one million speakers. Sub-Saharan African languages are least represented. Swahili ranks a respectable 75th, with more than 21,000 articles, but Hausa, a major language spoken by 43 million people, ranks 245th, with only 263 articles. By this metric, Hausa is bested by such obscure tongues as Norfolk and Nauruan, and even by long-deceased Gothic.

Another notable feature of the list is the relatively large number of articles written in non-national European languages, many of which are often regarded as mere dialects. In Spain alone, Asturian is used for more than 14,000 articles, Aragonese for more than 25,000, and Galician for more than 70,000. Local linguistic pride, regionalism, and sub-state nationalism are no doubt responsible for such elevated numbers. Such processes are largely but not entirely limited to Europe. In the Philippines, the obscure tongue of Waray-Waray (3.5 million speakers) has an amazingly large Wikipedia presence, its 102,000 articles far overshadowing the 51,000 written in the national language, Tagalog (Filipino).

A final oddity is the relatively high ranking of artificial languages. Almost as many Wikipedia articles are written in Esperanto as in Arabic, and the constructed language Volapük bests Hebrew, Hindi, Thai, and Greek. Ido, with an estimated 100-1200 speakers, boasts more than 21,000 articles, while Interlingua has more than 5,000, Novial more than 2,500, Interlingue (“Occidental”) almost 2,000, and Lojban over 1,000. So-called dead languages are also reasonably well represented, with Latin being used for more than 52,000 articles, Old English (Anglo-Saxon) for 2,600, and Pali for 2,300. Artificial languages from fictional societies, however, do not make the list, even though such tongues as Na'vi and Klingon have plenty of aficionados. The explanation comes in a footnote: “The Klingon language edition of the Wikipedia is no longer hosted by Wikimedia and is now hosted by Wikia as Klingon Wiki.”

* As opposed to native speakers.

