The dating of ancient languages by a technique called glottochronology is undergoing a revival, stimulated by the computational and statistical methods used to tease out evolutionary relationships in biology.
This year marks a half-century since the publication [1] of a famous discovery that, by elucidating a remarkable mechanism for preserving and conveying information, led to a better understanding of our human heritage. This feat was performed by a partnership between an unconventional newcomer from a different field and a Cambridge scholar with specialist knowledge. Their insights depended on careful observations made by another academic, whose premature death from cancer ended her chances of sharing in the ultimate accolades.
The exploit described above is the decipherment of Linear B, an enigmatic scrawl found on clay tablets excavated at Mycenaean archaeological sites that proved to be an early example of script based on syllables, and a milestone on the path to alphabetic writing systems. An inventive young architect, Michael Ventris, joined forces with John Chadwick, a classicist with expertise in ancient dialects, to break the code and show (against the prevailing view) that Linear B was an early form of Greek. Another philologist, Alice Kober, had catalogued regularities dubbed 'triplets' in the texts that proved crucial to understanding the inflectional structure of the underlying linguistic system. However, she disparaged Ventris's early efforts and didn't live to see the final breakthrough.
Beneath the surface similarities of this narrative to the discovery of the DNA double helix lie deeper connections that have long been observed between DNA and language, as well as between the evolution of species and of languages [2]. Writing on page 435 of this issue, Gray and Atkinson [3] bring an evolutionary biology mindset to the subject of historical linguistics. The issue they tackle is the origin of the Indo-European languages, a vast family of highly diverse tongues (see Fig. 1 of the paper, page 437). But at least as notable as their conclusions are their methods, which involve applying tools developed largely for reconstructing evolutionary relationships between species (phylogenies) using the information preserved in DNA.
Their work builds on foundations laid by the linguist Morris Swadesh, who, at just the time DNA and Linear B were being solved, was developing lexicostatistics, the quantitative study of vocabularies and vocabulary change. As a tool for inferring family trees of languages, lexicostatistics represented a departure from historical linguistic methods that characterized systematic, global patterns of change using comprehensive vocabularies and detailed knowledge of grammar and pronunciation. Swadesh focused instead on limited core vocabularies of 100–200 words (now called Swadesh lists), which reflect the most fundamental concepts expressed in any language and are expected to be relatively resistant to change. He then looked for similarities in the corresponding words in related languages, identifying the cognates (words presumed to derive from a common ancestor, or what in the realm of genes are called orthologues), thereby creating a straightforward and relatively objective metric of language kinship that made no attempt to account for the actual process of change. Simply put, two languages sharing 85% of cognates were judged to be more closely related than those sharing only 75%, and consequently to have split more recently.
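The similarity measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three languages, the four meanings and the cognate-class labels are invented toy data, and the function names are hypothetical rather than drawn from Swadesh's work or any existing library.

```python
from itertools import combinations

# Hypothetical toy data: for each meaning on a short word list, every language
# gets a cognate-class label. Equal labels mean the two words are judged to be
# cognate. All labels are invented for illustration.
COGNATE_CLASSES = {
    "lang_a": {"hand": 1, "water": 1, "two": 1, "fire": 1},
    "lang_b": {"hand": 1, "water": 1, "two": 1, "fire": 2},
    "lang_c": {"hand": 2, "water": 1, "two": 1, "fire": 3},
}


def shared_cognate_fraction(a, b):
    """Fraction of meanings found in both lists whose cognate classes match."""
    meanings = a.keys() & b.keys()
    if not meanings:
        raise ValueError("the two word lists have no meanings in common")
    matches = sum(1 for m in meanings if a[m] == b[m])
    return matches / len(meanings)


def pairwise_similarity(table):
    """Shared-cognate fraction for every unordered pair of languages."""
    return {
        (x, y): shared_cognate_fraction(table[x], table[y])
        for x, y in combinations(sorted(table), 2)
    }


if __name__ == "__main__":
    for pair, score in pairwise_similarity(COGNATE_CLASSES).items():
        print(pair, score)
```

With this toy table, lang_a and lang_b share three of four cognates (0.75), whereas each shares only two of four with lang_c (0.5), so a lexicostatistical reading would group lang_a with lang_b. Note that the score is a single number per pair and records nothing about which words changed.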
Lexicostatistics was immediately controversial, particularly where Swadesh attempted to extend it to determine absolute times since language divergences by a method called glottochronology. This entailed determining an average rate of word substitution over time for recent language histories, and then extrapolating backwards. In postulating that word 'half-lives' are intrinsic constants of language, Swadesh anticipated by more than a decade the notion of the 'molecular clock' that is central to sequence-based evolutionary biology [2].
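The backward extrapolation can be made concrete with the constant-rate formula usually associated with glottochronology. If each lineage independently retains a fraction r of its core vocabulary per millennium, two languages that split t millennia ago are expected to share a fraction c = r^(2t) of cognates, so t = ln c / (2 ln r). The sketch below assumes that formula; the default retention rate of 0.86 per millennium is an illustrative value, not one given in this article, and the function names are hypothetical.

```python
import math


def expected_shared(t_millennia, retention=0.86):
    """Expected shared-cognate fraction after t millennia of separation.

    Assumes both lineages lose words independently at one constant rate.
    The default retention rate is an illustrative assumption.
    """
    return retention ** (2 * t_millennia)


def divergence_time(shared, retention=0.86):
    """Estimated millennia since two languages split, under a constant rate."""
    if not 0 < shared <= 1:
        raise ValueError("shared must lie in (0, 1]")
    if not 0 < retention < 1:
        raise ValueError("retention must lie strictly between 0 and 1")
    return math.log(shared) / (2 * math.log(retention))


if __name__ == "__main__":
    for shared in (0.85, 0.75):
        print(f"{shared:.0%} shared cognates: "
              f"about {divergence_time(shared):.2f} millennia")
```

Under these assumptions a pair sharing 75% of cognates comes out older than a pair sharing 85%, as in the example above. The estimate is only as good as the assumption that r is the same for every word, every language and every period, which is precisely what critics of the method disputed.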
In fact there are striking correspondences between the objections to Swadesh's methods and issues faced in phylogenetic reconstruction [2]. Why should we believe that languages (or genomes) evolve at a constant rate over time, or that individual words and word classes (or proteins and protein families) have the same inherent rates as one another [4]? Do Swadesh lists (or core gene sets common to many genomes [5]) fairly represent overall evolutionary processes? How can we be sure that similarities reflect common ancestry, rather than chance convergences [6]? Conversely, how do we reliably recognize distant relatives whose spellings have drifted far apart (into what biologists call the 'twilight zone' of sequence similarity)? Why should we even presume that the 'tree of language' (or that of life) is a tree, as opposed to a sort of network, given that lexical borrowings and language mixture (and horizontal transfer between genomes [7]) are well-known occurrences? Box 1 provides some examples.
Over the years, historical linguists and evolutionary biologists have separately tackled such challenges with steadily increasing sophistication [8], for instance supplanting Swadesh's overall similarity measure (which phylogeneticists would call a 'distance' method) with cladistic techniques that account for each word (or gene, or sequence residue, or any other observable character) to model the actual process of evolution [6]. Gray and Atkinson [3] apply the latest computational tools (maximum-likelihood models and Bayesian inference techniques, which offer a good framework for dealing with issues such as variable rates) to a data set of Indo-European languages developed and refined by lexicostatisticians [4].
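The difference between a distance method and a character-based one shows up in how the data are encoded. A distance method reduces each pair of languages to one number, whereas character-based methods keep a separate column for every observable character. The sketch below uses one common encoding, assumed here for illustration and not described in this article: each (meaning, cognate class) pair becomes a binary character that a language either has (1) or lacks (0). The toy table and the function name are hypothetical.

```python
# Hypothetical toy data: cognate-class labels per meaning, invented for
# illustration only.
TABLE = {
    "lang_a": {"hand": 1, "fire": 1},
    "lang_b": {"hand": 1, "fire": 2},
    "lang_c": {"hand": 2, "fire": 2},
}


def binary_characters(table):
    """Encode cognate classes as a language-by-character 0/1 matrix.

    Returns (characters, matrix). Each character is a (meaning, class) pair;
    each matrix row lists, in the same order, whether the language has a
    word belonging to that cognate class.
    """
    characters = sorted(
        {(meaning, cls) for words in table.values() for meaning, cls in words.items()}
    )
    matrix = {
        lang: [1 if words.get(meaning) == cls else 0 for meaning, cls in characters]
        for lang, words in table.items()
    }
    return characters, matrix


if __name__ == "__main__":
    characters, matrix = binary_characters(TABLE)
    print(characters)
    for lang, row in matrix.items():
        print(lang, row)
```

In this toy matrix the character ("hand", 1) groups lang_a with lang_b, while ("fire", 2) groups lang_b with lang_c. A likelihood or Bayesian model can weigh such conflicting signals character by character, which a single pairwise similarity score cannot do.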
The topology of the resulting tree of languages holds no surprises for linguists, but the work goes further to estimate absolute ages with statistical support. Calibrating and cross-validating various branchings against known historical events, Gray and Atkinson develop robust confidence intervals for the date of the root of the tree, whence the Indo-European languages arose. This range appears to be several millennia too early to support a prominent theory that the proto-language was disseminated by nomadic Kurgan horsemen from the steppes of Asia, beginning about 6,000 years ago. But the dates fit well with the notion that Indo-European originated among nascent farming communities in Anatolia, in modern-day Turkey, some 2,000–4,000 years before that.
By breathing new life into glottochronology, this work will doubtless revive old debates. But at the same time it should stimulate even more cross-fertilization of ideas among those studying the intertwined trees of life and language.
1. Ventris, M. & Chadwick, J. J. Hellenic Stud. 73, 84–103 (1953).
2. Searls, D. B. Nature 420, 211–217 (2002).
3. Gray, R. D. & Atkinson, Q. D. Nature 426, 435–439 (2003).
4. Kruskal, J. B., Dyen, I. & Black, P. in Lexicostatistics in Genetic Linguistics (ed. Dyen, A.) 30–55 (Mouton, The Hague, 1973).
5. Makarova, K. S. & Koonin, E. V. Genome Biol. 4, 115 (2003).
6. Ringe, D., Warnow, T. & Taylor, A. Trans. Philol. Soc. 100, 59–129 (2002).
7. Brown, J. R. Nature Rev. Genet. 4, 121–132 (2003).
8. Renfrew, C., McMahon, A. & Trask, L. (eds) Time Depth in Historical Linguistics (McDonald Inst. Archaeol. Res., Cambridge, UK, 2000).