Global-scale phylogenetic linguistic inference from lexical resources

Automatic phylogenetic inference plays an increasingly important role in computational historical linguistics. Most pertinent work is currently based on expert cognate judgments. This limits the scope of this approach to a small number of well-studied language families. We used machine learning techniques to compile data suitable for phylogenetic inference from the ASJP database, a collection of almost 7,000 phonetically transcribed word lists over 40 concepts, covering two thirds of the extant world-wide linguistic diversity. First, we estimated Pointwise Mutual Information scores between sound classes using weighted sequence alignment and general-purpose optimization. From this we computed a dissimilarity matrix over all ASJP word lists. This matrix is suitable for distance-based phylogenetic inference. Second, we applied cognate clustering to the ASJP data, using supervised training of an SVM classifier on expert cognacy judgments. Third, we defined two types of binary characters, based on automatically inferred cognate classes and on sound-class occurrences. Several tests are reported demonstrating the suitability of these characters for character-based phylogenetic inference.


Background & Summary
The cultural transmission of natural languages with its patterns of near-faithful replication from generation to generation, and the diversification resulting from population splits, are known to display striking similarities to biological evolution [1,2]. The mathematical tools to recover evolutionary history developed in computational biology -phylogenetic inference -play an increasingly important role in the study of the diversity and history of human languages. [3,4,5,6,7,8,9,10,11,12,13,14] The main bottleneck for this research program is the so far still limited availability of suitable data. Most extant studies rely on manually curated collections of expert judgments pertaining to the cognacy of core vocabulary items or the grammatical classification of languages. Collecting such data is highly labor intensive. Therefore sizeable collections currently exist only for a relatively small number of well-studied language families. [8,11,15,16,17,18] Basing phylogenetic inference on expert judgments, especially judgments regarding the cognacy between words, also raises methodological concerns. The experts making those judgments are necessarily historical linguists with some prior information about the genetic relationships between the languages involved. In fact, it is virtually impossible to pass a judgment about cognacy without forming a hypothesis about such relations. In this way, data are enriched with prior assumptions of human experts in a way that is hard to control or to precisely replicate.
Modern machine learning techniques provide a way to greatly expand the empirical base of phylogenetic linguistics while avoiding the above-mentioned methodological problem.
The Automated Similarity Judgment Program (ASJP) [19] database contains 40-item core vocabulary lists from more than 7,000 languages and dialects across the globe, covering about 75% of the extant linguistic diversity. All data are in phonetic transcription with little additional annotations. 1 It is, at the current time, the most comprehensive collection of word lists available.
Phylogenetic inference techniques comes in two flavors, distance-based and character-based methods. Distance-based methods require as input a matrix of pairwise distances between taxa. Character-based methods operate on a character matrix, i.e. a classification of the taxa under consideration according to a list of discrete, finite-valued characters. While some distance-based methods are computationally highly efficient, character-based methods usually provide more precise results and afford more fine-grained analyses.
The literature contains proposals to extract both pairwise distance matrices and character data from phonetically transcribed word lists. [20,21,22] In this paper we apply those methods to the ASJP data and make both a distance matrix and a character matrix for 6,892 languages and dialects 2 derived this way available to the community. Also, we demonstrate the suitability of the results for phylogenetic inference.
While both the raw data and the algorithmic methods used in this study are freely publicly available, the computational effort required was considerable (about ten days computing time on a 160-cores parallel server). Therefore the resulting resource is worth publishing in its own right.

Creating a distance matrix from word lists
In [20] a method is developed to estimate the dissimilarity between two ASJP word lists. The main steps will be briefly recapitulated here.

Pointwise Mutual Information
ASJP entries are transcribed in a simple phonetic alphabet consisting of 41 sound classes and diacritics. In all steps described in this paper, diacritics are removed. 3 This way, each word is represented as a sequence over the 41 ASJP sound classes.
The pointwise mutual information (PMI) between two sound classes is central for most methods used in this paper. It is defined as where s(a, b) is the probability of an occurrence of a to participate in a regular sound correspondence with b in a pair of cognate words, and q(x) are the probabilities of occurrence of x in an arbitrarily chosen word. Let "-" be the gap symbol. A pairwise alignment between two strings (x, y) is a pair of strings (x , y ) over sound class symbols and gaps of equal length such that x is the result of removing all gap occurrences in x , and likewise for y . A licit alignment is one where a gap in one string is never followed by a gap in the other string. There are two parameters gp 1 and gp 2 , the gap penalties for opening and extending a gap. The aggregate PMI of an alignment is where PMI (x i , y i ) is the corresponding gap penalty if x i or y i is a gap. For a given pair of ungapped strings (x, y), PMI (x, y) is the maximal aggregate PMI of all possible licit alignments between x and y. It can efficiently be computed with a version of the Needleman-Wunsch algorithm [23]. In this study, we used the function pairwise2.align.globalds of the Biopython library [24] for performing alignments and computing PMI scores between strings.

Parameter estimation
The probabilities of occurrence q(a) for sound classes a are estimated as relative frequencies of occurrence within the ASJP entries. The scores PMI (a, b) for pairs of sound classes (a, b) and the gap penalties are estimated via an iterative procedure.
In a first step, pairwise distances between languages are computed via the method described in the next subsection, using 1−LDN (x, y) instead of PMI (x, y) as measure of string similarity, where LDN (x, y) is the normalized Levenshtein distance [25] between x and y, i.e. the edit distance between x and y divided by the length of the longest string. All pairs of languages 4 (l 1 , l 2 ) with a distance ≤ 0.7 are considered as probably related. This is a highly conservative estimate; 99.9% of all probably related languages belong to the same language family and about 60% to the same sub-family.
Next, for each pair of probably related languages (l 1 , l 2 ) and each concept c, each word for c from l 1 is aligned to each word for c from l 2 . The pair of words with the lowest LDN score is considered as potentially cognate.
All pairs of potentially cognate words are aligned using the Levenshtein algorithm, and for each pair of sound classes (a, b), s 0 (a, b) is estimated as the relative frequency of a being aligned to b across all such alignments. Alignments to gaps are excluded from this computation. PMI 0 (a, b) is then calculated according to (1).
Suppose gap penalties gp 1 , gp 2 and a threshold parameter θ are given. The final PMI scores are estimated using an iterative procedure inspired by the Expectation Maximization algorithm [26]: • For i in 1 . . . 10: 1. All potential cognate pairs are aligned using the PMI i−1 -scores.
2. s i (a, b) is estimated as the relative frequency of a aligned with b among all alignments between potential cognates x, y with PMI i−1 (x, y) ≥ θ.
The target function f (gp 1 , gp 2 , θ) is the average distance between all probably related languages using the PMI 10 -scores. The values for gp 1 , gp 2 , θ are determined as those minimizing f , using Nelder-Mead optimization [27]. The following optimal values were found: gp 1  The final PMI scores between sound classes are visualized in Figure 1. It is easy to discern that PMI (a, a) is positive for all sound classes a, and that PMI (a, b) for a = b is negative in most cases. There are a few pairs a, b with positive score, such as b/f. Generally, sound class pairs with a similar place of articulation tend to have relatively high scores. This pattern is also visible in the cluster dendrogram. We observe a primary split between vowels and consonants. Consonants are further divided into labials, dentals, and velar/uvular sounds.

Pairwise distances between languages
When aggregating PMI similarities between individual words into a distance measure between word lists, various complicating factors have to be taken into consideration: • Entries for a certain language and a certain concept often contain several synonyms. This is a potential source of bias when averaging PMI similarities of individual word pairs.
• Cognate words tend to be more similar than non-cognate ones. However, the average similarity level between non-cognate words depends on the overall similarity between the sound inventories and phonotactic struc-ture of the languages compared. To assess the informativeness of a certain PMI similarity score, it has to be calibrated against the overall distribution of PMI similarities between non-cognate words from the languages in question.
• Many ASJP word lists are incomplete, so the word lists are of unequal length.
To address the first problem, [20] defined the similarity score between languages l 1 and l 2 for concept c as the maximal PMI similarity between any pair of entries for c from l 1 and l 2 .
The second problem is addressed by estimating, for each concept c for which both languages have an entry, the p-value for the null hypothesis that none of the words for c being compared are cognate. This is done in a parameter-free way. For each pair of concepts (c 1 , c 2 ), the PMI similarities between the words for c 1 from l 1 and the words for c 2 from l 2 are computed. The maximum of these values is the similarity score for (c 1 , c 2 ). Under the simplifying assumption that cognate words always share their meaning, 5 the distribution of such similarity scores for c 1 = c 2 constitutes a sample of the overall distribution of similarity scores between non-cognates. Now consider the null hypothesis that the words for concept c are noncognate. We assume a priori that cognate word pairs are more similar than non-cognate ones. Let the similarity score for c be x. The maximum likelihood estimate for the p-value of that null hypothesis is is the relative frequency noncognate pairs with a similarity score ≥ x. If P M I(c i , c j ) is the similarity score between concept c i and c j , we have Analogously to Fisher's method [29], the p-values for all concepts are combined according to the formula If the null hypothesis is true for concept c, p c is distributed approximately according to a continuous uniform distribution over the interval (0, 1]. Accordingly, − log p c is distributed according to an exponential distribution with mean and variance = 1. 6 Suppose there are N concepts for which both l 1 and l 2 have an entry. The sum of N independently distributed random variables, each with mean and variance = 1, approximately follows a normal distribution with mean = N and variance = N . This can be transformed into a Z-statistic by normalizing according to the formula This normalization step addresses the third issue mentioned above, i.e., the varying length of word lists. Z(l 1 , l 2 ) increases with the degree of similarity between l 1 and l 2 . It is transformed into a dissimilarity measure 7 as follows: The maximal possible value Z max for Z would be achieved if both word lists have the maximal length of N = 40, and each synonymous word pair has a higher PMI score than any non-synonymous word pair. Therefore The minimal value Z min for Z would be achieved if all p c equal 1 and both word lists have length 40: We computed d(l 1 , l 2 ) for each pair of the above-mentioned 6,892 languages from the ASJP database. This distance matrix is available |https://osf.io/24be8/|. to a uniform distribution over (0, 1]. We have

Background
In [21] a method is developed to cluster words into equivalence classes in a way that approximates manual expert classifications. In this section this approach is briefly sketched.
The authors chose a supervised learning approach. They use word lists with manual expert cognate annotations from a diverse collection of language families, taken from [15,16,17,18,30]. A part of these goldstandard data were used to train a Support Vector Machine (SVM). For each pair of words (w 1 , w 2 ) from languages (l 1 , l 2 ), denoting concept c, seven feature values were computed: 1. PMI similarity. This is the string similarity measure according to [20] as described in the previous section.
2. Calibrated PMI distance. p c as defined in equation (3) above.
6. Average word length of words for concept c across all languages from the database, measured in number of symbols in ASJP transcription.

7.
Concept-language correlation. The Pearson correlation coefficient between feature 3 and feature 4 for all word pairs expressing concept c.
For each such word pair, the goldstandard contains an evaluation as cognate (1) or not cognate (0). An SVM was trained to predict these binary cognacy labels. Applying Platt scaling [31], the algorithm predics a probability of cognacy for each pair of words from different languages denoting the same concept. These probabilities were used as input for hierarchical clustering, yielding a partitioning of words into equivalence classes for each concept.
The authors divided the goldstandard data into a training set and a test set. Using an SVM trained with the training set, they achieve B-cubed F-scores [32] between 66.9% and 90.9% on the data sets in their test data, with a weighted average of 71.8% when comparing automatically inferred clusters with manual cognate classifications.

Creating a goldstandard
We adapted this approach to the task of performing automatic cognate classification on the ASJP data. Since ASJP contains data from different families and it is confined to 40 core concepts (while the data used in [21] partially cover 200-item concept lists), the method has to be modified accordingly.
We created a goldstandard dataset from the data used in [22] (which is is drawn from the same sources as the data used in [21] but has been manually edited to correct annotation mistakes). Only the 40 ASJP concepts were used. Also, we selected the source data in such a way that each dataset is drawn from a different language family. Words from different families were generally classified as non-cognate in the goldstandard. All transcriptions were converted into ASJP format. Table 1

Clustering
We used the Label Propagation algorithm [41] for clustering. For each concept, a network is constructed from the words for that concept. Two nodes are connected if and only if their predicted probability of cognacy is ≥ 0.25. Label Propagation detects community structures within the network, i.e., it partitions the nodes into clusters.

Model selection
To identify the set of features suitable for clustering the ASJP data, we performed cross-validation on the goldstandard data. The data were split into a training set, consisting of the data from six randomly chosen language families, and a test set, consisting of the remaining data. We slightly deviated from [21] by replacing features 4 and 5 by language distance d(l 1 , l 2 ) as defined in equation (6), and − log(1 − d(l 1 , l 2 )). Both are linear transformations of the original features and therefore do not affect the automatic classification. For each of the 127 non-empty subsets of the seven features, an SVM with an RBF-kernel was trained with 7,000 randomly chosen synonymous word pairs from the training set. 8 The trained SVM plus Platt scaling were used to predict the probability of cognacy for each synonymous word pair from the test set, and the resulting probabilities were used for Label Propagation clustering. This Figure 2: Expert cognacy judgments (left) and prediction of cognacy (right) depending on the selected features procedure was repeated ten times for random splits of the goldstandard data into a training set and a test set.
For each feature combination, the B-cubed F-score, averaged over the ten training/test splits, was determined. The best performance (average B-cubed F-score: 0.86) was achieved using just two features: • Word similarity. The negative logarithm of the calibrated PMI distance, and • Language log-distance. − log(1 − d(l 1 , l 2 )), with d(l 1 , l 2 ) as defined in equation (6). Figure 2 displays, for a sample of goldstandard data, how expert cognacy judgments depend on these features and how the trained SVM+Platt scaling predicts cognacy depending on those features. Most cognate pairs are concentrated in the lower right corner of the feature space, i.e., they display both high word similarity and low language log-distance. The SVM learns this non-linear dependency between the two features.

Clustering the ASJP data
A randomly selected sample of 7,000 synonymous word pairs from the goldstandard data were used to train an SVM with an RBF-kernel, using the two features obtained via model selection. Probabilities of cognacy for all pairs of synonymous pairs of ASJP entries were obtained by (a) computing word similarity and language log-distance, (b) predict their probability of cognacy using the trained SVM and Platt scaling, and (b) apply Label Propagation clustering.

Distance-based
The language distances according to the definition in equation (6) can be used as input for distance-based phylogenetic infernce. In the experiments reported below, we used the BIONJ [42] algorithm for that purpose.

Character-based
We propose two methods to extract discrete character matrices from the ASJP data.
1. Automatically inferred cognate classes. We defined one character per automatically inferred (in the sense described above) cognate class cc. If the word list for a language l has a missing entry for the concept the elements of cc refer to, the character is undefined for this language.
Otherwise l assumes value 1 if its word list contains an element of cc, and 0 otherwise. The motivation for these two types of characters is that they track two different aspects of language change. Cognacy characters contain information about lexical changes, while soundclass-concept characters also track sound changes within cognate words. Both dimensions provide information about language diversification.
Let us illustrate this with two examples.
• The Old English word for 'dog' was hund, i.e., hund in ASJP transcription. It belongs to the automatically inferred cognate class dog_149. The Modern English word for that concept is dog/dag, which belongs to class dog_150. This amounts to two mutations of cognate-class characters between Old English and Modern English, 0 → 1 for dog_150 and 1 → 0 for dog_149.
The same historic process is also tracked by the sound-concept characters; it corresponds to seven mutations: 0 → 1 for dog:d, dog:a and dog:g, and 1 → 0 for dog:h, dog:u, dog:n and dog:d.
• The word for 'tree' changed from Old English treow (treow) to Modern English tree (tri). Both entries belong to cognate class tree_17. As no lexical replacement took place for this concept, there is no mutation of cognate-class characters separating Old and Modern English here. The historical sound change processes that are reflected in these words are captured by mutations of sound-concept characters: 0 → 1 for tree:i and 1 → 0 for tree:e, tree:o and tree:w.
For a given sample of languages, we use all variable characters (i.e., characters that have value 1 and value 0 for at least one language in the sample) from both sets of characters. Phylogenetic inference was performed as Maximum-Likelihood estimation assuming Γ-distributed rates with 25 rate categories, and using ascertainment bias correction according to [43]. Base frequencies and variance of rate variation were estimated from the data.
In our phylogenetic experiments, the distance-based tree was used as initial tree for tree search. This method was applied to three character matrices: • cognate class characters, • soundclass-concept characters, and • a partitioned analysis using both types of characters simultaneously.
Inference was performed using the software RAxML [44].
Applying more advanced methods of character-based inference, such as Bayesian inference [45,46,47] proved to be impractical due to hardware limitations.

Data Records
All data that were produced are available at |https://osf.io/cufv7/| as well.

Phylogenetic inference
To evaluate the usefulness of the distance measure and the character matrices defined above for phylogenetic inference, we performed two experiments: • Experiment 1. We applied both distance-based inference and characterbased inference for all language families (according to the Glottolog classification) containing at least 10 languages in ASJP.
• Experiment 2. We sampled 100 sets of languages with a size between 20 and 400 at random and applied all four methods of phylogenetic inference to each of them.
In both experiments, each automatically inferred phylogeny was evaluated by computing the Generalized Quartet Distance (GQD) [48] to the Glottolog expert tree (restricted to the same set of languages).
The results of the first experiment are summarized in Table 2 and visualized in Figure 3. The results for the individual families are given in Table 5.
Aggregating over all families suggests that distance-based inference produces the best fit with the expert goldstandard. However, a closer inspection of the results reveals that the performance of the different phylogenetic inference methods depend on the size of the language families (measured in number of taxa available in ASJP). Combining both types of characters in a partitioned model always leads to better results than the two character types individually. While distance-based inference is superior for small language families (less than 20 taxa), character-based inference appears to be about equally good for mediumsized (20-199 taxa) and large (more than 200 taxa) language families. This assessment is based on a small sample size since there are only 33 medium-sized and 6 large language families. The results of experiment 2 confirm    Table 3 and illustrated in Figure 4. All four methods improve with growing sample size, but this effect is more pronounced for character-based inference. While combined characterbased inference and distance-based inference are comparable in performance for smaller samples of languages (n ≤ 100), character-based inference outperforms distance-based inference for larger samples, and the difference grows with sample size. The same pattern is found when the different versions of phylogenetic inference is applied to the full dataset of 6,892 languages. We find the following GQD values:  • ML tree from combined character data (|https://osf.io/dg5jh/|): 0.035

Relation to geography
Both the distances between languages and the two methods to represent languages as character vectors are designed to identify similarities between word lists. There are essentially three conceivable causal reasons why the word lists from two languages are similar: (1) common descent, (2) language contact and (3) universal tendencies in sound-meaning association due to sound symbolism, nursery forms etc. [49]. The third effect is arguably rather weak. The signal derived from common descent and from language contact should be correlated with geographic distance. If the methods proposed here extract a genuine signal from word lists, we thus expect to find such a correlation.
To test this hypothesis, we computed the geographic distance (great-circle distance) between all pairs from a sample of 500 randomly selected languages, using the geographic coordinates supplied with the ASJP data.
We furthermore extracted pairwise distances from character vectors by computing the cosine distance between those vectors, using only characters for which both languages have a defined value. In this way we obtained three matrices of pairwise linguistic distances for the mentioned sample of 500 languages: (1) The distance as defined in equation (6), called PMI distance, (2) the cosine distance between the cognate-class vectors, and (3) the cosine distance between the sound-concept vectors.
All three linguistic distance measures show a signficant correlation with geographic distance. The Pearson correlation coefficient for PMI distances is 0.252 (p=0.001 according to the Mantel test), 0.329 (p=0.001) for cognate-class dis-tance and 0.140 (p = 0.001) for sound-concept distance. Figure 5 shows the corresponding scatter plots.

Figure 5: Geographic vs. linguistic distances between languages
The visualization suggests that for all three linguistic distance measures, we find a signal at least up to 5,000 km. This is confirmed by the Mantel correlograms [50] shown in Figure 6. We find a significant positive correlation Figure 6: Mantel correlograms. Blue: significant, red: non-significant at p < 0.05.
with geographic distance for up to 5,000 km for PMI distance, and up to 4,000 km for cognate-class distance and sound-concept distance.

Usage Notes
Character-based inference from expert cognacy judgment data have been used in various downstream applications beyond phylogenetic inference, such as estimating the time course of prehistoric population events [3,7,9] or the identification of overarching patterns of cultural language evolution [5,51]. In this section it will be illustrated how the automatically inferred characters described above can be deployed to expand the scope of such investigations to larger collections of language families.

A case study: punctuated language evolution
A few decades ago, [52] proposed that biological evolution is not, in general, a gradual process. Rather, they propose, long periods of stasis are separated by short periods of rapid change co-occurring with branching speciation. This model goes by the name of punctuated equilibrium. This proposal has initiated a lively and still ongoing discussion in biology. Pagel, Venditti and Meade [53] developed a method to test a version of this hypothesis statistically. They argue that most evolutionary change occurs during speciation events. Accordingly, we expect a positive correlation between the number of speciation events a lineage underwent throughout its evolutionary history and the amount of evolutionary change that happened during that time.
Estimates of both quantities can be read off a phylogenetic tree -the number of speciation events corresponds to the number of branching nodes, and the amount of change to the total path length -provided (a) the tree is rooted and (b) branch lengths reflect evolutionary change (e.g., the expected number of mutations of a character) rather than historical time. In [53] a significant correlation is found for biomolecular data, providing evidence for punctual evolution.
In [51], the same method is applied to the study of language evolution, using expert cognacy data from three language families (Austronesian, Bantu, Indo-European). The study results in strong evidence for punctuated evolution in all three families.
We conducted a similar study for all Glottolog language families with at least 10 ASJP languages. The workflow was as follows. For each family F : • Find the language o ∈ F which has the minimal average PMI distance to the languages in F . This language will be used as outgroup.
• Infer a Maximum-Likelihood tree over the taxa F ∪ {o} with the Glottolog classification as constraint tree, using a partitioned analysis with cognateclass characters and soundclass-concept characters.
• Use o as outgroup to root the tree; remove o from the tree.
• Apply the δ-test [54] to control for the node density artifact.
• Perform Phylogenetic Generalized Least Square [55] regression with rootto-tip path lengths for all taxa as independent and root-to-tip number of nodes as dependent variable.
• If the δ-test is negative and the regression results in a significantly positive slope, there is evidence for punctuated evolution in F .
Among the 66 language families considered, the δ-test was negative for 43 families. We applied Holm-Bonferroni correction for multiple testing to determine significance in the regression analysis. The numerical results are given in Table 4. A significant positive dependency was found for the seven largest language families (Atlantic-Congo, Austronesian, Indo-European, Afro-Asiatic, Sino-Tibetan, Nuclear Trans-New Guinea, Pama-Nyungan). The relationships for these families are visualized in Figure 7. No family showed a significant negative dependency. This strengthend the conclusion of Atkinson et al. [51] that languages evolve in punctuational bursts.