Abstract
Speech consists of a continuous stream of acoustic signals, yet humans can segment words and other constituents from each other with astonishing precision. The acoustic properties that support this process are not well understood and remain understudied for the vast majority of the world’s languages, in particular regarding their potential variation. Here we report cross-linguistic evidence for the lengthening of word-initial consonants across a typologically diverse sample of 51 languages. Using Bayesian multilevel regression, we find that on average, word-initial consonants are about 13 ms longer than word-medial consonants. The cross-linguistic distribution of the effect indicates that despite individual differences in the phonology of the sampled languages, the lengthening of word-initial consonants is a widespread strategy to mark the onset of words in the continuous acoustic signal of human speech. These findings may be crucial for a better understanding of the incremental processing of speech and speech segmentation.
Main
Speech is a continuous stream of acoustic signals that transmit linguistic meaning with the purpose of spoken communication. The intricate process of comprehending speech demands the sequential segmentation of the acoustic signal into discrete units such as words and phrases, which are the basic building blocks of language1,2,3,4. This segmentation is supported by a complex interaction of factors that operate on the levels of sound structure, lexicon and grammar, both for the speaker and for the listener. Several of these factors have been identified in previous research, but few have been studied across a wide range of languages. Most previous studies on speech production and processing focus on ‘Western, educated, industrial, rich and democratic (WEIRD)’ people and their languages, which undermines the potential to make species-wide generalizations about human language and cognition5,6. For the factors that affect speech production, this emerges as a particularly severe limitation in light of the huge variability of grammars and sound systems of the world’s ~7,000 languages7,8,9.
Word onsets play a special role in speech segmentation and word recognition. In the lexicon, word-initial segments are known to be more informative than later segments for distinguishing the intended word from other words10, and listeners exploit this for continuously updating hypotheses regarding word identity and boundaries as the phonetic signal progresses11. At the level of phonology, word-initial positions generally exhibit more ‘fortition’ (stronger articulation) and fewer ‘lenition’ (weaker articulation) processes than word-internal or word-final positions and are thus assigned a prominent status in phonological theories12,13,14,15. Complex consonant clusters that are restricted to word onsets through phonotactic constraints may serve as additional cues for word segmentation16. However, there is considerable cross-linguistic variation in this respect, and many languages lack consonant clusters altogether. This implies that clusters cannot be a universal method to segment speech into word units. Other, more general strategies may be more relevant instead.
Acoustic features such as modulations of segment duration and changes in fundamental frequency play a major role in structuring speech into different units. Among these features, the lengthening of vowels at the ends of prosodic phrases, clauses or utterances is attested across a wide variety of languages17,18 and is often assumed to be universal19. At the word level, the acoustic properties of word-initial phones have been argued to be particularly relevant for the prosodic organization of some languages, including English, Korean and French19,20,21. The realization of these word-initial phones may depend on language-specific properties, such as prosodic systems and consonant inventories, but also on between-speaker variation22,23. However, so far most of the evidence for these features comes from a handful of languages, most of them Indo-European.
Two closely related features of word-initial phones that have been reported for individual languages are initial lengthening and strengthening. While initial strengthening implies a stronger articulation19,23,24, initial lengthening refers to the duration of consonants. This is illustrated in Fig. 1 from the Amazonian language Mojeño Trinitario, which is also included in our sample. The example illustrates the same consonant /n/ in three different positions: utterance-initial (50 ms), word-internal (50 ms) and word-initial (100 ms). In artificial language learning experiments, it has been shown that speakers of Hungarian, Italian and English can use word-initial consonant lengthening as a cue to locate word boundaries21. Similarly, word-initial strengthening has been found to facilitate disambiguation between similar lexical items25. However, very little is known about the extent and degree of word-initial lengthening across languages. For words in utterance-initial position, it is not clear whether they display any additional temporal changes. In previous studies, utterance-initial consonants have been found to sometimes be lengthened or shortened, but with an overall small change in duration26,27. Indeed, from a functional perspective, it makes sense that no additional cue to word segmentation is necessary at the beginning of utterances, especially after a pause26,28. To our knowledge, the cross-linguistic evidence for initial lengthening processes remains scarce, and neither word- nor utterance-initial lengthening has been investigated in a worldwide sample of languages.
Our main research question is whether we can find cross-linguistic evidence for word-initial lengthening or shortening effects in observed speech across a wide range of languages. We also investigate whether we can find such an effect at utterance-initial positions. Following this, we analyse the cross-linguistic distribution of any emergent effects. To be able to make valid generalizations across languages, we also control for between-speaker variability and analyse the lengthening and shortening effects across segments with different places and manners of articulation.
Results
Evidence for word-initial lengthening across languages
We used a comprehensive corpus consisting of spontaneous speech from 51 languages, shown in Fig. 1a, recorded from 393 speakers (195 female, 198 male) of an age range between 16 and 100 years29. Of these 51 languages, 49 are spoken by non-WEIRD populations5,6. The languages in our sample display a wide range of sound inventories and prosodic systems and cover a wide spectrum of grammars. The main units of our analysis are phones (discrete segments of speech); words, as defined by experts on each language; and utterances, which we define as interpausal units—that is, chunks of speech that are not interrupted by a silent pause. The entire corpus consists of over two million phones, all of which have been time-aligned semi-automatically30. Of these, we used 874,627 phones for this study (see Methods for information on data filtering). For 49 of 51 languages, our analysis included more than 10,000 data points.
We used Bayesian linear regression to estimate the effect of word-initial and utterance-initial positions on the duration of consonants, compared with word-internal positions. We modelled the effect of both positions with a population-level estimate that is allowed to vary between all languages in the sample. For a more conservative analysis, we allowed for variation of the effects between speakers of the same language. This ensures that any inference drawn from the model can be generalized over different speakers. Similarly, we allowed the model to vary between segments of different places and manners of articulation since lengthening effects influence each kind of segment differently20. We also controlled for consonant clusters and distinguished between three levels: the consonant is (1) at the beginning of a cluster, (2) in a cluster but not at the beginning or (3) not in a cluster. All levels are modelled as varying between each language. As fixed parameters, we controlled for word length (the number of phones in a word), word form frequency (of forms in the DoReCo corpus of each language) and local speech rate. The full model including prior distributions and likelihood function is given as Fig. 2. The likelihood function defines the response variable using a lognormal distribution, which transforms the response variable (duration in milliseconds) to a log scale. Converting to a log scale is a common transformation for duration measures in linguistics to compare orders of magnitude instead of comparing absolute differences in milliseconds31. The posterior distributions of parameter values in Bayesian regression studies are defined via their highest posterior density interval (HPDI), which describes the area of the distribution in which most of the sampled posterior values are represented32,33,34. In Bayesian statistics, the type S error rate for the posterior intervals is much lower than in comparable frequentist methods35.
Another measure to exclude spurious effects and to produce reliable results is to include a region of practical equivalence to 0 (ROPE)36. The ROPE is values near 0 (−0.01 to 0.01 on the log scale) that we consider not to be meaningful. In the complete absence of an effect, the posterior distribution would be fully within the ROPE37. We interpret 89% HPDIs not overlapping the ROPE as evidence in favour of an effect. If the 89% HPDI overlaps the ROPE, we take the evidence as inconclusive.
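The decision rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `hpdi` and `rope_decision` are hypothetical, the HPDI is approximated as the narrowest window that contains the requested share of sorted posterior draws, and the ROPE bounds are the ±0.01 log-scale values given in the text.

```python
def hpdi(samples, prob=0.89):
    """Approximate the HPDI as the narrowest interval holding `prob` of the draws."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, int(round(prob * n)))  # number of draws inside the interval
    # Slide a window of k consecutive sorted draws and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def rope_decision(samples, rope=(-0.01, 0.01), prob=0.89):
    """Classify a posterior by where its HPDI lies relative to the ROPE."""
    lower, upper = hpdi(samples, prob)
    if lower > rope[1]:
        return "lengthening"   # whole interval above the ROPE
    if upper < rope[0]:
        return "shortening"    # whole interval below the ROPE
    return "inconclusive"      # interval overlaps the ROPE
```

Applied to the posterior draws of one language's word-initial parameter, `rope_decision` would return one of the three labels used in the Results.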
The fitted model shows evidence for the word-initial lengthening of consonants in utterance-medial position for 43 of the 51 sampled languages. No language shows evidence in favour of word-initial shortening. For the languages for which we have evidence, the 89% HPDI does not intersect with zero or the values defined in the ROPE. The mean of the HPDI for the 43 languages ranges mostly between 0.1 and 0.2 on the log scale, which translates to an average effect between 8 ms and 18 ms for a segment 84 ms long (the mean duration of phones in the data). The cross-linguistic distribution provides us with high confidence in the reliability of our results. They strongly imply that the observation of lengthening of word-initial consonants in comparison with their word-internal counterparts can be generalized across languages. We show the posterior distributions for the word-initial parameter in all languages in Fig. 3.
Regarding utterance-initial positions, no language in our sample shows evidence in favour of lengthening. However, 15 languages show evidence for utterance-initial shortening. In these languages, the duration of consonants tends to be shorter in utterance-initial than in utterance-medial or final position. For the other 36 languages, the results are inconclusive. The HPDI of this distribution displays a weak tendency towards the shortening of utterance-initial consonants for some languages, but for others, the HPDI indicates a weak tendency towards their lengthening. None of those are interpretable, and no uniform cross-linguistic pattern emerges across the sample. We present the individual posterior distributions in Fig. 4.
Posterior distribution of control variables
The distribution of parameter values across the whole dataset is presented in Fig. 5. All values are on the log scale. Since the model was parameterized as treatment coding, the ‘non-initial’ level is modelled as the intercept, and both ‘utterance-initial’ and ‘word-initial’ compare directly to the ‘non-initial’ baseline. For the average consonant of 84.35 ms in our data, a lengthening on the log scale of 0.14 (the mean of the word-initial parameter) results in a lengthening of ~13 ms.
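The conversion from a log-scale coefficient to milliseconds can be checked with a short calculation. The sketch below assumes that the coefficient acts multiplicatively on the baseline duration, so that the change in milliseconds is baseline × (exp(β) − 1); the function name `log_effect_to_ms` is hypothetical.

```python
import math


def log_effect_to_ms(baseline_ms, log_effect):
    """Change in duration (ms) implied by a log-scale effect on a given baseline."""
    return baseline_ms * (math.exp(log_effect) - 1.0)


# Mean consonant duration and mean word-initial effect reported in the text.
print(round(log_effect_to_ms(84.35, 0.14), 2))  # 12.68, i.e. about 13 ms
```

Because the effect is multiplicative, the same coefficient implies a larger absolute lengthening for intrinsically longer consonants and a smaller one for shorter consonants.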
Word-form frequency has a small negative effect on duration with a mean of −0.02 (95% HPDI from −0.02 to −0.02) on the log scale. Similarly, word length in phones, measured as phones per word, has a small negative effect on duration with a mean of −0.03 (95% HPDI from −0.03 to −0.03) on the log scale. This is as predicted: segments in longer words are shortened (polysyllabic shortening), and more frequent words are uttered faster. There is a strong negative correlation (ρ = −0.61) between both parameters31, in that many phones per word correlates with a lower word-form frequency. Incidentally, this confirms the cross-linguistic validity of Zipf’s law of abbreviation that more frequently used words are shorter38,39,40,41. Given the strong correlation between both parameters, the effects in the model should not be interpreted separately but should always be considered together statistically. Local speech rate has the expected large effect on duration in the model (−0.19, 95% HPDI from −0.20 to −0.19). As duration per sound is a central part of calculating speech rate, it is not surprising that this predictor is the strongest of all three. It is important to remember that all three predictors are modelled to be uniform across the whole dataset; that is, they are modelled not to vary between individual languages. The effects for cluster-internal consonants show more variation. Consonants outside of a cluster are shorter (−0.03, 95% HPDI from −0.05 to −0.00) than consonants at the beginning of a cluster. Consonants within a cluster are even shorter (−0.07, 95% HPDI from −0.09 to −0.04). The results per language are presented in Supplementary Information section B. Figure 5 further shows that the utterance-initial and word-initial parameters have a large standard deviation at the population level. This indicates that these predictors do not behave uniformly across languages, as we have already seen for the language-specific distributions.
Posterior evaluation of the model
We ran posterior predictive simulations to confirm that on average, we expect word-initial consonants to be longer than consonants in other positions. A common way to evaluate a Bayesian linear regression model is to run posterior predictions with simulated data33,34. We present such posterior predictions in Fig. 6, where we can observe a higher average duration for word-initial consonants than for the other positions. On average, the word-initial consonants in the simulated dataset are expected to be about 13 ms longer (~106 ms) than consonants in other positions (~93 ms). Full posterior predictive checks according to the Bayesian Analysis Reporting Guidelines42 are presented in Supplementary Information section B.
To control for possible non-independence of data points, we carefully analysed the genealogical and spatial relations in our dataset. Our sample includes data from 30 different language families. While eight language families are represented by multiple languages (for example, seven Austronesian, four Indo-European and four Sino-Tibetan languages), there are 22 language families with only one language in our sample. In the model, we added a varying intercept per language family, which shows a very small variance between language families (0.04 on the log scale). This shows that the model cannot identify systematic patterns across language families and attributes most of the durations to variation between languages, segments or speakers. Further approximations of potential correlations between language families are provided by controlling for spatial autocorrelation, since most of the languages in our sample that are related to each other genealogically (especially Austronesian, Indo-European and Sino-Tibetan languages) are also geographically close to each other.
We also verified that the model is not biased through spatial autocorrelation. This type of bias is frequent in linguistic typology and can arise through the borrowing of structural features between languages43,44. The amount of spatial autocorrelation in data used for regression models can be measured through the Moran coefficient45,46,47,48,49. We based the computation of the Moran coefficient on the geodesic distance between the language coordinates as provided by Glottolog50, following suggestions in the literature51. We computed this coefficient using the geostan package46. In all cases, the coefficient was close to 0, indicating very little or no spatial bias in our data. The full report for each macro area is presented in Supplementary Information section B.
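For readers unfamiliar with the Moran coefficient, the following Python sketch shows the general form of the statistic. It is not the geostan computation used in the study. Two choices here are assumptions made for illustration only: distances are approximated with the haversine formula on a sphere, and the spatial weights are simple inverse distances. The function names are hypothetical, and the coordinates must be distinct to avoid division by zero.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def morans_i(values, coords):
    """Moran's I with inverse-distance weights (an illustrative weighting choice)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    weighted_cross, weight_sum = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a location is not its own neighbour
            w = 1.0 / haversine_km(coords[i], coords[j])
            weight_sum += w
            weighted_cross += w * dev[i] * dev[j]
    return (n / weight_sum) * (weighted_cross / sum(d * d for d in dev))
```

Values near 0 indicate little spatial structure, positive values indicate that nearby locations have similar values, and negative values indicate that nearby locations tend to differ.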
Discussion
The current study reports acoustic evidence that speakers from vastly different cultural, geographic and linguistic backgrounds produce longer word-initial consonants. While languages differed in the magnitude of lengthening, evidence could be observed across a large part of the sample: 43 languages provided evidence in favour of word-initial lengthening, and none provided evidence for word-initial shortening. The effect in those languages was observed while controlling for the known between-speaker variability in prosodic boundary marking23 and the intrinsic differences of lengthening effects of different segments. Since the current study is based on a comprehensive dataset consisting of languages from predominantly non-WEIRD communities from all parts of the world, the distribution of the effect indicates a universal tendency in spoken languages.
Our findings are consistent with models that argue for the dual importance of word-initial lengthening for segmenting speech. First, word-initial lengthening might directly indicate word boundaries. Second, lengthening would facilitate word recognition through the prominent pronunciation of word-initial segments, which are the most informative ones for word identification10,21. One potential reason why speakers’ word-initial lengthening is so widespread is that it can promote these two processing requirements for the listener simultaneously11. There may be additional articulatory reasons for slowing down in the vicinity of boundaries, but how exactly language comprehension and production interact in this respect remains unclear28,52. While the influence of initial lengthening on speech processing has been shown in experimental studies for speakers of some languages21, the cross-linguistic evidence for the role of initial lengthening in speech processing would ultimately have to be confirmed in perception studies. Word-initial lengthening could then emerge as an additional key factor for the segmentation of speech in the multi-faceted process of speech recognition53.
Regarding speech production, our results partially support and partially contradict predictions made by current models of articulatory phonology, such as the π-gesture model. This model predicts that articulatory gestures are slowed down at prosodic boundaries, manifested in acoustic data as lengthening effects52,54. Our findings are, in general, consistent with this view. For larger prosodic boundaries in contrast to smaller ones, the π-gesture model would predict longer durations. If we assume that a word boundary after a pause corresponds to a major prosodic boundary compared with a word boundary with no preceding pause, longer durations should be found in the former than in the latter. However, we did not find a lengthening effect for consonants utterance-initially compared with word-initial positions. For 15 of 51 languages, we even found evidence for shortening of utterance-initial consonants. These findings go against the π-gesture model predictions. The findings do, however, mirror reports on the disappearing effect of final lengthening at strong prosodic boundaries with long pauses18. This suggests that speakers systematically modulate the segmental duration of initial consonants at the word level but do not always mark boundaries of higher prosodic levels at the beginning of an utterance. The absence of additional lengthening in utterance-initial position suggests that consonant lengthening is more closely linked to the segmentation and identification of word units than to prosodically structuring speech into larger units such as prosodic phrases. Since utterances are operationalized as chunks of speech surrounded by silent pauses in our study, we interpret the lack of an effect as being related to the lack of functional ambiguity: the first segment following a pause will necessarily also be the first segment of a word, without the need for further segmentation.
Our findings align with several strands of linguistic research about the phonological role of initial segments. At the level of the syllable, onsets have long been recognized as privileged positions. They show several characteristics that other positions do not show, such as resistance to phonological change15,16,55. From a diachronic perspective, word-initial consonants tend to be more resistant to phonemic change than consonants in other positions. For example, initial consonant retention is far more typical than initial consonant loss, with some notable exceptions found in Indo-European and across Australian languages56,57,58. Initial consonant deletion as a productive synchronic process is even less common (but see ref. 59 for a counterexample). Regarding explanations for such asymmetries, our results lend support to models of evolutionary phonology that view initial strengthening as a cause for the historical development and preservation of ‘strong’ and distinctive word-initial sounds in the phonology and lexicon60. There have also been attempts to relate the role of phonological properties to the functional load of syllable onsets compared with syllable codas, and the word-initial position compared with the word-final position61,62,63. One such study investigated the lexical inventories of 12 mostly Indo-European languages and found that syllable onsets have a considerably higher functional load, giving them an extraordinary status61. Conversely, word-final positions have been shown to have a reduced degree of structural complexity63. These long-term evolutionary processes are consistent with the special role of word-initial segments during the online incremental processing of words.
While the data showed a clear cross-linguistic trend for lengthening at the beginning of words, 8 of 51 languages showed a certain degree of resistance to durational modulations at word-initial position, as evidenced by the intersection of the 89% HPDI with the ROPE (Fig. 3). While this apparent resistance could be explained by insufficient or noisy data, it is also possible that these languages lack word-initial lengthening. Language-specific factors that could affect the degree of lengthening and deserve further attention in future research include the phoneme inventory of the language, the distribution of segments with variable pronunciations (in particular glottal stops), phonological length distinctions (singletons versus geminates) and lexical stress.
Some inevitable limitations might influence the interpretation and generalizability of our findings. First, one limitation of this study lies in the corpus-based approach using aggregated language documentation data and recordings of natural speech. While these data sources provide an ecologically valid and rich set of linguistic samples, they are susceptible to noise and variability inherent in natural speech recordings. They were created over several decades, using different recording equipment and protocols, leading to potential inconsistencies in audio quality. Despite efforts in preselecting high-quality audio for the corpus30, the inherent variation in recording conditions remains a concern. However, the corpus-based approach offers the advantage of observing effects in spontaneously produced speech, outside of a strict experimental setting with a less varied sample of texts and speakers.
Second, the sample size, although comprising 51 diverse languages from 30 different language families, still poses a limitation. For some of these languages, we have data from only one (Kamas, Texistepec Popoluca and Yongning Na) or two speakers (Tabasaran, Northern Alta, Kurmanji and Southern British English), while for many other languages, we have data from more than ten speakers. In an ideal scenario, a larger sample size would enhance the study’s generalizability across an even broader spectrum of languages and language families, as well as speakers64,65. However, while other multilingual speech corpora are available66,67,68, none of these corpora, in our view, achieves the balance that DoReCo offers between corpus size, detailed annotation of relevant features and metadata, and expert-informed processing allowing for reliable alignments across a multitude of low-resourced languages.
A third limitation of the present study lies in its simplistic view of consonant duration. Consonant duration is a multifaceted phenomenon encompassing various acoustic components such as burst, frication, voice onset time and formant transition periods. This also resonates with previous calls for acknowledging the importance of fine phonetic detail for social aspects of communication69. Complementing the study with a detailed articulatory perspective that includes annotation of articulatory gestures in the production of consonants could add more depth to our understanding of the underlying principles of word-initial consonant lengthening for specific languages. However, recording and annotating this kind of complex articulatory data is outside the scope of this study. Another limitation related to the previous one is the lack of accounting for word-level prominence in our analysis. Our corpus data are not annotated for suprasegmental features such as stress or tone. However, on the basis of available phonological descriptions, only 4 of the 51 languages can with some certainty be considered to have fixed initial word stress, while most other languages are either tone languages or stress languages with non-initial stress (Supplementary Information section B). In our model, those four languages do not seem to show any patterns for initial-lengthening effects that distinguish them from the other languages. It therefore seems unlikely that our overall results are skewed by not taking word-initial prominence into account.
Despite these limitations, the evidence across a worldwide sample of languages suggests that the lengthening of word-initial consonants is a potentially fundamental process structuring human speech. This strong effect emerges while carefully controlling for between-speaker variability and variability across segments, which lends further credence to this conclusion. Given the diverse sample of languages in our study, we predict that this effect is replicable for other languages and datasets.
Methods
Language sample
Our study uses data from the DoReCo corpus (v.1.2)29. The corpus contains time-aligned transcriptions and annotations that mostly originated from language documentation collections covering a wide range of typologically diverse languages. In total, DoReCo v.1.2 contains corpora from 51 languages from 30 language families. All corpora are comparable in size and include at least 10,000 phones (before filtering). A detailed account of the individual corpora and their sources is presented in Extended Data Table 1. Word units in our data were defined and annotated by the language experts who contributed data to DoReCo (Extended Data Table 1), on the basis of current standards in descriptive linguistics. Within DoReCo, the heterogeneous documentation data were processed using a combination of automatic and manual techniques. Forced time alignments were created using the WebMAUS service70 first for start and end times of words, which were then corrected manually for the whole corpus30. Following this, the updated alignments were used as input to create automatic alignments at the segment level.
We have converted the corpus data to the Cross-Linguistic Data Format (CLDF)71,72 to facilitate the reuse of the data and replication of our results. A detailed description of using the corpus as a CLDF dataset is provided as Supplementary Information section A. All preprocessing steps were handled using an SQLite query that is based on the CLDF dataset. Before fitting the models, we cleaned the data by excluding certain observations. Since we are interested only in the lengthening of initial consonants, we removed all vowels from the data. We also removed geminates (that is, phonologically long consonants) due to their intrinsic lengthening. Utterance-initial stops have been excluded because their initial closure period following a pause is unmeasurable73. We excluded sounds with a duration equal to or below 30 ms, which was set as the minimum duration by the MAUS aligner, with shorter durations being indicative of imprecise last-resort alignments30. Lastly, we excluded outliers beyond three standard deviations of the mean for each speaker. For most speakers, this resulted in an upper threshold of around 300 ms, which is a very conservative threshold concerning the expected duration of individual segments. Random samples of excluded segments showed that these cases are mostly transcription or alignment errors and have been correctly excluded.
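The exclusion steps listed above can be sketched as follows. The study itself performs preprocessing with an SQLite query on the CLDF dataset, so this Python version is only an illustration. The record fields (`speaker`, `duration_ms`, `is_vowel`, `is_geminate`, `is_stop`, `utt_initial`) and the function name `clean` are hypothetical, and computing the per-speaker mean and standard deviation after the other exclusions is an assumption, since the text does not specify the order.

```python
from collections import defaultdict
from statistics import mean, stdev

MIN_DURATION_MS = 30  # durations at or below this value are excluded


def clean(phones):
    """Apply the exclusion criteria to a list of phone records (dicts)."""
    kept = [
        p for p in phones
        if not p["is_vowel"]                          # consonants only
        and not p["is_geminate"]                      # no phonologically long consonants
        and not (p["is_stop"] and p["utt_initial"])   # closure after a pause is unmeasurable
        and p["duration_ms"] > MIN_DURATION_MS        # drop last-resort alignments
    ]
    # Group the remaining durations by speaker to derive per-speaker bounds.
    by_speaker = defaultdict(list)
    for p in kept:
        by_speaker[p["speaker"]].append(p["duration_ms"])
    bounds = {}
    for speaker, durations in by_speaker.items():
        if len(durations) < 2:
            bounds[speaker] = (float("-inf"), float("inf"))  # SD undefined, keep all
        else:
            m, s = mean(durations), stdev(durations)
            bounds[speaker] = (m - 3 * s, m + 3 * s)
    # Keep only phones within three standard deviations of the speaker mean.
    return [
        p for p in kept
        if bounds[p["speaker"]][0] <= p["duration_ms"] <= bounds[p["speaker"]][1]
    ]
```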
Causal effects on segment duration
In our model, we controlled for several known causal effects on the duration of phones. We controlled for inter- and intra-speaker variation in speech rate through the proxy variable ‘local speech rate’, which is equal to the average duration of phones per utterance. We also controlled for the number of phones per word and the word-form frequency as fixed effects. The word-form frequency is computed as the frequency of each form within the DoReCo corpus core set of each language. Both parameters are predicted to be highly correlated. For frequency of occurrence, more frequent words are known to be shorter (Zipf’s law of abbreviation)74,75,76,77. Longer words have been shown to have shorter components, most crucially shorter affixes and shorter phones in specific conditions such as under phrasal accent (Menzerath’s law or polysyllabic shortening)26,39,41,78,79. In our model, these three variables were log-scaled and standardized for each language.
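The two preprocessing ideas in this paragraph, local speech rate as the mean phone duration per utterance and log-scaling followed by standardization within each language, can be sketched as below. All function names and the `language` field are hypothetical. Using the sample standard deviation is an assumption, and each language needs at least two observations with non-identical positive values.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def local_speech_rate(durations_by_utterance):
    """Mean phone duration (ms) per utterance, used as a proxy for speech rate."""
    return {utt: mean(durs) for utt, durs in durations_by_utterance.items()}


def log_standardize(values):
    """Log-transform positive values, then centre and scale them (z-scores)."""
    logs = [math.log(v) for v in values]
    m, s = mean(logs), stdev(logs)
    return [(x - m) / s for x in logs]


def standardize_by_language(rows, field):
    """Apply log_standardize separately within each language, keeping row order."""
    index_by_language = defaultdict(list)
    for i, row in enumerate(rows):
        index_by_language[row["language"]].append(i)
    out = [None] * len(rows)
    for indices in index_by_language.values():
        scaled = log_standardize([rows[i][field] for i in indices])
        for i, z in zip(indices, scaled):
            out[i] = z
    return out
```

Standardizing within each language puts the three predictors on a comparable scale across corpora of different sizes and speech styles, which is why their coefficients can be read side by side.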
The effect for word- and utterance-initial position was modelled with varying intercepts and slopes across all languages. This ensured that we could assess the effects in all languages, instead of interpreting the effect on the population level as being true for all languages80,81. We also included ‘speaker’ as a varying effect in our model, as there are huge amounts of variation between speakers in all linguistic domains82,83,84. It is necessary to control for this kind of variation to make valid generalizations about language64,65. Finally, we controlled for variation of the effect across different segments since there might be variation in the elasticity of segments depending on their place and manner of articulation. In total, the corpus includes 191 different segment types, which are mapped from their X-SAMPA representation in DoReCo to the Cross-Linguistic Transcription Systems standard85,86.
Model fitting and evaluation
The reason for choosing a Bayesian approach is the wide range of tools to include prior knowledge of the world in the model and to develop a transparent and reliable model output that is explicit about any uncertainty involved in the inference87,88. The goal of our analysis is to determine the effect size of the word-initial position of phones in speech. Given that we know quite a lot about speech sounds in general, such as expected duration and known causal influences, we can add this prior knowledge directly into the model. Bayesian regression offers several well-designed measures for enabling transparency of the workflow89,90. We report on all relevant points of the Bayesian Analysis Reporting Guidelines42 either in the main text or in the Supplementary Information. We did not include a large-scale sensitivity analysis for our prior distributions, due to the large and energy-intensive computing times. We hope that the prior predictive checks provide sufficient information for the credibility of our prior distributions. We further excluded the points that relate to hypothesis testing with Bayes factors since no model comparison was done in our study. Instead of doing a model comparison or null-hypothesis significance test, we analysed the effect size of our target parameter while controlling for known causal factors.
The model was fitted using brms91,92, a package in R93, with cmdstanr as a backend. The model was run with 4,000 Markov chain Monte Carlo (MCMC) iterations (2,500 for warm-up) on four parallel chains. A computational and visual confirmation of model convergence as well as prior and posterior predictive checks are presented in Supplementary Information section B.
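As a quick check on the sampler settings, the number of retained posterior draws can be worked out as follows. The sketch assumes that the 4,000 iterations are counted per chain and include the warm-up phase, which is how the `iter` argument of brms is defined; the helper name `posterior_draws` is hypothetical.

```python
def posterior_draws(iterations, warmup, chains):
    """Post-warm-up draws retained across all chains.

    Assumes `iterations` is the per-chain total including warm-up.
    """
    if warmup >= iterations:
        raise ValueError("warm-up must be shorter than the total number of iterations")
    return (iterations - warmup) * chains


total = posterior_draws(iterations=4000, warmup=2500, chains=4)
print(total)  # 6000
```

Under that assumption, each chain contributes 1,500 draws after warm-up, for 6,000 posterior draws in total.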
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
For this study, we used data from the DoReCo corpus (v.1.2) and converted them to a CLDF dataset (v.1.2.1)29,94. While the data are available as Open Access, some files come with a non-derivative restriction. We have therefore added instructions for an automated workflow of downloading the data and converting them to an SQLite database via CLDF instead of providing the data directly71,72, thereby adhering to the non-derivative restrictions. To reproduce the exact steps, please follow the instructions provided in our GitHub repository (https://github.com/FredericBlum/initial_lengthening/blob/v1.0/README.md).
Code availability
The current version of the code (v.1.0) is available via Zenodo at https://doi.org/10.5281/zenodo.13141902 (ref. 95) and curated on GitHub (https://github.com/FredericBlum/initial_lengthening/tree/v1.0). We provide full instructions to reproduce our results in a README.md in the shared repository. The models have been uploaded to an OSF directory (https://doi.org/10.17605/OSF.IO/TC9ZX) since we could not upload them to GitHub due to their large file size.
Change history
04 October 2024
In the version of the article initially published, "WEIRD" was defined as "Western, European, industrial, rich and democratic". This has now been corrected to "Western, educated, industrial, rich and democratic" in the HTML and PDF versions of the article.
References
Cutler, A. in Lexical Representation and Process (ed. Marslen-Wilson, W.) 342–356 (MIT Press, 1989).
Brent, M. R. Speech segmentation and word discovery: a computational perspective. Trends Cogn. Sci. 3, 294–301 (1999).
Mattys, S. L., White, L. & Melhorn, J. F. Integration of multiple speech segmentation cues: a hierarchical framework. J. Exp. Psychol. Gen. 134, 477–500 (2005).
Gong, X. L. et al. Phonemic segmentation of narrative speech in human cerebral cortex. Nat. Commun. 14, 4309 (2023).
Henrich, J., Heine, S. J. & Norenzayan, A. Most people are not WEIRD. Nature 466, 29–29 (2010).
Blasi, D. E., Henrich, J., Adamou, E. & Kemmerer, D. Over-reliance on English hinders cognitive science. Trends Cogn. Sci. 26, 1153–1170 (2022).
Ladefoged, P. & Maddieson, I. The Sounds of the World’s Languages (Blackwell, 1996).
Evans, N. & Levinson, S. C. The myth of language universals: language diversity and its importance for cognitive science. Behav. Brain Sci. 32, 429–448 (2009).
Skirgård, H. et al. Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss. Sci. Adv. 9, 6175 (2023).
Wedel, A., Ussishkin, A. & King, A. Incremental word processing influences the evolution of phonotactic patterns. Folia Linguist. 53, 231–248 (2019).
Norris, D., McQueen, J. M., Cutler, A. & Butterfield, S. The possible-word constraint in the segmentation of continuous speech. Cogn. Psychol. 34, 191–243 (1997).
Kingston, J. Lenition. In Selected Proc. 3rd Conference on Laboratory Approaches to Spanish Phonology (eds Colantoni, L. & Steele, J.) 1–31 (Cascadilla Proceedings Project, 2008).
Lavoie, L. M. Consonant Strength: Phonological Patterns and Phonetic Manifestations (Routledge, 2015); https://doi.org/10.4324/9780203826423
Katz, J. Lenition, perception and neutralisation. Phonology 33, 43–85 (2016).
Topintzi, N. Onsets: Suprasegmental and Prosodic Behaviour Cambridge Studies in Linguistics Vol. 125 (Cambridge Univ. Press, 2010); https://doi.org/10.1017/CBO9780511750700
Easterday, S. Highly Complex Syllable Structure: A Typological and Diachronic Study (Language Science Press, 2019); https://doi.org/10.5281/zenodo.3268721
Paschen, L., Fuchs, S. & Seifart, F. Final lengthening and vowel length in 25 languages. J. Phon. 94, 101179 (2022).
Kentner, G., Franz, I., Knoop, C. A. & Menninghaus, W. The final lengthening of pre-boundary syllables turns into final shortening as boundary strength levels increase. J. Phon. 97, 101225 (2023).
Fletcher, J. in The Handbook of Phonetic Sciences 2nd edn (eds Hardcastle, W. J. et al.) 521–602 (Blackwell, 2010); https://doi.org/10.1002/9781444317251.ch15
Klatt, D. H. Linguistic uses of segmental duration in English: acoustic and perceptual evidence. J. Acoust. Soc. Am. 59, 1208–1221 (1976).
White, L., Benavides-Varela, S. & Mády, K. Are initial-consonant lengthening and final-vowel lengthening both universal word segmentation cues? J. Phon. 81, 100982 (2020).
Quené, H. Durational cues for word segmentation in Dutch. J. Phon. 20, 331–350 (1992).
Fougeron, C. & Keating, P. A. Articulatory strengthening at edges of prosodic domains. J. Acoust. Soc. Am. 101, 3728–3740 (1997).
Cho, T. Prosodic boundary strengthening in the phonetics–prosody interface. Lang. Linguist. Compass 10, 120–141 (2016).
Cho, T. & McQueen, J. M. Prosodic influences on consonant production in Dutch: effects of prosodic boundaries, phrasal accent and lexical stress. J. Phon. 33, 121–157 (2005).
White, L. Communicative function and prosodic form in speech timing. Speech Commun. 63-64, 38–54 (2014).
Souza, R. in Prosodic Boundary Phenomena (eds Schubö, F. et al.) 35–86 (Language Science Press, 2023); https://doi.org/10.5281/zenodo.7777469
White, L. English Speech Timing: A Domain and Locus Approach. PhD thesis, Univ. Edinburgh (2002); https://era.ed.ac.uk/handle/1842/23256
Seifart, F., Paschen, L. & Stave, M. Language Documentation Reference Corpus (DoReCo) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/NKL.7CBFQ779
Paschen, L. et al. Building a time-aligned cross-linguistic reference corpus from language documentation data (DoReCo). In Proc. 12th Language Resources and Evaluation Conference (eds Calzolari, N. et al.) 2657–2666 (European Language Resources Association, 2020); https://aclanthology.org/2020.lrec-1.324
Winter, B. Statistics for Linguists: An Introduction Using R (Routledge, 2019); https://doi.org/10.4324/9781315165547
Vasishth, S. & Nicenboim, B. Statistical methods for linguistic research: foundational ideas—part I. Lang. Linguist. Compass 10, 349–369 (2016).
McElreath, R. Statistical Rethinking: A Bayesian Course with Examples in R and Stan (Chapman and Hall/CRC, 2020); https://doi.org/10.1201/9780429029608
Gelman, A. et al. Bayesian Data Analysis (Chapman and Hall/CRC, 2013); https://doi.org/10.1201/b16018
Gelman, A. & Tuerlinckx, F. Type S error rates for classical and Bayesian single and multiple comparison procedures. Comput. Stat. 15, 373–390 (2000).
Kruschke, J. K. Rejecting or accepting parameter values in Bayesian estimation. Adv. Methods Pract. Psychol. Sci. 1, 270–280 (2018).
Makowski, D., Ben-Shachar, M. S., Chen, S. H. A. & Lüdecke, D. Indices of effect existence and significance in the Bayesian framework. Front. Psychol. 10, 2767 (2019).
Bentz, C. & Ferrer-i-Cancho, R. Zipf’s law of abbreviation as a language universal. In Proc. Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics (eds Bentz, C., Jäger, G. & Yanovich, I.) 1–4 (Univ. Tübingen, 2016); https://doi.org/10.15496/publikation-10057
Kanwal, J., Smith, K., Culbertson, J. & Kirby, S. Zipf’s law of abbreviation and the principle of least effort: language users optimise a miniature lexicon for efficient communication. Cognition 165, 45–52 (2017).
Strunk, J. et al. Determinants of phonetic word duration in ten language documentation corpora: word frequency, complexity, position, and part of speech. Lang. Doc. Conserv. 14, 423–461 (2020).
Stave, M., Paschen, L., Pellegrino, F. & Seifart, F. Optimization of morpheme length: a cross-linguistic assessment of Zipf’s and Menzerath’s laws. Linguist. Vanguard 7, 20190076 (2021).
Kruschke, J. K. Bayesian analysis reporting guidelines. Nat. Hum. Behav. 5, 1282–1291 (2021).
Guzmán Naranjo, M. & Becker, L. Statistical bias control in typology. Linguist. Typol. 26, 605–670 (2021).
Guzmán Naranjo, M. & Mertner, M. Estimating areal effects in typology: a case study of African phoneme inventories. Linguist. Typol. 27, 455–480 (2022).
Chun, Y. & Griffith, D. A. Spatial Statistics and Geostatistics: Theory and Applications for Geographic Information Science and Technology (Sage, 2013).
Donegan, C. geostan: an R package for Bayesian spatial analysis. J. Open Source Softw. 7, 4716 (2022).
Tiefelsdorf, M. & Boots, B. The exact distribution of Moran’s I. Environ. Plan. A 27, 985–999 (1995).
Griffith, D. A. A linear regression solution to the spatial autocorrelation problem. J. Geogr. Syst. 2, 141–156 (2000).
Griffith, D. A. & Chun, Y. Some useful details about the Moran coefficient, the Geary ratio, and the join count indices of spatial autocorrelation. J. Spat. Econom. 3, 12 (2022).
Hammarström, H., Forkel, R., Haspelmath, M. & Bank, S. Glottolog v.5.0 (Max Planck Institute for Evolutionary Anthropology, 2024); https://doi.org/10.5281/zenodo.10804357
Guzmán Naranjo, M. & Jäger, G. Euclide, the crow, the wolf and the pedestrian: distance metrics for linguistic typology. Open Res. Eur. 3, 104 (2023).
Byrd, D. & Krivokapić, J. Cracking prosody in articulatory phonology. Annu. Rev. Linguist. 7, 31–53 (2021).
Norris, D. & McQueen, J. M. Shortlist B: a Bayesian model of continuous speech recognition. Psychol. Rev. 115, 357–395 (2008).
Byrd, D. & Saltzman, E. The elastic phrase: modeling the dynamics of boundary-adjacent lengthening. J. Phon. 31, 149–180 (2003).
Zec, D. in The Cambridge Handbook of Phonology (ed. de Lacy, P.) 161–194 (Cambridge Univ. Press, 2007); https://doi.org/10.1017/CBO9780511486371.009
Blevins, J. in Forty Years On: Ken Hale and Australian Languages (eds Simpson, J. et al.) 481–492 (Pacific Linguistics, 2001); https://doi.org/10.15144/PL-512.481
Green, A. D. in The Syllable in Optimality Theory (eds Féry, C. & van de Vijver, R.) 238–253 (Cambridge Univ. Press, 2003); https://doi.org/10.1017/CBO9780511497926.010
Miceli, L. & Round, E. Where have all the sound changes gone? Examining the scarcity of evidence for regular sound change in Australian languages. Linguist. Vanguard 8, 509–518 (2022).
Marley, A. H. Sound change in Aboriginal Australia: word-initial engma deletion in Kunwok. Linguist. Vanguard 8, 645–659 (2022).
Blevins, J. in The Oxford Handbook of Historical Phonology (eds Honeybone, P. & Salmons, J.) 485–500 (Oxford Univ. Press, 2015); https://doi.org/10.1093/oxfordhb/9780199232819.013.006
Sun, Y. & Poeppel, D. Syllables and their beginnings have a special role in the mental lexicon. Proc. Natl Acad. Sci. USA 120, 2215710120 (2023).
Wedel, A., Kaplan, A. & Jackson, S. High functional load inhibits phonological contrast loss: a corpus study. Cognition 128, 179–186 (2013).
Wedel, A., Ussishkin, A. & King, A. Crosslinguistic evidence for a strong statistical universal: phonological neutralization targets word-ends over beginnings. Language 95, 428–446 (2019).
Yarkoni, T. The generalizability crisis. Behav. Brain Sci. 45, e1 (2020).
Winter, B. & Grice, M. Independence and generalizability in linguistics. Linguistics 59, 1251–1277 (2021).
Salesky, E. et al. A corpus for large-scale phonetic typology. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J.) 4526–4546 (Association for Computational Linguistics, 2020); https://doi.org/10.18653/v1/2020.acl-main.415
Lingua Libre (Wikimédia France, 2020–2023); https://lingualibre.org/wiki/LinguaLibre:Main_Page
Ardila, R. et al. Common voice: a massively-multilingual speech corpus. In Proc. 12th Language Resources and Evaluation Conference (eds Calzolari, N. et al.) 4218–4222 (European Language Resources Association, 2020); https://aclanthology.org/2020.lrec-1.520
Hawkins, S. Roles and representations of systematic fine phonetic detail in speech understanding. J. Phon. 31, 375–405 (2003).
Kisler, T., Schiel, F. & Sloetjes, H. Signal processing via web services: the use case WebMAUS. In Proc. Digital Humanities (ed. Meister, J. C.) 30–34 (Hamburg University Press, 2012).
Forkel, R. et al. Cross-linguistic data formats, advancing data sharing and re-use in comparative linguistics. Sci. Data 5, 180205 (2018).
Forkel, R. & List, J.-M. CLDFBench: give your cross-linguistic data a lift. In Proc. 12th Language Resources and Evaluation Conference (eds Calzolari, N. et al.) 6995–7002 (European Language Resources Association, 2020); https://aclanthology.org/2020.lrec-1.864
Turk, A., Nakai, S. & Sugahara, M. in Methods in Empirical Prosody Research (eds Sudhoff, S. et al.) 1–28 (De Gruyter, 2006); https://doi.org/10.1515/9783110914641.1
Zipf, G. K. The Psycho-biology of Language: An Introduction to Dynamic Philology (George Routledge & Sons, Houghton, Mifflin, 1935).
Zipf, G. K. Human Behavior and the Principle of Least Effort (Addison-Wesley, 1949).
Sigurd, B., Eeg-Olofsson, M. & van de Weijer, J. Word length, sentence length and frequency—Zipf revisited. Stud. Linguist. 58, 37–52 (2004).
Jurafsky, D., Bell, A., Gregory, M. & Raymond, W. D. in Frequency and the Emergence of Linguistic Structure (eds Bybee, J. & Hopper, P.) 229 (John Benjamins, 2001); https://doi.org/10.1075/tsl.45.13jur
Gahl, S., Yao, Y. & Johnson, K. Why reduce? Phonological neighborhood density and phonetic reduction in spontaneous speech. J. Mem. Lang. 66, 789–806 (2012).
Piantadosi, S. T., Tily, H. & Gibson, E. Word lengths are optimized for efficient communication. Proc. Natl Acad. Sci. USA 108, 3526–3529 (2011).
Evans, N. & Levinson, S. C. The myth of language universals: language diversity and its importance for cognitive science. Behav. Brain Sci. 32, 429–448 (2009).
Bickel, B. Statistical modeling of language universals. Linguist. Typol. 15, 401–413 (2011).
Baayen, H., Davidson, D. J. & Bates, D. M. Mixed-effects modeling with crossed random effects for subjects and items. J. Mem. Lang. 59, 390–412 (2008).
Yu, A. C. L. & Zellou, G. Individual differences in language processing. Annu. Rev. Linguist. 5, 131–150 (2019).
Barth, D. et al. in Doing Corpus-Based Typology with Spoken Language Data: State of the Art (eds Haig, G. et al.) 179–232 (Univ. Hawai’i Press, 2021); http://hdl.handle.net/10125/74661
Anderson, C. et al. A cross-linguistic database of phonetic transcription systems. Yearb. Poznan Linguist. Meet. 4, 21–53 (2018).
List, J.-M., Anderson, C., Tresoldi, T., Rzymski, C. & Forkel, R. CLTS: Cross-Linguistic Transcription Systems. Zenodo https://doi.org/10.5281/zenodo.10997741 (2024).
Vasishth, S., Nicenboim, B., Beckman, M. E., Li, F. & Kong, E. J. Bayesian data analysis in the phonetic sciences. J. Phon. 71, 147–161 (2018).
Vasishth, S. & Gelman, A. How to embrace variation and accept uncertainty in linguistic and psycholinguistic data analysis. Linguistics 59, 1311–1342 (2021).
Gabry, J., Simpson, D., Vehtari, A., Betancourt, M. & Gelman, A. Visualization in Bayesian workflow. J. R. Stat. Soc. A 182, 389–402 (2019).
Vehtari, A., Gelman, A. & Gabry, J. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat. Comput. 27, 1413–1432 (2016).
Bürkner, P.-C. brms: an R package for Bayesian multilevel models using Stan. J. Stat. Softw. 80, 1–28 (2017).
Bürkner, P.-C. Advanced Bayesian multilevel modeling with the R package brms. R J. 10, 395–411 (2018).
R Core Team. R: A Language and Environment for Statistical Computing https://www.R-project.org/ (R Foundation for Statistical Computing, 2018).
Seifart, F., Paschen, L., Stave, M., Forkel, R. & Blum, F. CLDF dataset derived from the DoReCo core corpus v1.2.1. Zenodo https://doi.org/10.5281/zenodo.10990565 (2024).
Blum, F., Paschen, L., Forkel, R., Fuchs, S. & Seifart, F. Code accompanying the submission for ‘Consonant lengthening marks the beginning of words across a diverse sample of languages’. Zenodo https://doi.org/10.5281/zenodo.11198843 (2024).
Rose, F. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.cbc3b4xr
Ozerov, P. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.0dbazp8m
Cowell, A. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.36f5r1b6
Griscom, R. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.f77c7m72
Cobbinah, A. Y. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.a332abw8
Vanhove, M. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.edd011t1
Seifart, F. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.6eaf5laq
Quesada, J. D., Skopeteas, S., Pasamonik, C., Brokmann, C. & Fischer, F. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.ebc4ra22
Reiter, S. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.a8f9q2f1
Krifka, M. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.efeav5l9
Ponsonnet, M. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.fae299ug
Däbritz, C. L., Kudryakova, N., Stapert, E. & Arkhipov, A. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.f09eikq3
Schiborr, N. N. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.9c271u5g
Kazakevich, O. & Klyachko, E. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.5e0d27cu
Franjieh, M. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.02084446
Avanzi, M., Béguelin, M.-J., Corminboeuf, G., Diémoz, F. & Johnsen, L. A. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.3520l685
Hellwig, B. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.b93664ml
Harvey, A. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.a4b4ijj2
Hartmann, I. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.b57f5065
Burenhult, N. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.6a71xp0p
Kim, S.-U. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.06ebrk38
Vydrina, A. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.d5aeu9t6
Gusev, V., Klooster, T., Wagner-Nagy, B. & Arkhipov, A. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.cdd8177b
Döhler, C. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.c5e6dudv
O’Shannessy, C. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.7452803q
Bartels, H. & Szczepański, M. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.6c6e4e9k
Haude, K. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.da42xf67
Thieberger, N. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.ba4f760l
Aznar, J. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.2801565f
Garcia-Laguia, A. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.efea0b36
Haig, G., Vollmer, M. & Thiele, H. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.ca10ez5t
Güldemann, T., Ernszt, M., Siegmund, S. & Witzlack-Makarevich, A. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.f6c37fi0
Ring, H. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.5ba1062k
Seifart, F. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.ffb96lo8
Witzlack-Makarevich, A., Namyalo, S., Kiriggwajjo, A. & Molochieva, Z. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.fde4pp1u
Xu, X. & Bai, B. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.3db4u59d
Forker, D. & Schiborr, N. N. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.81934177
Wegener, C. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.b74d1b33
Gippert, J. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.9ba054c3
Teo, A. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.5ad4t01p
Hellwig, B., Schneider-Blum, G. & Ismail, K. B. K. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.eea8144j
Bogomolova, N., Ganenkov, D. & Schiborr, N. N. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.ad7f97xr
Mosel, U. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.9322sdf2
Wichmann, S. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.c50ck58f
Skopeteas, S., Moisidi, V., Tsetereli, N., Lorenz, J. & Schröter, S. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.ac166n10
Schnell, S. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.3e2cu8c4
O’Shannessy, C. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.042dv614
Riesberg, S. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.9d91nkq2
Michaud, A. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.abe65p95
Skopeteas, S. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.9cbb3619
Gipper, S. & Ballivián Torrico, J. in Language Documentation Reference Corpus (DoReCo) v.1.2 (eds Seifart, F. et al.) (Leibniz-Zentrum Allgemeine Sprachwissenschaft & Laboratoire Dynamique du Langage, UMR5596, CNRS & Université Lyon 2, 2022); https://doi.org/10.34847/nkl.7ca412wg
Acknowledgements
We thank L. Dees, J. Krivokapić, J. Mansfield, A. Wedel and S. Wichmann for their helpful comments. We thank C. Rzymski for an extensive review of our code and for providing support with the HPC cluster. We thank M. Mertner for suggestions on analysing possible spatial dependencies. All remaining errors are our responsibility. This study was partially supported by the Max Planck Society Research Grant ‘Beyond CALC: Computer-Assisted Approaches to Human Prehistory, Linguistic Typology, and Human Cognition (CALC3)’ (F.B.), awarded to J.-M. List (2022–2024), and DFG grants SE 1949/3-1 and SE 1949/5-1 awarded to F.S. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Funding
Open access funding provided by Max Planck Society.
Author information
Contributions
F.B. conceptualized the study under the supervision of F.S. and S.F. F.B. designed and analysed the statistical model. R.F. provided the conversion of the raw data to CLDF as well as the preprocessing of the data. F.B., S.F. and F.S. wrote the initial draft of the Introduction. F.B. wrote the initial draft of the Results. L.P. wrote the initial draft of the Discussion. F.B. and L.P. wrote the initial draft of the Methods. R.F. wrote the usage guide (Supplementary Information section A). F.B. and L.P. wrote Supplementary Information section B. All authors have read, commented on and approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Human Behaviour thanks Gerrit Kentner, Laurence White and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Information sections A and B, including figures.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Blum, F., Paschen, L., Forkel, R. et al. Consonant lengthening marks the beginning of words across a diverse sample of languages. Nat Hum Behav (2024). https://doi.org/10.1038/s41562-024-01988-4
DOI: https://doi.org/10.1038/s41562-024-01988-4