Emergence of linguistic laws in human voice

Linguistic laws constitute one of the quantitative cornerstones of modern cognitive science and have been routinely investigated in written corpora, or in the equivalent transcription of oral corpora. As a consequence, inferences about the statistical patterns of language in acoustics are biased by the arbitrary, language-dependent segmentation of the signal, and comparative studies between human voice and other animal communication systems are virtually precluded. Here we bridge this gap by proposing a method that allows such patterns to be measured in acoustic signals of arbitrary origin, without requiring access to the underlying language corpus. The method has been applied to sixteen different human languages, successfully recovering some well-known laws of human communication at timescales even below the phoneme and finding yet another link between complexity and criticality in a biological system. These methods further pave the way for new comparative studies in animal communication and for the analysis of signals of unknown code.


Data
For this work we have used two databases: the main one is a TV broadcast speech database named KALAKA-2 64, and a complementary database named LRE has additionally been used to confirm the results on a larger set of languages. Originally designed for language-recognition evaluation purposes, the former consists of wide-band TV broadcast speech recordings (4 hours per language, sampled at 16 kHz with 2 bytes per sample) spanning six different languages: Basque, Catalan, Galician, Spanish, Portuguese and English, and encompassing both planned and spontaneous speech under diverse environmental conditions, such as studio recordings or outside journalist reports, but excluding telephonic channels. The second database complements KALAKA with an additional set of 12 languages, taken from the NIST Language Recognition Evaluation (LRE′96) corpus and including Japanese, Vietnamese, Mandarin, Korean, Arabic, Hindi and Tamil, to name a few (see SI for details).

The Method
The objects under study, speech waveforms or otherwise any generic acoustic signal, are fully described by an amplitude time series A(t) (see Fig. 1 for an illustration of the method). In order to unambiguously extract a sequence of elements (the equivalent of words and phrases) from such a signal without performing ad hoc segmentation, we start by considering the positive semi-definite magnitude ε(t) = |A(t)|² which, dropping irrelevant constants, has physical units of energy per unit time (see SI for additional details on speech waveform statistics). By defining an energy threshold Θ on this signal 47 we unambiguously separate voice events (above threshold) from silence events (below threshold). More concretely, Θ is defined as a relative percentage of the signal, and its actual value in energy units depends on the signal's variability range: for example, Θ = 80% means that 20% of the data falls below this energy level. It has been shown that thresholding decimates the signal similarly to a real-space Renormalization Group (RG) transformation 47,66, in such a way that increasing Θ induces a flow in RG space. Systems operating close to a critical state lie at an unstable fixed point of this RG flow, and their associated signal statistics are therefore threshold-invariant. Now, Θ not only works as an energy threshold that filters out background or environmental noise (noise filtering being a key task that species have learned to perform 27,67) but, as previously stated, also enables us to unambiguously distinguish what we call a token or voice event, that is, a sequence of consecutive measurements with ε(t) > Θ, from a silence event of duration τ. Each token is in turn characterized by a pair (E_v, T_v), where T_v is the duration of the event and E_v is the total energy released during the event, obtained by summing the instantaneous energy over its duration.
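As an illustration, the thresholding step described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' actual pipeline; the function and parameter names (`extract_tokens`, `pct_above`) are ours.

```python
import numpy as np

def extract_tokens(amplitude, pct_above=80.0):
    """Extract voice events (tokens) from an amplitude series A(t).

    A token is a maximal run of samples whose instantaneous energy
    eps(t) = |A(t)|^2 exceeds the threshold Theta, chosen so that
    pct_above percent of the samples remain above threshold.
    Returns a list of (E_v, T_v) pairs: total energy released and
    duration (in samples) of each voice event.
    """
    eps = np.abs(amplitude) ** 2                 # instantaneous energy per unit time
    theta = np.percentile(eps, 100.0 - pct_above)
    above = eps > theta
    tokens, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                            # a voice event begins
        elif not flag and start is not None:
            tokens.append((eps[start:i].sum(), i - start))
            start = None
    if start is not None:                        # event runs to the end of the signal
        tokens.append((eps[start:].sum(), len(eps) - start))
    return tokens
```

Consecutive below-threshold samples between two tokens then constitute a silence event, whose duration τ is simply the gap length.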
Accordingly, the signal is effectively transformed into an ordered sequence of tokens {(E_v(i), T_v(i))}, each of these being separated by silence events of highly heterogeneous durations τ which, incidentally, are known to be power-law distributed 47. Finally, by linearly binning the scale of integrated energies we can assign an energy label (the bin) to each token, hence mapping the initial acoustic signal into a symbolic sequence of fundamental units which we call types. Note that two tokens whose integrated energies fall within the same energy bin are mapped to the same type even if their durations differ, so in principle several tokens can map into the same type (see SI for a table of type/token ratios). The linear binning of the energy scale has a bin size b = 0.01, in such a way that the first bin agglutinates tokens with energies between E_0 and E_0 + b, the second bin between E_0 + b and E_0 + 2b, and so on (the results reported later are robust against changes in b). The set of bins can be understood as an abstraction of a universal language vocabulary; accordingly, some bins might be empty and in general each bin will occur with uneven frequency. As such, types can be understood as acoustically based universal abstractions of a fundamental unit, an abstract version of the words or phonemes that appear intertwined in a signal with characteristic patterns.
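The mapping from token energies to type labels via linear binning can likewise be sketched. Again this is a minimal illustration under the stated convention (bin size b = 0.01, bins anchored at the minimum observed energy); the helper name is ours.

```python
import numpy as np

def tokens_to_types(energies, b=0.01):
    """Map token energies to integer type labels via linear binning.

    Bin k collects tokens with energy in [E_0 + k*b, E_0 + (k+1)*b),
    where E_0 is the minimum observed energy. Tokens falling in the
    same bin receive the same type regardless of their duration.
    """
    energies = np.asarray(energies, dtype=float)
    e0 = energies.min()
    return ((energies - e0) // b).astype(int)
```

The vocabulary V is then simply the number of distinct labels returned, i.e. the number of non-empty bins.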
To summarize, with this methodology we are able to map an arbitrary acoustic signal into a sequence of types separated by silence events (Fig. 1). Standard linguistic laws can then be directly explored in acoustic signals without a priori knowledge of the signal code, of the adequate segmentation process, or of the particular syntax of the language underlying the signal. This protocol is thus independent of the communication system and can be used to make unbiased comparisons across different systems and signals. Needless to say, results could in principle depend on the particular value of Θ, as this scans the signal at different energy thresholds. However, human voice has recently been shown 47 to be invariant under changes in Θ (evidence of self-organized criticality (SOC) 68 in this system) and, accordingly, parameter-free laws can be extracted using a proper collapse theory, as will be shown in the Results section. Also, in order to guarantee that the emergence of linguistic laws is due only to the structure and correlations of the signal and not to the process of symbolization, we will compare the results obtained from speech signals to properly defined null models which randomize the signal ε(t). These null models maintain the marginal instantaneous-energy distribution while removing any other correlation structure, yielding non-Gaussian white noise with a fat-tailed marginal distribution.
Figure 1. Waveform series A(t) are sampled at 16 kHz from the system. In (a) we consider an excerpt of A(t) that coincides with an individual articulating the words "the house". In the middle panel we plot the instantaneous energy per unit time ε(t) = |A(t)|² for an excerpt of the top panel. The energy threshold Θ, defined as the instantaneous energy level for which a fixed percentage of the entire data remains above threshold, allows us to unambiguously separate a token or voice event (a subsequence of time stamps for which ε(t) > Θ) from silence events of duration τ 47. The energy E_i released in voice event i (where i ∈ ℕ) is computed by integrating the instantaneous energy over the duration T_i of that event (the dark area in the figure denotes the energy released in a given voice event). Subsequently, by performing a linear binning, tokens are classified into bins that we call types (in the plot, E_A, E_B, … are different bins). The vocabulary V agglutinates those types that appear at least once.
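A null model of this kind is straightforward to construct; the sketch below (our own minimal version, not the authors' exact implementation) simply permutes the instantaneous-energy series:

```python
import numpy as np

def null_model(eps, seed=0):
    """Null model: randomly permute the instantaneous-energy series.

    The marginal distribution of eps is preserved exactly, while all
    temporal correlations are destroyed, yielding non-Gaussian white
    noise with the same fat-tailed marginal distribution.
    """
    rng = np.random.default_rng(seed)
    return rng.permutation(np.asarray(eps, dtype=float))
```

Running the token-extraction and binning steps on `null_model(eps)` instead of `eps` then yields the reference statistics against which the speech results are compared.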
It should be highlighted that, according to the above protocol, the types extracted do not in principle bear a direct relation with real words or any other linguistic units, but only a formal one, as these types constitute a base of fundamental acoustic units for any oral communication system. As a matter of fact, we consider a direct correspondence unlikely, as our basic units of study, while spanning several scales, are in most cases smaller than the typical timescales of words. Further work should explore to what extent there is a connection between tokens, or sequences of them, and phonemes, words or other linguistic units. Notwithstanding, this a priori lack of matching is indeed required and desired if one aims at developing a method that is universally applicable to communication signals of arbitrary origin, not necessarily oral communication based on words. Such is the case, for instance, in the realm of comparative analysis across different species. In any case, note that our method relies on retrieving voice events via energy thresholding. As many of the detected events typically belong to the intraphoneme range, one could argue that the method focuses on the rich composition of voice and silence events within a given linguistic unit (phonemes or words).
Although it is not the purpose of this work, this method may be seen as an unsupervised Voice Activity Detection (VAD) algorithm. It is well known that VAD algorithms based on temporal-domain analysis or on energy features are highly affected by noise, but note that VAD systems are typically optimized against reference transcriptions manually annotated at the word level 69,70. The purpose of our approach, however, is not to perform word- or phoneme-level segmentation, but rather to explore organisational patterns of language-independent acoustic units taking place at different scales in oral communication, and to quantify them via classical linguistic analysis. That being said, assessing how informative those cues are in (unsupervised) word segmentation or speech recognition tasks [71–73] is an interesting question that deserves further investigation.

Results
Gutenberg-Richter law. The energy E released during voice events is a direct measure of the vocal fold response function under air pressure perturbations, and its distribution P_Θ(E) could in principle depend both on the threshold Θ and on the language under study. In the inset of Fig. 2 we observe that P_Θ(E) is power-law distributed over about six decades, saturated by an exponential cut-off. This distribution has been interpreted before as the analogue of a Gutenberg-Richter law in voice, as the precise shape of energy-release fluctuations during voice production parallels that occurring in earthquakes 47. As increasing Θ induces a flow in RG space, systems which lie close to a critical point (an unstable fixed point in RG space) show scale invariance under Θ, and hence the distributions can be collapsed into a Θ-independent shape, thereby eliminating the trivial dependence on Θ. This has been shown to be the case for human voice and accordingly (technical details can be found in the SI) we can express the collapsed energy distribution as

P(E) ~ E^{−φ} 𝒢(E/E_c),  E > E_l,

where E_l is the lower limit beyond which this law is fulfilled, 𝒢 is a scaling function accounting for the exponential cut-off at E_c, and the relevant variable is φ, the scaling exponent. In the outset of Fig. 2 we show the result of this analysis for the case of the Spanish language, where we find φ ≈ 1.15, this exponent being approximately language-independent (see Table 1 for other languages and SI for additional details). Interestingly, these exponents are compatible with those found in rainfall, another natural system that has been shown to be compatible with SOC dynamics 74, and cannot be explained by simple null models 47. While the precise values of the exponents are not extremely relevant, one should take them with caution, as thresholding is known to introduce in some cases finite-size biases in the form of a second scaling region at small sizes which is only removed when the series is long enough 75.
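As a concrete, simplified illustration of how such an exponent can be estimated, the standard continuous maximum-likelihood estimator for a power-law tail reads φ̂ = 1 + n / Σ ln(E_i/E_l), with the sum over the n events above the cut-off E_l. The sketch below uses this textbook estimator and is not the authors' exact fitting pipeline:

```python
import numpy as np

def mle_exponent(samples, e_min):
    """Continuous maximum-likelihood estimate of a power-law exponent.

    For P(E) ~ E^(-phi) above a lower cutoff e_min, the MLE is
    phi_hat = 1 + n / sum(log(E_i / e_min)) over the tail E_i >= e_min.
    """
    tail = np.asarray(samples, dtype=float)
    tail = tail[tail >= e_min]
    return 1.0 + tail.size / np.log(tail / e_min).sum()

# Sanity check on synthetic data drawn from a pure power law with
# phi = 1.15 (inverse-transform sampling above e_min = 1).
rng = np.random.default_rng(1)
u = 1.0 - rng.random(200_000)            # uniform in (0, 1]
synthetic = 1.0 * u ** (-1.0 / 0.15)     # phi - 1 = 0.15
phi_hat = mle_exponent(synthetic, 1.0)   # close to 1.15
```

In practice the empirical distribution also carries the exponential cut-off, so the fit is restricted to the scaling region E_l < E < E_c.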
In what follows we explore the emergence of classical linguistic laws in these acoustic signals.
Zipf's Law. The illustrious George Kingsley Zipf formulated a statistical observation which is popularly known as Zipf's law 76. In its original formulation 3, it establishes that in a sizable sample of language the number of different words (vocabulary) N(n) which occur exactly n times decays as N(n) ~ n^{−ζ}, where the exponent ζ varies from text to text 77 but is usually close to 2. An alternative and perhaps more common formulation of this law 4 is defined in terms of the rank: if words are ranked in decreasing order of their frequency of appearance, then the number of occurrences of words with a given rank r goes as n(r) ~ r^{−z}, where it is easy to see that both exponents are related via z = 1/(ζ − 1), and thus z is usually close to 1 16,78. Here for convenience we make use of the former and explore N_Θ(n) applied to the statistics of types. Interestingly, while Zipf's (and Heaps') laws were originally proposed to quantify order in the word arrangement of written texts, these laws indeed describe the arrangement of informational units that go beyond words in texts (see for instance 7,11), and there exist several theoretical justifications for the presence of these laws which resort not to linguistics but to more fundamental concepts, such as principles of compression in information theory 12,26, which do not require the units to adhere to any particular type of communication. As a matter of fact, a mathematical relation between the energy-release statistics and Zipf's law can be found. First, note that the frequency of type r is proportional to the probability mass of P(E) in the r-th energy bin,

f(r) ∝ ∫ from (r−1)b to rb of E^{−φ} dE ∝ (r − 1)^{1−φ} − r^{1−φ} ≈ (φ − 1) r^{−φ},  (1)

where the asymptotic approximation is easily found by expanding (r − 1)^{1−φ} up to second order.
Now, since P(E) is monotonically decreasing, there is a formal equivalence between type r and rank r: the most frequent type is type '1' and thus has rank 1, the second most frequent type is '2', and so on. The type with label r will have rank r, as the frequencies of types are by construction naturally ordered in a monotonically decreasing way. Therefore the frequency of rank r is proportional to eq. 1. Accordingly f(r) ≡ n(r), so we predict z ≈ φ and thus ζ ≈ 1 + 1/φ. We now consider experimental results. Again, N_Θ(n) could in principle depend on the threshold but, assuming that the signal complies with the scale invariance mentioned above, one can collapse all threshold-dependent curves into a universal shape and thus remove any dependence on this parameter by rescaling n → n/LV and N_Θ(n) → N_Θ(n)VL, where V is the total number of different types present in the signal and L the total number of tokens (see SI for technical details). Results are shown for the case of the Basque language in Fig. 3, where a clear threshold-independent decaying power law emerges with a scaling exponent ζ ≈ 1.77. Analogous results with compatible exponents for other languages can be found in the SI and Table 1. The relation between the exponents ζ and φ is in good agreement with our theoretical prediction. Null models systematically deviate from these results, and display neither the characteristic power-law decay nor any invariance under variation of the energy threshold (SI). Power-law fits follow the methodology of ref. 89, and the goodness-of-fit tests and confidence intervals are based on 2500 Kolmogorov-Smirnov (KS) tests. The p-value for the evaluation is defined as the fraction of synthetic sample distributions whose KS distance to the best-fit power law is larger than the KS distance between the empirical distribution and its best-fit power-law model.
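Computing N_Θ(n) from the symbolic sequence of types amounts to a frequency-of-frequencies count; a minimal sketch (helper name ours):

```python
from collections import Counter

def frequency_spectrum(types):
    """Zipf's law in its original form: N(n), the number of distinct
    types that occur exactly n times in the sequence."""
    occurrences = Counter(types)          # type label -> occurrence count n
    return Counter(occurrences.values())  # n -> number of types occurring n times
```

Fitting a power law to the decay of N(n) then yields the exponent ζ reported in Table 1.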
In all cases, the bootstrap p-value of the Kolmogorov-Smirnov test is greater than 0.99, meaning that in more than 99% of the 2500 trials the synthetic sample distribution is farther from the best-fit power law than the empirical data, hence implying that the power-law hypothesis cannot be rejected. The exponents associated with energy release are compatible with those found in rainfall 74. The results are compatible with the hypothesis of language independence.
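The bootstrap procedure just described can be sketched as follows. This is a simplified version: for brevity the fitted exponent is kept fixed when generating synthetic samples, whereas a full goodness-of-fit test in the style of ref. 89 would refit the exponent on every synthetic sample; all names are ours.

```python
import numpy as np

def ks_distance(samples, phi, e_min):
    """KS distance between the empirical CDF of the tail (E >= e_min)
    and the fitted power-law CDF F(E) = 1 - (E/e_min)^(1 - phi)."""
    tail = np.sort(np.asarray(samples, dtype=float))
    tail = tail[tail >= e_min]
    n = tail.size
    model = 1.0 - (tail / e_min) ** (1.0 - phi)
    empirical = np.arange(1, n + 1) / n
    return np.abs(empirical - model).max()

def bootstrap_pvalue(samples, phi, e_min, n_boot=2500, seed=0):
    """Fraction of synthetic power-law samples whose KS distance to the
    model exceeds the empirical one (a large p-value means the power-law
    hypothesis cannot be rejected)."""
    rng = np.random.default_rng(seed)
    d_emp = ks_distance(samples, phi, e_min)
    n = np.asarray(samples)[np.asarray(samples) >= e_min].size
    exceed = 0
    for _ in range(n_boot):
        u = 1.0 - rng.random(n)                      # uniform in (0, 1]
        synth = e_min * u ** (-1.0 / (phi - 1.0))    # inverse-transform sampling
        if ks_distance(synth, phi, e_min) > d_emp:
            exceed += 1
    return exceed / n_boot
```

Data well described by the model give a p-value near 1, while data far from a power law give a p-value near 0.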
Scientific RepoRts | 7:43862 | DOI: 10.1038/srep43862
Heaps' law. Together with Zipf's law, and mathematically connected to it (see ref. 2 and references therein), the second classical linguistic law is Heaps' law: the sublinear growth of the number of different words V in a text with text size L (measured in total number of words), V ~ L^α, α < 1 14,79 (a constant rate of appearance of new words leads to α = 1). Here the vocabulary V is defined as the total number of different types that appear in the signal, whereas L is defined as the total number of tokens found for a given threshold. Results are shown for a specific language in Fig. 4 (see SI for the rest). In the outset panel we present the collapsed, threshold-independent curves, where again we find a scaling law with an effective exponent α′ related to the original one via α′ = α/(1 + α). The equivalent computation on the null model yields a Heaps law with the trivial exponent α ≈ 1 (SI). These results are quantitatively consistent with previous results on written texts 16. In particular, several authors 18 point out that, at least asymptotically, the relation ζ = 1 + α holds to good approximation, and this is in reasonably good agreement with our findings in human voice as well. Interestingly, a recent work 80 has found that, as opposed to Indo-European (alphabetically based) languages, Zipf's law breaks down and Heaps' law reduces to the trivial case for written texts in Chinese, Japanese, Korean and other logosyllabic languages. Applying our methodology to the recordings of such logosyllabic languages (Japanese, Mandarin, Korean), our results are at odds with ref. 80, as we still find evidence of Zipf's and Heaps' laws holding for these languages in terms of the energetic voice fluctuations. We conclude that the differences found when written texts are analyzed do not arise when we study human voice directly.
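Measuring Heaps' law on the symbolic sequence amounts to tracking vocabulary growth token by token; a minimal sketch (the exponent α would then be fitted on the resulting curve in log-log scale):

```python
def heaps_curve(types):
    """Vocabulary growth V(L): the number of distinct types seen after
    the first L tokens of the sequence, for L = 1 .. len(types)."""
    seen, curve = set(), []
    for t in types:
        seen.add(t)
        curve.append(len(seen))
    return curve
```

For the null model this curve grows essentially linearly (α ≈ 1), whereas speech signals yield the sublinear growth reported above.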
Brevity law. The tendency of more frequent words to be shorter 3,4,21 can be generalized as the tendency of more frequent elements to be shorter or smaller, and its origin has been suggested to be related to optimization and information-compression arguments, in connection with other linguistic laws 81. In acoustics, spontaneous speech indeed tends to obey this law after text segmentation 82, and the law has also been found in other non-human primates 28,81. Here we can test the brevity law in essentially two different ways. First, note that voice events (tokens) map into types according to the linear binning of their associated energy, hence voice events with different durations might yield the same type, as previously noted. Thus for each type we can compute its mean duration t by averaging over all voice events that fall within that type, and then plot the histogram M_Θ(t) that describes the frequency of each type versus its mean duration. The brevity law requires M_Θ(t) to be a monotonically decreasing function. These results are shown for a particular language in log-log scales in Fig. 5, where we initially find a decaying power-law relation indicative of a brevity law (results are again found to be language-independent; see SI for additional results). The inset provides the threshold-dependent distributions and the outset panel provides the collapsed, threshold-independent shape M(t) (see Table 1 for scaling exponents). Again, results for the null models deviate from this behavior (SI) and are clearly different from random typing 12. Alternatively, one can also directly observe the duration frequency at the level of voice events, finding similar results (see SI).
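The first brevity test described above reduces to tabulating, for each type, its frequency and its mean token duration; a minimal sketch with hypothetical helper names:

```python
from collections import defaultdict

def type_frequency_and_duration(types, durations):
    """For each type, return a (frequency, mean duration) pair.

    The brevity law predicts that frequency decreases monotonically
    with mean duration, i.e. more frequent types are shorter.
    """
    acc = defaultdict(list)
    for t, d in zip(types, durations):
        acc[t].append(d)
    return {t: (len(ds), sum(ds) / len(ds)) for t, ds in acc.items()}
```

Plotting frequency against mean duration for all types in log-log scale then produces the histogram M_Θ(t) of Fig. 5.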

Discussion
In this work we have explored the equivalent of linguistic laws directly in acoustic signals. We have found that human voice manifests the analogues of classical linguistic laws found in written texts (Zipf's law, Heaps' law and the brevity law or law of abbreviation). These laws are found to be invariant under variation of the energy threshold Θ, and can accordingly be collapsed onto universal functions. As Θ is the only free parameter of the method, this invariance guarantees that the results are not afflicted by ambiguities associated with arbitrarily defined unit boundaries. Results are robust across a list of 16 different languages (Indo-European and non-Indo-European, including some logosyllabic ones), across timescales (extending all the way into the intraphoneme range, where no cognitive effects operate, and invariant under energy-threshold variation) and across conversational modes (one or more speakers, both planned and spontaneous speech, diverse environmental conditions such as studio recordings, outside journalist reports and telephonic channels). Interestingly, an equivalent analysis performed on null models defined by randomizing the signal ε(t) (yielding white noise with the same instantaneous-energy distribution as the original signal) fails to reproduce this phenomenology (SI). The concrete ranges of exponents found for both Zipf's and Heaps' laws are compatible with each other and somewhat similar, though not identical, to the typical ones observed in the literature for written texts 2,7,76,80, whereas to the best of our knowledge this is the first observation of scaling behavior with a clear exponent in the case of the brevity law in speech. Indeed, our finding of a power law in the brevity law differs from the case of random typing, where no such power law holds 12.
The specific and complex alternation of air stops (silences) intertwined with voice production is at the core of the microscopic voice fluctuations. During voice production, acoustic communication is governed by the so-called biphasic cycle (breath and glottal cycle; see ref. 83 for a review), which together with some other acoustic considerations (pitch period, voice onset time, the relation between duration, stress and syllabic structure 82) determines the microscopic structure of human voice, including silence stops. However, these timescales are in general very large: as previously stated, the current study focuses on and scans voice properties even at intraphonemic timescales, where the statistical laws of language emerge directly from the physical magnitudes that govern acoustic communication.
Our results therefore open the possibility of speculating whether the fact that these laws have been found at upper levels of human communication might be the result of a scaling process and a byproduct of the physics, rather than derived from the choice of the typical units of study in the analysis of written corpora (phonemes, syllables, words, …), as the differences between analyses of Indo-European and logosyllabic languages demonstrate 80. As a matter of fact, in a previous work human voice has been framed within self-organized criticality (SOC), speculating that the fractal structure of the lungs drives human voice close to a critical state 47, this mechanism being ultimately responsible for the microscopic self-similar fluctuations of the signal. This constitutes a new example of the emergence of SOC in a physiological system, different in principle from the classical one found in neuronal activity 84. Along the same line, note that the set of critical exponents characterizing the SOC nature of human voice has been shown to be very similar to the one found in another SOC system: rainfall 74. The celebrated theory of critical phenomena explains that many physical processes, when poised close to a critical point, essentially reduce to the same phenomenon despite having different physical origins, and can be described by a set of critical exponents that classifies the core phenomenon into universality classes. As rainfall and voice production share the same critical exponent for energy release, this suggests that in essence a similar mechanism could be triggering both types of threshold-like dynamics.
Figure 5. In the inner panel we plot, for different thresholds, the histogram M_Θ(t) that describes the relative frequency of a type of mean duration t. In every case we find a monotonically decreasing curve which yields a brevity law. In the outset panel we present the collapsed, threshold-independent curve M(t), which evidences an initial power-law decay with an exponent β ≈ 2.9.
In any case, one could thus speculate that the emergence of linguistic scaling laws is just a consequence of the presence of SOC, which in turn would support the physical origin of linguistic laws. From an evolutionary viewpoint, under this latter perspective human voice, understood as a communication system which has been optimized under evolutionary pressures, would constitute an example where complexity (described in terms of robust linguistic laws) emerges when a system is driven close to criticality, something reminiscent of the celebrated edge-of-chaos hypothesis 85. Furthermore, under this interpretation the onset of complexity in language (as reported by linguistic laws in written texts) could be a direct byproduct of similar patterns already emanating on purely physiological grounds (the physics of energy fluctuations in voice production): in this respect the role played by cognitive processes would be unclear, as the timescales involved in the physiological mechanisms of voice production (intraphonemic) are typically shorter than those associated with cognitive processes. More succinctly, the classical explanation for the origin of scaling laws in language in terms of scaling phenomena in cognitive processes could be challenged by this new evidence. Certainly, we are not claiming that statistical laws of language are only a consequence of human physiology, among other reasons because the clearest evidence leads us to consider the presence of these laws in texts 2,7, but the plausible causal link between scaling laws in acoustics and the analogous ones at the language level is suggestive and should be studied in depth.
On more practical grounds, the method used and proposed here also addresses the longstanding problem of signal segmentation. It has been acknowledged that there is no such thing as a 'correct' segmentation strategy 86. In written corpora, white space is usually taken as an easy marker of the separation between words; however, this is far from evident in continuous speech, where the separation between words or concepts is technologically harder to detect, conceptually vague and probably ill-defined. The few exceptions that use oral corpora for animal communication still require defining ad hoc segmentation algorithms 28, or manual segmentation strategies which usually yield arbitrary or overestimated segmentation times or windows 81, which might even raise epistemological questions. As such, this segmentation problem has unfortunately prevented wider comparative studies in areas such as animal communication or the search for signs of possible extraterrestrial intelligence in radio signals (in this line only a few proposals have been made 61). By varying the energy threshold, the method presented here automatically partitions and symbolizes the signal at various energy scales, providing a recipe to establish an automatic, general and systematic way of segmenting, and thus enabling comparisons across acoustic signals of arbitrary origin for which we may lack the syntax, the code or the exact size of the constituents.
To round off, we hope that this work paves the way for new research avenues in comparative studies. Open questions that deserve further work abound; just to name a few: in the light of this new method, what can we say about the acoustic structure of other animal communication systems? Can we find evidence of universal traits in communication that do not depend on a particular species but are only physically and physiologically constrained, or, on the other hand, are linguistic universals a myth 87? How do these laws evolve with aging 6,88? Are they affected by cognitive or pulmonary diseases? What is the precise relation between SOC and linguistic laws in this context? And in particular, can we find mathematical evidence of a minimal, analytically tractable SOC model that produces these patterns? These and other questions are interesting avenues for future work.