Scaling Laws in Human Language

Zipf's law on word frequency is observed in English, French, Spanish, Italian, and so on, yet it does not hold for Chinese, Japanese or Korean characters. A model of the writing process is proposed to explain this difference, which takes into account the effects of finite vocabulary size. Experiments, simulations and analytical solutions agree well with each other. The results show that the frequency distribution follows a power law with exponent equal to 1, at which the corresponding Zipf's exponent diverges. In fact, the distribution takes an exponential form in the Zipf's plot. Deviating from Heaps' law, the number of distinct words grows with the text length in three stages: it grows linearly in the beginning, then turns to a logarithmic form, and eventually saturates. This work refines the previous understanding of Zipf's law and Heaps' law in language systems.

* Electronic address: zhutou@ustc.edu
Zipf's law in language systems could result from a rich-get-richer mechanism, as suggested by the Yule-Simon model [23,24], where a new word is added to a text with probability $q$ and a previously used word is randomly chosen and copied with probability $1 - q$. A word that has appeared more frequently thus has a higher probability of being copied, leading to a power-law word frequency distribution $p(k) \sim k^{-\beta}$ with $\beta = 1 + 1/(1 - q)$. Dorogovtsev and Mendes modeled language processing as the evolution of a word web with preferential attachment [25]. Zanette and Montemurro [26] as well as Cattuto et al. [27] accounted for memory effects, whereby recently used words have a higher probability of being chosen than words that occurred long ago. These works can be considered variants of the Yule-Simon model. Meanwhile, Heaps' law may originate from the memory and bursty nature of human language [28][29][30].
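The Yule-Simon process described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the cited works: the function name `yule_simon`, the seed, and the values `steps=20000` and `q=0.1` are assumptions chosen for the example.

```python
import random
from collections import Counter

def yule_simon(steps, q, seed=0):
    """Generate a text of `steps` tokens with the Yule-Simon process."""
    rng = random.Random(seed)
    text = [0]        # start with a single word, labelled 0
    next_word = 1     # label to give to the next new word
    for _ in range(steps - 1):
        if rng.random() < q:
            text.append(next_word)   # add a brand-new word
            next_word += 1
        else:
            # Copying a uniformly random earlier token picks each word with
            # probability proportional to its current count (rich-get-richer).
            text.append(rng.choice(text))
    return text

q = 0.1                                   # illustrative value
text = yule_simon(steps=20000, q=q)
counts = Counter(text)                    # word -> number of occurrences
freq_of_freq = Counter(counts.values())   # k -> number of words used k times
beta_expected = 1 + 1 / (1 - q)           # exponent quoted in the text
```

Plotting `freq_of_freq` on log-log axes would be one way to compare the simulated distribution with the exponent `beta_expected`.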
Real language systems deviate to some extent from these two scaling laws and display more complicated statistical regularities. Wang et al. [31] analyzed representative publications in Chinese and showed that the character frequency distribution exhibits an exponential feature. Lü et al. [32] pointed out that in a growing system, if the frequencies of elements obey Zipf's law with a stable exponent, then the number of distinct elements grows in a complicated way, with Heaps' law being only an asymptotic approximation. This deviation from Heaps' law was further emphasized and mathematically proved by Eliazar [33]. Empirical analyses of real language systems showed similar deviations [34]. Via extensive analysis of individual Chinese, Japanese and Korean books, as well as a collection of more than $5 \times 10^4$ Chinese books, we found even more complicated phenomena: (i) the character frequency distribution follows a power law, yet it decays exponentially in the Zipf's plot; (ii) as the text length increases, the number of distinct characters grows in three different stages: linear, logarithmic and saturated. All these unreported regularities may result from the finite vocabulary size, which is further verified by a simple theoretical model.
[Figure caption fragment: ... The History of the Three Kingdoms, respectively. The power-law exponent $\beta$ is obtained by maximum likelihood estimation [35,36], while the exponent in the Zipf's plot is obtained by the least-squares method excluding the head (i.e., $r > 500$ for Chinese books and $r > 200$ for Japanese and Korean books).]
We first show some experimental results on the statistical regularities of Chinese, Japanese and Korean literature, which are typical examples of texts generated from a vocabulary of very limited size if we look at the character level. There are in total more than $9 \times 10^4$ Chinese characters, yet only 3000 to 4000 of them are used frequently (Taiwan and Hong Kong respectively identify 4808 and 4759 frequently used characters, while mainland China has two versions of the list of frequently used characters, one containing 2500 characters and the other 3500 characters), and the numbers of Japanese and Korean characters are even smaller. We start with four famous books: the first two are in Chinese, the third one is [...] the total number of characters that appear in the text. As shown in figure 1, the character frequency distributions are power laws, while the frequency decays exponentially in the Zipf's plot. This conflicts with the common belief that a power-law probability density function always corresponds to a power-law decay in the Zipf's plot. In fact, the Zipf exponent $\alpha$ and the power-law exponent $\beta$ are related by $\alpha = \frac{1}{\beta - 1}$ [32], so when $\beta$ gets close to 1, $\alpha$ diverges and the decay in the Zipf's plot can no longer be well characterized by a power law. Therefore, if we observe a non-power-law decay in the Zipf's plot, we cannot immediately deduce that the corresponding probability density function is not a power law: it may be a power law with exponent close to 1. Note that, in the Zipf's plots, the turned-up head contains a few hundred characters, the majority of which play a role similar to that of auxiliary words, conjunctions or prepositions in English. Figure 1 also indicates that the growth of distinct characters cannot be described by Heaps' law.
Indeed, there are two distinguishable stages: in the early stage, $N_t$ grows approximately linearly with the text length $t$, and in the later stage, $N_t$ grows logarithmically with $t$. Figure 3 presents the growth of distinct characters for a large collection of 57755 Chinese books consisting of about $3.4 \times 10^9$ characters and 12800 distinct characters. In addition to the stages observed in figure 1 and figure 2, $N_t$ displays a strongly saturated behavior when the text length $t$ is much larger than the total number of distinct characters in the vocabulary. In summary, the experiments on Chinese, Japanese and Korean literature reveal some unreported phenomena: the character frequency obeys a power law with exponent close to 1 yet decays exponentially in the Zipf's plot, and the number of distinct characters grows in three distinguishable stages. We next propose a theoretical model to explain these observations.
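The three quantities examined in these experiments (the frequency distribution $p(k)$, the ranked frequencies $Z(r)$ of the Zipf's plot, and the number of distinct characters $N_t$) can be measured from any character sequence. The sketch below is only an illustration of the definitions: the helper name `text_statistics` and the toy input string are assumptions, and no data from the books analyzed here are reproduced.

```python
from collections import Counter

def text_statistics(text):
    """Return (freq_dist, zipf, growth) for a sequence of characters."""
    counts = Counter(text)        # character -> number of occurrences
    n_distinct = len(counts)
    # p(k): fraction of distinct characters that occur exactly k times.
    freq_dist = {k: n / n_distinct
                 for k, n in sorted(Counter(counts.values()).items())}
    # Zipf's plot: frequencies in descending order; zipf[r - 1] is Z(r).
    zipf = sorted(counts.values(), reverse=True)
    # Growth curve: growth[t - 1] is the number of distinct characters N_t
    # among the first t characters.
    seen, growth = set(), []
    for ch in text:
        seen.add(ch)
        growth.append(len(seen))
    return freq_dist, zipf, growth

# Toy input, used only to show the shape of the outputs.
freq_dist, zipf, growth = text_statistics("abracadabra")
```

For a real book, `text` would be the full character sequence, and `zipf` and `growth` would be plotted on semi-logarithmic axes to look for the exponential decay and the logarithmic growth stage.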
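Consider a vocabulary with a finite number, $V$, of distinct characters or words. At each time step, one character is selected from the vocabulary to form the text. Motivated by the rich-get-richer mechanism of the Yule-Simon model, at time step $t$, if character $i$ has been used $k_i$ times, it is selected with probability
$$f(k_i) = \frac{k_i + \varepsilon}{\sum_{j=1}^{V} (k_j + \varepsilon)} = \frac{k_i + \varepsilon}{V\varepsilon + t}, \tag{1}$$
where $\varepsilon$ is the initial attractiveness of each character. Assume that at time $t$ there are $N_t$ distinct characters in the text; we first investigate the dependence of $N_t$ on the text length $t$. The selection at time $t + 1$ can be equivalently divided into two complementary and mutually exclusive actions: (i) to select a character from the original vocabulary with probability proportional to $\varepsilon$, or (ii) to select a character from the $N_t$ characters in the created text with probability proportional to its frequency of appearance. Therefore the probability of choosing a character from the original vocabulary is $\frac{V\varepsilon}{V\varepsilon + t}$, whereas that of choosing from the created text is $\frac{t}{V\varepsilon + t}$. A character chosen from the created text is always old, while a character chosen from the vocabulary is new with probability $1 - \frac{N_t}{V}$. Accordingly, the probability that a new character appears at time step $t + 1$, namely the growth rate of $N_t$, is
$$\frac{dN_t}{dt} = \frac{V\varepsilon}{V\varepsilon + t}\left(1 - \frac{N_t}{V}\right). \tag{2}$$
With the boundary conditions $N_0 = 0$ and $N_\infty = V$, we derive the solution of Eq. 2 as
$$N_t = V\left[1 - \left(\frac{V\varepsilon}{V\varepsilon + t}\right)^{\varepsilon}\right]. \tag{3}$$
This solution embodies three stages of growth of $N_t$. (i) In the very early stage, when $t$ is much smaller than $V\varepsilon$, $\left(\frac{V\varepsilon}{V\varepsilon + t}\right)^{\varepsilon} \approx 1 - \frac{t}{V}$ and thus $N_t \approx t$, corresponding to a short period of linear growth. (ii) When $t$ is of the same order as $V\varepsilon$, if $\varepsilon$ is very small, $N_t$ could be much smaller than $V$. Then Eq. 2 can be approximated as
$$\frac{dN_t}{dt} \approx \frac{V\varepsilon}{V\varepsilon + t}, \tag{4}$$
leading to a logarithmic solution
$$N_t \approx V\varepsilon \ln\left(1 + \frac{t}{V\varepsilon}\right). \tag{5}$$
Indeed, expanding $\left(\frac{V\varepsilon}{V\varepsilon + t}\right)^{\varepsilon}$ in a Taylor series as
$$\left(\frac{V\varepsilon}{V\varepsilon + t}\right)^{\varepsilon} = \sum_{m=0}^{\infty} \frac{1}{m!}\left[-\varepsilon \ln\left(1 + \frac{t}{V\varepsilon}\right)\right]^{m} \tag{6}$$
and neglecting the high-order terms ($m \geq 2$) under the condition $\varepsilon \ll 1$, one also arrives at the solution Eq. 5.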
(iii) As $t$ grows larger and larger, $N_t$ approaches $V$, so both $\frac{V\varepsilon}{V\varepsilon + t}$ and $1 - \frac{N_t}{V}$ become very small, leading to a very slow growth of $N_t$ according to Eq. 2. These three stages predicted by the analytical solution are in good accordance with the above empirical observations. Figure 3 reports the numerical results on Eq. 3. In accordance with the analysis, when $t$ is small, $N_t$ grows linearly as shown in Fig. 3(a) and 3(c), and in Fig. 3(b) and 3(d) straight lines appear in the middle region, indicating the logarithmic growth predicted by Eq. 5.
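The three stages can be checked numerically by comparing the closed form of Eq. 3 with the logarithmic approximation of Eq. 5 and with a direct integration of Eq. 2. The following is a minimal sketch: the function names, the forward-Euler scheme with step `dt`, and the parameter values $V = 10^4$ and $\varepsilon = 0.01$ are assumptions made for illustration, not the settings behind the figures.

```python
import math

def n_exact(t, V, eps):
    """Closed form of Eq. 3: N_t = V * (1 - (V*eps / (V*eps + t))**eps)."""
    return V * (1.0 - (V * eps / (V * eps + t)) ** eps)

def n_log(t, V, eps):
    """Intermediate-stage approximation of Eq. 5: V*eps*ln(1 + t/(V*eps))."""
    return V * eps * math.log(1.0 + t / (V * eps))

def n_euler(t_max, V, eps, dt=0.01):
    """Integrate Eq. 2 from N_0 = 0 up to t_max with forward Euler steps."""
    n, t = 0.0, 0.0
    for _ in range(int(round(t_max / dt))):
        n += dt * V * eps / (V * eps + t) * (1.0 - n / V)
        t += dt
    return n

V, eps = 10_000, 0.01    # illustrative values, so that V*eps = 100
for t in (1, 10, 100, 1_000, 10_000):
    print(t, n_exact(t, V, eps), n_log(t, V, eps))
print(n_euler(1_000, V, eps), n_exact(1_000, V, eps))
```

For $t \ll V\varepsilon$ the first column of values should stay close to $t$, and for $t$ of order $V\varepsilon$ or somewhat larger the two columns should remain close to each other. Note that with a very small $\varepsilon$ the saturation stage is reached only at extremely large $t$, since Eq. 3 approaches $V$ as $(V\varepsilon/t)^{\varepsilon}$ vanishes.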
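Denote by $n(t, k)$ the number of distinct characters that have appeared $k$ times up to time $t$; then $n(t, k) = N_t\, p(k)$. According to the master equation, we have
$$n(t + 1, k + 1) = n(t, k + 1)\,[1 - f(k + 1)] + n(t, k)\, f(k). \tag{7}$$
Substituting Eq. 1 into Eq. 7, we obtain
$$N_{t+1}\, p(k + 1) = N_t\, p(k + 1)\left[1 - \frac{k + 1 + \varepsilon}{V\varepsilon + t}\right] + N_t\, p(k)\, \frac{k + \varepsilon}{V\varepsilon + t}. \tag{8}$$
Via the continuous approximation, it turns into the following differential equation
$$\frac{dp}{dk} = -\frac{(N_{t+1} - N_t)(V\varepsilon + t) + N_t}{N_t\,(k + \varepsilon)}\, p(k). \tag{9}$$
Substituting $N_{t+1} - N_t = dN_t/dt$ and Eq. 2, we get the solution
$$p(k) = B\,(k + \varepsilon)^{-\left[1 + \varepsilon\left(\frac{V}{N_t} - 1\right)\right]}, \tag{10}$$
where $B$ is the normalization factor. The result shows that the character frequency follows a power-law distribution with an exponent that changes in time. Considering the finite vocabulary size, in the limit of large $t$, $N_t \to V$ and thus the power-law exponent, $\beta = 1 + \varepsilon\left(\frac{V}{N_t} - 1\right)$, approaches 1. Under the continuous approximation, the cumulative distribution of character frequency can be written as
$$P(k) = \int_{k_{\min}}^{k} p(k')\, dk' = \frac{B}{1 - \beta}\left[(k + \varepsilon)^{1 - \beta} - (k_{\min} + \varepsilon)^{1 - \beta}\right], \tag{11}$$
where $k_{\min}$ is the smallest frequency. When $\beta \to 1$, $k^{1 - \beta} \approx 1 + (1 - \beta)\ln k$, and thus
$$P(k) \approx B \ln\frac{k + \varepsilon}{k_{\min} + \varepsilon}, \tag{12}$$
where $B \approx \left[\ln\frac{k_{\max} + \varepsilon}{k_{\min} + \varepsilon}\right]^{-1}$ according to the normalization condition $\int_{k_{\min}}^{k_{\max}} p(k)\, dk = 1$, and $k_{\max}$ is the highest frequency. According to Eq. 12, there are $\left[1 - B \ln\frac{k + \varepsilon}{k_{\min} + \varepsilon}\right] N_t$ characters that have appeared more than $k$ times. That is to say, a character that has appeared $k$ times will be ranked at $r = 1 + \left[1 - B \ln\frac{k + \varepsilon}{k_{\min} + \varepsilon}\right] N_t$. Therefore
$$Z(r) = (k_{\min} + \varepsilon)\exp\left[\frac{N_t - r + 1}{B N_t}\right] - \varepsilon, \tag{13}$$
with $Z(1) = k_{\max}$ and $Z(N_t) \approx k_{\min}$ for large $N_t$. In short, this simple model accounting for the finite vocabulary size results in a power-law character frequency distribution $p(k) \sim k^{-\beta}$ with exponent $\beta$ close to 1 and an exponential decay of $Z(r)$ in the Zipf's plot, in good agreement with the empirical observations on Chinese, Japanese and Korean books. Figure 4 reports the simulation results for typical parameters. The power-law frequency distribution, the exponential decay of frequency in the Zipf's plot and the linear-to-logarithmic transition in the growth of the number of distinct characters are all clearly observed in the simulation. The simulation results agree very well with the analytical solutions presented in Eq. 3, Eq. 10 and Eq. 13.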
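A direct simulation of the model can be sketched as follows. It uses the two-action decomposition described earlier (draw uniformly from the vocabulary with probability $V\varepsilon/(V\varepsilon + t)$, otherwise copy a uniformly random earlier character), which is equivalent to selecting character $i$ with probability $(k_i + \varepsilon)/(V\varepsilon + t)$. The function name `simulate_text`, the seed, and the parameters $V = 1000$, $\varepsilon = 0.1$ and $5 \times 10^4$ steps are assumptions for illustration, not the parameters used for Figure 4.

```python
import random
from collections import Counter

def simulate_text(V, eps, steps, seed=0):
    """Simulate the finite-vocabulary rich-get-richer model.

    Returns the generated text (character labels 0..V-1) and the growth
    curve, where growth[t - 1] is the number of distinct characters N_t.
    """
    rng = random.Random(seed)
    text, seen, growth = [], set(), []
    for t in range(steps):
        if rng.random() < V * eps / (V * eps + t):
            ch = rng.randrange(V)    # uniform draw from the whole vocabulary
        else:
            ch = rng.choice(text)    # copy: probability proportional to k_i
        text.append(ch)
        seen.add(ch)
        growth.append(len(seen))
    return text, growth

V, eps, steps = 1000, 0.1, 50_000    # illustrative parameters
text, growth = simulate_text(V, eps, steps)

# Analytical prediction of Eq. 3 at the final text length.
n_theory = V * (1 - (V * eps / (V * eps + steps)) ** eps)

# Ranked frequencies for the Zipf's plot; zipf[r - 1] is Z(r).
zipf = sorted(Counter(text).values(), reverse=True)
print(growth[-1], round(n_theory, 1))
```

Plotting `zipf` against rank on a semi-logarithmic scale, and `growth` against $t$ with a logarithmic horizontal axis, is one way to look for the exponential decay of $Z(r)$ and the logarithmic growth stage in such a run.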
Previous statistical analyses of human language have overwhelmingly concentrated on the Indo-European family, where each language consists of a huge number of words. In contrast, languages consisting of characters, though used by more than a billion people, have received less attention. These languages include Chinese, Japanese, Korean, Vietnamese, Jurchen language, Khitan language, Makhi language, Tangut language, and many others. The empirical studies here show that character-based languages follow scaling laws remarkably different from those of word-based languages. Salient features include an exponential decay of character frequency in the Zipf's plot associated with a power-law frequency distribution with exponent close to 1, and a multi-stage growth of the number of distinct characters. These findings not only complement our understanding of scaling laws in human language, but also refine the knowledge about the relationship between the power law and Zipf's law, as well as the applicability of Heaps' law. As a result, we should be careful when applying the Zipf's plot to a power-law distribution with exponent around 1, such as the cluster size distribution in two-dimensional self-organized critical systems [37], the inter-event time distribution in human activities [38], the family name distribution in Korea [39], the species lifetime distribution [40], and so on. Meanwhile, we cannot rule out a possible power-law distribution merely because of a non-power-law decay in the Zipf's plot [31].
The scaling laws reported here can be reproduced by considering a finite vocabulary size in a rich-get-richer process. Unlike the well-known finite-size effects that vanish in the thermodynamic limit, the effects caused by the finite vocabulary size get stronger as the system size increases. Finite choices must be a general feature of selecting dynamics, but they are not a necessary ingredient of growing systems. For example, although also based on the rich-get-richer mechanism, neither the linear growing model [41] nor the accelerated growing model [42] (treating total degree as the text length and nodes as distinct characters, the accelerated networks grow in the Heaps' manner [32]) considers such an ingredient. The present model could thus distinguish selecting dynamics from the general dynamics of growing systems.
This work is partially supported by the Swiss National Science Foundation (Project 200020-132253) and the Fundamental Research Funds for the Central Universities.