Deviation of Zipf's and Heaps' Laws in Human Languages with Limited Dictionary Sizes

Zipf's law on word frequency and Heaps' law on the growth of distinct words are observed in the Indo-European language family, but they do not hold for languages like Chinese, Japanese and Korean. These languages consist of characters and have very limited dictionary sizes. Extensive experiments show that: (i) The character frequency distribution follows a power law with exponent close to one, at which the corresponding Zipf's exponent diverges. Indeed, the character frequency decays exponentially in the Zipf's plot. (ii) The number of distinct characters grows with the text length in three stages: it grows linearly in the beginning, then turns to a logarithmic form, and eventually saturates. A theoretical model of the writing process is proposed, which embodies the rich-get-richer mechanism and the effects of limited dictionary size. Experiments, simulations and analytical solutions agree well with each other. This work refines the understanding of Zipf's and Heaps' laws in human language systems.

Via extensive analysis of Chinese, Japanese and Korean books, we found even more complicated phenomena: (i) The character frequency distribution follows a power law with exponent close to one, at which the corresponding Zipf's exponent diverges. Indeed, the character frequency decays exponentially in the Zipf's plot. (ii) The number of distinct characters grows with the text length in three stages: it grows linearly in the beginning, then turns to a logarithmic form, and eventually saturates. All these previously unreported phenomena result from the combination of the rich-get-richer mechanism and the limited dictionary sizes, which is verified by a theoretical model.

Results
Experiments. We first show some statistical regularities in Chinese, Japanese and Korean literature. These are representative languages with very limited dictionary sizes if we look at the character level. Only around 4000 characters are frequently used in Chinese texts (4808, 4759 and 3500 frequently used characters are identified in Taiwan, Hong Kong, and mainland China, respectively), and the numbers of Japanese and Korean characters are even smaller. Note that a Korean character is in fact a single syllable consisting of 2-4 letters. We use the term character for convenience and consistency, but one should be aware that Korean characters are totally different from Chinese characters: the former are phonographic while the latter are ideographic.
We start with four famous books: the first two are in Chinese, the third is in Japanese and the last is in Korean (see data description in Methods and Materials). Figure 1 reports the character frequency distribution $p(k)$, the Zipf's plot of character frequency $Z(r)$ and the growth of the number of distinct characters $N_t$ versus the text length $t$. As shown in figure 1, the character frequency distributions are power laws, while the frequency decays exponentially in the Zipf's plot. This conflicts with the common belief that a power-law probability density function always corresponds to a power-law decay in the Zipf's plot. Indeed, for a power-law probability density distribution $p(k) \sim k^{-\beta}$, a power-law decay can usually be observed in the corresponding Zipf's plot, say $Z(r) \sim r^{-\alpha}$. The two exponents are related by $\alpha = 1/(\beta - 1)$ [33], so when $\beta$ gets close to 1 the exponent $\alpha$ diverges. In such a case, the corresponding Zipf's distribution can no longer be called a power law; in principle, it can be exponential or take other forms. Therefore, if we observe a non-power-law decay in the Zipf's plot, we cannot immediately deduce that the corresponding probability density function is not a power law: it is possibly a power law with exponent close to 1.

Figure 1 (caption fragment): ... and (d1-d4) are for the books The Battle Wizard, Into the White Night and The History of the Three Kingdoms, respectively. The power-law exponent $\beta$ is obtained by maximum likelihood estimation [36,37], while the exponent in the Zipf's plot is obtained by the least-squares method excluding the head (the majority of characters in the head play a role similar to that of auxiliary words, conjunctions or prepositions in English). We fit the data with $r > 500$ for Chinese books and $r > 200$ for Japanese and Korean books.

Figure 1 also indicates that the growth of distinct characters cannot be described by Heaps' law. Indeed, there are two distinguishable stages: in the early stage, $N_t$ grows approximately linearly with the text length $t$, and in the later stage, $N_t$ grows logarithmically with $t$. Figure 2 presents the growth of distinct characters for a large collection of 57755 Chinese books consisting of about $3.4 \times 10^{9}$ characters and 12800 distinct characters. In addition to the behavior observed in figure 1, $N_t$ strongly saturates when the text length $t$ is much larger than the dictionary size. In summary, the experiments on Chinese, Japanese and Korean literature reveal some novel phenomena: the character frequency obeys a power law with exponent close to 1 while it decays exponentially in the Zipf's plot, and the number of distinct characters grows in three distinguishable stages (figure 2 also shows the crossover between linear growth and logarithmic growth).
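The three empirical quantities discussed above, $p(k)$, $Z(r)$ and $N_t$, can all be read off a text with a few lines of code. The following is a minimal sketch, assuming the text is available as a Python string in which every element counts as one character; any preprocessing (removing punctuation, whitespace or non-native symbols) is not specified here and is left to the reader. The function names are illustrative and do not come from the original study.

```python
from collections import Counter


def distinct_growth(text):
    """Return [N_1, ..., N_T]: number of distinct characters in the first t positions."""
    seen = set()
    growth = []
    for ch in text:
        seen.add(ch)
        growth.append(len(seen))
    return growth


def zipf_ranking(text):
    """Return frequencies sorted in descending order, so that Z(r) = result[r - 1]."""
    return sorted(Counter(text).values(), reverse=True)


def frequency_distribution(text):
    """Return p(k): fraction of distinct characters that appear exactly k times."""
    count_of_counts = Counter(Counter(text).values())
    n_distinct = sum(count_of_counts.values())
    return {k: n / n_distinct for k, n in sorted(count_of_counts.items())}


if __name__ == "__main__":
    sample = "abracadabra"  # toy input; a real analysis would load a full book
    print(distinct_growth(sample))
    print(zipf_ranking(sample))
    print(frequency_distribution(sample))
```

Plotting `distinct_growth` on linear and on semi-logarithmic axes is one way to look for the linear and logarithmic stages, and plotting `zipf_ranking` on semi-logarithmic axes is one way to check for an exponential decay.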
Model. Text generation has usually been described as a rich-get-richer process, as in the aforementioned Yule-Simon model [24]. Before establishing the model, we first test whether the rich-get-richer mechanism works for the writing process. We denote by $w(k)$ the average probability that a character that has appeared $k$ times will appear again (see Methods and Materials for how to measure $w(k)$). As shown in figure 3, $w(k) \sim k^{c}$ for all four books with $c \approx 1$, indicating a linear rich-get-richer effect like the preferential attachment in evolving scale-free networks [38].
In the model, we consider a language with a finite dictionary of $V$ distinct characters. At each time step, one character in the dictionary is selected to form the text. Motivated by the rich-get-richer mechanism, at time step $t+1$, if character $i$ has been used $k_i$ times, it is selected with a probability that grows linearly with $k_i$ (according to the approximately linear relation between $w(k)$ and $k$), as

$$\frac{k_i + \varepsilon}{\sum_{j=1}^{V} (k_j + \varepsilon)} = \frac{k_i + \varepsilon}{V\varepsilon + t}, \tag{1}$$

where $\varepsilon$ is the initial attractiveness of each character ($\varepsilon > 0$ ensures that every character has a chance to be selected). This growing dynamics can be solved analytically as (see Methods and Materials)

$$N_t = V\left[1 - \left(\frac{V\varepsilon}{V\varepsilon + t}\right)^{\varepsilon}\right], \tag{2}$$

which embodies three stages of growth of $N_t$: (i) In the very early stage, when $t$ is much smaller than $V\varepsilon$, $\left(\frac{V\varepsilon}{V\varepsilon + t}\right)^{\varepsilon} \approx 1 - \frac{t}{V}$ and thus $N_t \approx t$, corresponding to a short period of linear growth. (ii) When $t$ is of the same order as $V\varepsilon$, if $\varepsilon$ is very small, $N_t$ can be much smaller than $V$. Expanding $\left(\frac{V\varepsilon}{V\varepsilon + t}\right)^{\varepsilon}$ in a Taylor series as

$$\left(\frac{V\varepsilon}{V\varepsilon + t}\right)^{\varepsilon} = \sum_{m=0}^{\infty} \frac{1}{m!}\left[\varepsilon \ln\frac{V\varepsilon}{V\varepsilon + t}\right]^{m}, \tag{3}$$

and neglecting the high-order terms ($m \geq 2$) under the condition $\varepsilon \ll 1$, one obtains the logarithmic solution

$$N_t \approx V\varepsilon \ln\frac{V\varepsilon + t}{V\varepsilon}. \tag{4}$$

As indicated in figure 2, there is a crossover between the first two stages. (iii) When $t$ gets larger and larger, $N_t$ approaches $V$ and thus both $\frac{V\varepsilon}{V\varepsilon + t}$ and $1 - \frac{N_t}{V}$ are very small, leading to a very slow growth of $N_t$ according to Eq. 7 (see Methods and Materials). These three stages predicted by the analytical solution are in good accordance with the above empirical observations (see figure 2). Figure 4 reports the numerical results on Eq. 2. In agreement with the analysis, when $t$ is small, $N_t$ grows linearly, as shown in Fig. 4(a) and 4(c), while in Fig. 4(b) and 4(d) the linear part in the middle region indicates the logarithmic growth predicted by Eq. 4. According to the master equation (see Methods and Materials), the character frequency distribution can be solved analytically as

$$p(k) = B\,(k + \varepsilon)^{-\left(1 + \varepsilon V/N_t - \varepsilon\right)}, \tag{5}$$

where $B$ is the normalization factor.
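The three growth stages can be checked numerically from the closed form obtained by integrating the growth rate in Methods and Materials, $N_t = V[1 - (V\varepsilon/(V\varepsilon+t))^{\varepsilon}]$, together with its first-order logarithmic approximation $N_t \approx V\varepsilon\ln[(V\varepsilon+t)/(V\varepsilon)]$. The sketch below is illustrative only: the function names and the parameter values ($V = 4000$ and the chosen $\varepsilon$) are assumptions made for the example, not values fitted in the study.

```python
import math


def distinct_characters(t, V, eps):
    """Closed-form N_t = V * [1 - (V*eps / (V*eps + t))**eps]; requires eps > 0."""
    return V * (1.0 - (V * eps / (V * eps + t)) ** eps)


def log_approximation(t, V, eps):
    """First-order expansion N_t ~ V*eps*ln((V*eps + t) / (V*eps)), for eps << 1."""
    return V * eps * math.log((V * eps + t) / (V * eps))


if __name__ == "__main__":
    V = 4000  # illustrative dictionary size (assumption)

    # Stage (i): t << V*eps, so N_t stays close to t.
    print("linear stage:", distinct_characters(4, V, 0.1))

    # Stage (ii): t comparable to or larger than V*eps with small eps.
    print("log stage, exact :", distinct_characters(400, V, 0.01))
    print("log stage, approx:", log_approximation(400, V, 0.01))

    # Stage (iii): very long texts, N_t approaches V from below.
    print("saturation:", distinct_characters(10**9, V, 1.0))
```

Note that for small $\varepsilon$ the approach to $V$ is very slow, since the remaining gap $V - N_t$ decays only as $t^{-\varepsilon}$; the saturation example therefore uses a larger $\varepsilon$ purely to make the effect visible.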
The result shows that the character frequency follows a power-law distribution with an exponent that varies in time. Considering the finite dictionary size, in the limit of large $t$, $N_t \to V$ and thus the power-law exponent, $\beta = 1 + \varepsilon V/N_t - \varepsilon$, approaches one. The corresponding frequency-rank relation in the Zipf's plot is (see Methods and Materials)

$$Z(r) = (k_{\min} + \varepsilon)\exp\left[\frac{N_t + 1 - r}{B N_t}\right] - \varepsilon, \tag{6}$$

where $k_{\min} = Z(N_t)$ is the smallest frequency. In short, this simple model accounting for the finite dictionary size results in a power-law character frequency distribution $p(k) \sim k^{-\beta}$ with exponent $\beta$ close to 1 and an exponential decay of $Z(r)$, in agreement with the empirical observations on Chinese, Japanese and Korean books.

Figure 5 reports the simulation results. The power-law frequency distribution, the exponential decay of frequency in the Zipf's plot and the linear-to-logarithmic transition in the growth of the number of distinct characters are all observed and are in good accordance with the analytical solutions. Figure 6 directly compares the analytical predictions with the real data. They agree with each other quantitatively. In comparison, predictions from known models are qualitatively different from the present observations. For example, in the Yule-Simon model [24], the predicted power-law exponent is larger than 2 and the number of distinct characters grows linearly with the text size. In the Yule-Simon model with memory [28], the growth of distinct words follows a linear process and the word frequency distribution is not a power law. Dorogovtsev and Mendes [26] proposed a word-web-based model, where the growth of distinct words follows Heaps' law with exponent 0.5 and the power-law exponent can be either 1.5 or 3, also far from the current results.
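A direct simulation of the model takes only a few lines. The sketch below is one possible implementation, not the code used for figure 5: it relies on the decomposition described in Methods and Materials, in which a step either draws uniformly from the dictionary (probability $V\varepsilon/(V\varepsilon+t)$) or copies a uniformly random position of the text written so far, which selects character $i$ with probability proportional to $k_i$. The function names, the seed and the parameter values are assumptions chosen for illustration.

```python
import random


def simulate_text(V, eps, steps, seed=0):
    """Simulate the finite-dictionary rich-get-richer model (requires eps > 0).

    Returns (counts, growth), where counts[i] is the number of uses of
    character i and growth[t - 1] is the number of distinct characters
    after t steps.
    """
    rng = random.Random(seed)
    counts = [0] * V
    text = []      # indices of the characters written so far
    growth = []
    n_distinct = 0
    for t in range(steps):
        if rng.random() < V * eps / (V * eps + t):
            # Uniform draw from the dictionary (always taken at t = 0).
            i = rng.randrange(V)
        else:
            # Copy a random earlier position: probability proportional to k_i.
            i = text[rng.randrange(t)]
        if counts[i] == 0:
            n_distinct += 1
        counts[i] += 1
        text.append(i)
        growth.append(n_distinct)
    return counts, growth


def analytic_distinct(t, V, eps):
    """Closed-form N_t = V * [1 - (V*eps / (V*eps + t))**eps] for comparison."""
    return V * (1.0 - (V * eps / (V * eps + t)) ** eps)


if __name__ == "__main__":
    V, eps, steps = 1000, 0.5, 20000  # illustrative values (assumption)
    counts, growth = simulate_text(V, eps, steps)
    print("simulated N_t:", growth[-1])
    print("analytic  N_t:", analytic_distinct(steps, V, eps))
    # Zipf's plot data: frequencies of the used characters in descending order.
    zipf = sorted((c for c in counts if c > 0), reverse=True)
    print("top frequencies:", zipf[:5])
```

Storing the whole text makes the rich-get-richer draw a single list lookup, at the cost of memory linear in the text length; for very long runs a cumulative-count structure would be a reasonable alternative.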

Discussion
Previous statistical analyses of human languages overwhelmingly concentrated on the Indo-European family, where each language consists of a huge number of words. In contrast, languages consisting of characters, though used by more than a billion people, have received less attention. These languages include Chinese, Japanese, Korean, Vietnamese, and the Jurchen, Khitan, Makhi and Tangut languages, among many others. The two kinds of languages differ significantly in many aspects. Take English and Chinese as examples. Firstly, the number of words in English is more than 100 times larger than the number of characters in Chinese. Secondly, no dictionary contains all possible words in English. Basically, everyone can create new words. New words may result from new techniques, new biological species, or new names. Old words connected by a hyphen are also counted as new words. In contrast, we generally cannot give birth to a new Chinese character. Therefore, for English text, absolute saturation cannot appear, since it is very possible to find new words even after examining a large collection of English literature. Thirdly, the number of words in English grows very quickly. The Encyclopedia Americana (Volume 10, Grolier, 1999) states: "The vocabulary has grown from 50000 to 60000 words in Old English to the tremendous number of entries, 650000 to 750000, in an unabridged dictionary of today." In December 2010 a joint Harvard/Google study found the language to contain 1022000 words and to expand at the rate of 8500 words per year. In contrast, the number of characters in Chinese decreased from 47035 characters in 1716 (the 42-volume Chinese dictionary compiled during the reign of Emperor Kang Xi in the Qing Dynasty) to about 8000 characters in 1953 according to the New Chinese Dictionary. Therefore, we do not expect to see the saturation of distinct English words in the future either.
The above-mentioned distinctions lead to remarkably different statistical regularities between character-formed languages and word-formed languages. Newly reported features for character-formed languages include an exponential decay of character frequency in the Zipf's plot associated with a power-law frequency distribution with exponent close to 1, and a multi-stage growth of the number of distinct characters. These findings not only complement our understanding of scaling laws in human languages, but also refine the knowledge about the relationship between the power law and Zipf's law, as well as the applicability of Heaps' law. As a result, we should be careful when applying the Zipf's plot to a power-law distribution with exponent around 1, such as the cluster size distribution in two-dimensional self-organized critical systems [39], the inter-event time distribution in human activities [40], the family name distribution [41], the species lifetime distribution [42], and so on. Meanwhile, we cannot rule out a power-law distribution merely because of a non-power-law decay in the Zipf's plot [32].
The currently reported regularities, deviating from the well-known Zipf's and Heaps' laws, can be reproduced by considering a finite dictionary size in a rich-get-richer process. Unlike the well-known finite-size effects that vanish in the thermodynamic limit, the effects caused by finite dictionary size get stronger as the system size increases. Finite choices must be a general condition in selecting dynamics, but not a necessary ingredient in growing systems. For example, although also based on the rich-get-richer mechanism, neither the linear growing model [38] nor the accelerated growing model [43] (treating total degree as the text length and nodes as distinct characters, the accelerated networks grow in the Heaps' manner [33]) has considered such an ingredient. The present model can distinguish the selecting dynamics from general dynamics for growing systems.

Table 1. In addition, we investigate a corpus of 57755 Chinese books consisting of about $3.4 \times 10^{9}$ characters and 12800 distinct characters.

Methods
Measuring the strength of rich-get-richer mechanism. Similar to the method for measuring the preferential attachment in evolving networks [44], we divide each book under investigation into two parts: Part I contains the fraction $r$ of characters that appear first, and Part II contains the remaining fraction $1 - r$ of characters. For each character $i$ in Part II, if $i$ did not appear in Part I, nothing happens, while if $i$ appeared $k$ times in Part I, we add one to $w'(k)$, whose initial value is zero. Note that $i$ may appear more than once in Part II and thus contribute more than one to $w'(k)$. Accordingly, $w'(k)$ is the number of character occurrences in Part II whose characters appeared exactly $k$ times in Part I. Dividing $w'(k)$ by the number of distinct characters that appeared $k$ times in Part I, we obtain $w(k)$. We have checked that the results are not sensitive to $r$ unless $r$ is too small or too large; therefore we only show the results for $r = 0.5$.
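The measurement procedure above translates almost line by line into code. The following is a minimal sketch, assuming the book is given as a Python string of characters; the function name and the toy input are illustrative, and the split point is taken as the integer part of $r$ times the text length, which is an assumption about how the fraction is rounded.

```python
from collections import Counter, defaultdict


def rich_get_richer_strength(text, r=0.5):
    """Return {k: w(k)} measured by splitting the text at fraction r.

    w'(k) counts occurrences in Part II of characters that appeared exactly
    k times in Part I; w(k) divides w'(k) by the number of distinct
    characters that appeared k times in Part I.
    """
    split = int(len(text) * r)
    part1, part2 = text[:split], text[split:]

    k_in_part1 = Counter(part1)                    # k for every character of Part I
    chars_with_k = Counter(k_in_part1.values())    # distinct characters per k

    w_prime = defaultdict(int)
    for ch in part2:
        k = k_in_part1.get(ch, 0)
        if k > 0:                                  # characters new in Part II are ignored
            w_prime[k] += 1

    return {k: w_prime[k] / chars_with_k[k] for k in sorted(chars_with_k)}


if __name__ == "__main__":
    # Toy input: Part I is "aaabc" and Part II is "aabbc".
    print(rich_get_richer_strength("aaabcaabbc", r=0.5))
```

Fitting a straight line to $\ln w(k)$ versus $\ln k$ would then give an estimate of the exponent $c$; the fitting range and method are left open here.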
Growing dynamics of distinct characters. Assume that at time $t$ there are $N_t$ distinct characters in the text. The selection at time $t+1$ can be equivalently divided into two complementary and mutually exclusive actions: (i) to select a character from the dictionary with probability proportional to $\varepsilon$, or (ii) to select a character from the $N_t$ characters in the created text with probability proportional to the number of times it has appeared.
Therefore the probability of choosing a character from the dictionary is $\frac{V\varepsilon}{V\varepsilon + t}$, whereas the probability of choosing from the created text is $\frac{t}{V\varepsilon + t}$. A character chosen from the created text is always old, while a character chosen from the dictionary is new with probability $1 - \frac{N_t}{V}$. Accordingly, the probability that a new character appears at time step $t+1$, namely the growth rate of $N_t$, is

$$\frac{dN_t}{dt} = \frac{V\varepsilon}{V\varepsilon + t}\left(1 - \frac{N_t}{V}\right). \tag{7}$$

With the boundary condition $N_0 = 0$, one arrives at the solution Eq. 2.
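For completeness, the integration step can be spelled out. The short derivation below is a sketch that uses only the growth rate $\frac{dN_t}{dt} = \frac{V\varepsilon}{V\varepsilon+t}\left(1-\frac{N_t}{V}\right)$ and the boundary condition $N_0 = 0$; the auxiliary variable $M_t$ (the number of characters not yet used) is introduced here for convenience and does not appear in the original text.

```latex
% Auxiliary variable (assumption for this derivation): M_t = V - N_t.
\begin{align}
  \frac{dM_t}{dt} &= -\frac{V\varepsilon}{V\varepsilon+t}\,\frac{M_t}{V}
                   = -\frac{\varepsilon\,M_t}{V\varepsilon+t}, \\
  \int_{M_0}^{M_t}\frac{dM}{M} &= -\varepsilon\int_{0}^{t}\frac{dt'}{V\varepsilon+t'}
  \;\Longrightarrow\;
  \ln\frac{M_t}{M_0} = -\varepsilon\,\ln\frac{V\varepsilon+t}{V\varepsilon}, \\
  M_0 = V \;&\Longrightarrow\;
  N_t = V - M_t = V\left[1-\left(\frac{V\varepsilon}{V\varepsilon+t}\right)^{\varepsilon}\right].
\end{align}
```

The last line shows that the unused part of the dictionary decays as a power law in $V\varepsilon + t$ with exponent $\varepsilon$, which is why saturation is slow when $\varepsilon$ is small.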
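Character frequency distribution. Denote by $n(t,k)$ the number of distinct characters that have appeared $k$ times until time $t$; then $n(t,k) = N_t\,p(k)$. According to the selection probability of Eq. 1, the master equation reads

$$n(t+1,k+1) = n(t,k+1)\left[1 - \frac{k+1+\varepsilon}{V\varepsilon+t}\right] + n(t,k)\,\frac{k+\varepsilon}{V\varepsilon+t}.$$

Via continuous approximation, it turns into the following differential equation

$$\frac{dp}{p} = -\left[1 + \frac{V\varepsilon+t}{N_t}\left(N_{t+1}-N_t\right)\right]\frac{dk}{k+\varepsilon}. \tag{10}$$

Substituting $N_{t+1} - N_t = dN_t/dt$ and Eq. 7 into Eq. 10, we get the solution

$$p(k) = B\,(k+\varepsilon)^{-\beta}, \qquad \beta = 1 + \frac{\varepsilon V}{N_t} - \varepsilon, \tag{11}$$

where $B$ is the normalization factor. Under the continuous approximation, the cumulative distribution of character frequency can be written as

$$P(k) = \int_{k_{\min}}^{k} p(k')\,dk' = \frac{B}{1-\beta}\left[(k+\varepsilon)^{1-\beta} - (k_{\min}+\varepsilon)^{1-\beta}\right], \tag{12}$$

where $k_{\min}$ is the smallest frequency. When $\beta \to 1$, $k^{1-\beta} \approx 1 + (1-\beta)\ln k$, and thus

$$P(k) = B\,\ln\frac{k+\varepsilon}{k_{\min}+\varepsilon}, \tag{13}$$

where $B$ is obtained from the normalization condition $\int_{k_{\min}}^{k_{\max}} p(k)\,dk = 1$ and $k_{\max}$ is the highest frequency. According to Eq. 13, there are $\left[1 - B\ln\frac{k+\varepsilon}{k_{\min}+\varepsilon}\right]N_t$ characters that have appeared more than $k$ times. That is to say, a character that has appeared $k$ times is ranked at $r = 1 + \left[1 - B\ln\frac{k+\varepsilon}{k_{\min}+\varepsilon}\right]N_t$. Therefore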