Abstract
The prevailing maximum likelihood estimators for inferring power law models from rank-frequency data are biased. The source of this bias is an inappropriate likelihood function. The correct likelihood function is derived and shown to be computationally intractable. A more computationally efficient method of approximate Bayesian computation (ABC) is explored. This method is shown to have less bias for data generated from idealised rank-frequency Zipfian distributions. However, the existing estimators and the ABC estimator described here assume that words are drawn from a simple probability distribution, while language is a much more complex process. We show that this false assumption leads to continued biases when applying any of these methods to natural language to estimate Zipf exponents. We recommend that researchers be aware of the bias when investigating power laws in rank-frequency data.
If we take a book and rank each word based on how many times it appears, we will find that the number of occurrences of each word is approximately inversely proportional to its rank^{1}. The second most frequent word will appear approximately \(\frac{1}{2}\) as often as the most frequent word, the third around \(\frac{1}{3}\) as frequently. This describes a power law relationship between the frequency of a word, n, and the word’s rank in terms of its frequency, \(r_e\), with exponent \(\gamma \approx 1\)^{2}.
This is known as Zipf’s law and is consistent, in a general sense, across human communication^{3,4}. We do not yet have a satisfactory explanation for it^{2}, and the exponent, \(\gamma\), is not always 1 but varies between different speakers^{3} and texts^{3,5}. Sound analytical tools are needed to investigate these research areas.
Equation (1) describes an observed empirical relationship. It is tempting to assume that this is equivalent to a probability distribution for words (an early example is of Shannon using Zipf’s law to estimate the entropy of English^{6}). Indeed, Zipf’s law is often expressed as a relationship between a word’s probability of occurrence^{7,8} and the word’s rank in the probability distribution, \(r_p\).
The conflation of Eqs. (1) and (2) causes the prevailing maximum likelihood estimators to miscalculate \(\lambda\) in Eq. (2) with a positive bias^{2,9,10} (Fig. 1). This bias applies specifically to rank-frequency distributions, where the ranks of events are not known a priori and instead are extracted from the frequency distribution, as is the case with word frequencies. The root of the bias is that the existing estimators assume that the observed empirical frequency rankings of data [\(r_e\) in Eq. (1)] are equivalent to rankings in an underlying probability distribution [\(r_p\) in Eq. (2)]^{2}. The nth most frequent word is assumed to be the nth most likely word, which is not necessarily the case^{2}. This is often overlooked in the literature^{2}.
In the 2000s there was a series of papers^{8,11,12} describing a method of maximum likelihood estimation that gave more accurate (lower bias) estimates for power law exponents than graphical methods^{8}. The most influential of these is Clauset et al.’s paper^{8}. The estimators had been derived and presented before^{11} (as early as 1952 in the discrete case^{13}), but Clauset et al.’s paper popularised the idea and provided a clear methodology, including techniques to perform goodness of fit tests^{8}. In all of these papers, the derivation of the likelihood function assumes that there is some a priori ordering on an independent variable. This works very well for power laws with some natural way to order events, such as the size vs frequency of earthquakes^{8}. However, it does not work so well with rank-frequency distributions, where the rank is extracted empirically from the frequency distribution, so that the empirical rank and frequency are correlated variables^{2}, both dependent on the same underlying mechanism. This difference was not addressed by Clauset et al., who include examples of applying their estimator to Zipf’s law in language^{8}. The same data can look very different depending on whether we know its true rank or not, as shown in Fig. 2.
Recently, Clauset et al.’s estimator has been shown, empirically, to be biased for some rank-frequency distributions^{2,9,10}. In particular, Clauset et al.’s method overestimates exponents for rank-frequency data generated from known power law probability distributions with exponents below about 1.5^{10} (Fig. 1). The problem is related to low sampling in the tail^{9,10}, so that the observed empirical ranks tend to “bunch up” above the line of the true probability distribution before decaying sharply at the end of the observed tail (Fig. 2). To our knowledge this bias has not been adequately explained or solved.
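The overestimation is easy to reproduce in a few lines. In the sketch below (an illustration of ours; `fit_naive` is a simplified stand-in for the standard discrete maximum likelihood estimator, not Clauset et al.’s reference implementation), rank-frequency data is drawn from a known Zipf distribution and fitted by the naive estimator, which treats empirical ranks as probability ranks:

```python
# Illustrative sketch: fit a discrete power law to rank-frequency data with the
# naive MLE that treats empirical ranks as probability ranks.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import zeta

def rank_frequency(samples):
    """Counts of each observed event, sorted from most to least frequent."""
    _, counts = np.unique(samples, return_counts=True)
    return np.sort(counts)[::-1]

def fit_naive(counts):
    """MLE assuming the r-th most frequent event has probability r^-lam / zeta(lam)."""
    ranks = np.arange(1, len(counts) + 1)
    n_total = counts.sum()
    def neg_log_lik(lam):
        return lam * np.sum(counts * np.log(ranks)) + n_total * np.log(zeta(lam))
    return minimize_scalar(neg_log_lik, bounds=(1.01, 5.0), method="bounded").x

rng = np.random.default_rng(0)
true_lambda = 1.2
counts = rank_frequency(rng.zipf(true_lambda, size=10_000))
print(fit_naive(counts))  # typically lands above true_lambda: the positive bias
```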

In 2014 Piantadosi et al.^{2} explained the problem and suggested splitting a corpus, calculating ranks of words from one part of the split and frequencies from the other, breaking the correlation of errors. However, the method does not take into account uncorrelated errors in the ranks. In particular, the empirical ranks of events in the tail will almost certainly be lower than the actual ranks in the probability distribution, as many events in the tail will not be observed at all.

Hanel et al.^{10} identified the problem and suggested using a finite set of events instead of Clauset et al.’s unbounded event set^{8}. This gives more accurate results in the limited case that the number of possible events, W, is finite and known^{10}. Often W is not known and the choice of W can substantially change the results. With Zipf’s law in language, W represents the writer’s vocabulary and is usually modelled as unbounded^{2,8,12}. This seems appropriate given that Heaps’ Law suggests that the number of unique words in a document continues to rise indefinitely as the document length increases^{14}.

In 2019 Corral et al.^{9} examined the problem and explored a technique of transforming the data to a distribution-of-frequencies representation, f(n), which is also a power-law-type distribution that they call Zipf’s law for sizes^{9}. This distribution has an a priori known independent variable of frequency sizes, so the bias does not apply to this representation. However, there is still difficulty in estimating the rank-frequency exponent, as a power law in the rank-frequency distribution, \(n(r_e)\), will only approximately map to a power law in the distribution of frequencies, f(n), for real-world sample sizes^{9}.
Overall these ad hoc methods can remove the bias to some extent but not completely. The methods also introduce a host of somewhat arbitrary choices for the researcher to resolve.
We derive a new maximum likelihood estimator that does not make the false assumption that the empirical ranks, \(r_e\), are equivalent to the probability ranks, \(r_p\). The new estimator considers all the possible ways that the events could be ranked in the underlying probability distribution to generate the observed empirical data. Unfortunately this new likelihood function is computationally intractable for all but the smallest data sets. In order to estimate parameters for larger data sets, we turn to approximate Bayesian computation (ABC), a method that is designed for situations where likelihood functions cannot be computed^{15}. We show that this method has much lower bias than Clauset et al.’s estimator for rank-frequency data generated from simple power laws. We further explore two different implementations of ABC and find that they give different results when applied to word distributions in books, because ABC and Clauset et al.’s method both assume an underlying power law probability model, while natural language arises from a more complex model. We suggest that this false assumption means that maximum likelihood estimation with simple models, including ABC and Clauset et al.’s method, will always have some arbitrary bias when studying rank-frequency data in natural language.
Model
Likelihood function: general case with no a priori ordering
A vector of data, \(\varvec{d} = [d_1, d_2, \ldots d_N]\), represents N observations of a random variable X. Each observation is one of a discrete set of W events, with no a priori ordinality. An example is words in a book.
We can transform the vector \(\varvec{d}\) to counts of each event, ordered from most to least frequent, \(\varvec{n} = [n(x_{(1)}), n(x_{(2)}), \ldots , n(x_{(W)})]\). \(\varvec{n}(x_{(r_e)})\) represents the count of the \(r_e\)th most common event, where \(r_e\) is the event’s ranking in the empirical frequency distribution. For ease of notation we will refer to \(\varvec{n}(x_{(r_e)})\) as \(\varvec{n}(r_e)\).
We assume a simple model where each of these events has some unknown fixed probability of being observed, \(p(x_{r_p}) = Pr(X=x_{r_p})\), where \(r_p\) is the event’s rank in the underlying probability distribution.
The key insight is that given an event’s empirical rank, we do not know that event’s rank in the underlying probability distribution. We can describe the mapping of events from the data generating probability ranking to the empirical ranking with a vector \(\varvec{s}\), so that \(\varvec{s}(r_p) = r_e\). For example \(\varvec{s} = [2,1,3]\) would mean that the second most probable event was observed empirically the most number of times, the most probable event was seen the second most number of times, and the third most likely seen third most. For any valid mapping, \(\varvec{s}\) must be a permutation of the integers from 1 to W. Figure 3 shows an example mapping.
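This can be seen in a small simulation (an illustrative sketch of ours, not from the paper): drawing a modest sample from a known five-event distribution and recovering the mapping \(\varvec{s}\) often yields a non-identity permutation.

```python
# Illustrative sketch: empirical frequency ranks need not match probability ranks.
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.3, 0.25, 0.2, 0.15, 0.1])  # events indexed by probability rank 1..5
events = rng.choice(5, size=40, p=p)        # a small sample
counts = np.bincount(events, minlength=5)   # counts indexed by probability rank

# Build s, where s(r_p) = r_e: the empirical rank of the event with probability rank r_p.
order = np.argsort(-counts, kind="stable")  # events ordered by descending count
s = np.empty(5, dtype=int)
s[order] = np.arange(1, 6)
print(s)  # a permutation of 1..5; often not the identity at small sample sizes
```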
We assume that the probability distribution is parameterised by \(\varvec{\theta }\). Considering Bayes’ rule
The likelihood can be written as (ignoring constants of proportionality)
This likelihood equation is in terms of the events’ empirical rank, \(r_e\), whereas the underlying probability model is in terms of probability rank, \(r_p\). To convert the likelihood to be in terms of \(r_p\) we condition on the mapping vector, \(\varvec{s}\).
Using the law of total probability we sum over all possible mappings of probability rankings onto empirical rankings. S(W) is the set of all possible permutations of the numbers 1 to W, known as the symmetric group.
Equation (6) is the likelihood for any data that represents observations of discrete events, where the events have no a priori ordering in relation to the underlying model. The equation generalises to \(W \rightarrow \infty\), suitable to describe models with unbounded event sets, as is the case in many Zipf type models.
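In symbols, the summed likelihood takes the form below (our reconstruction from the definitions above and from the matrix-permanent formulation discussed later; the published equation may differ in constants of proportionality):

```latex
\mathcal{L}(\varvec{\theta}\,;\varvec{n})
  = \Pr(\varvec{n} \mid \varvec{\theta})
  = \sum_{\varvec{s} \in S(W)} \Pr(\varvec{n} \mid \varvec{s}, \varvec{\theta}) \Pr(\varvec{s} \mid \varvec{\theta})
  \propto \sum_{\varvec{s} \in S(W)} \prod_{r_p=1}^{W} p(x_{r_p})^{\varvec{n}(\varvec{s}(r_p))}
```

Each term in the outer sum is the probability of the observed counts under one candidate mapping of probability ranks onto empirical ranks.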
Likelihood function: power laws with no a priori ordering
A common model applied to rank-frequency distributions is the power law, used by Zipf in his study of words^{1}. A power law probability distribution is of the form

\[ p(x_{r_p}) = \frac{r_p^{-\lambda }}{Z_{\lambda }} \]

where \(\lambda\) is the power law exponent and \(Z_{\lambda }\) is a normalising factor. We use the simplest form of Zipf’s law for ease of analysis. The method described here can be used with other models such as the Zipf–Mandelbrot law^{16}. The normalising factor is

\[ Z_{\lambda } = \sum _{r_p=1}^{W} r_p^{-\lambda } \]

where W is the number of possible events. In the limit \(W \rightarrow \infty\), \(Z_{\lambda }\) becomes the Riemann zeta function, \(\zeta (\lambda )\)^{8}.
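This limit can be checked numerically (a quick sketch using SciPy’s zeta function):

```python
# Quick numerical check: the normalising factor approaches zeta(lambda) as W grows.
import numpy as np
from scipy.special import zeta

lam = 1.5
W_values = [10, 1_000, 100_000]
partial_sums = [np.sum(np.arange(1, W + 1.0) ** -lam) for W in W_values]
print(partial_sums, zeta(lam))  # the partial sums climb towards zeta(1.5)
```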
Considering Eq. (6), the likelihood can be written as
And the differential of the likelihood with respect to \(\lambda\) is
\(Z_{\lambda }'\) is the differential of the normalising factor with respect to \(\lambda\). To find the maximum likelihood estimator, we can use numerical methods to either (a) maximise Eq. (9) or (b) find the root of Eq. (10) (Fig. 4).
The prevailing estimators from the literature (often implicitly) assume that the empirical ranks match the probability ranks^{2,8,12}, so that they only consider the leading term in the main sum in both Eqs. (9) and (10) (associated with the identity permutation \(\varvec{s}_I = [1,2,\ldots ,W]\)). This is the source of the bias in the existing estimators.
The number of terms in the likelihood function (Eq. 6) scales as O(W!), so that naive computation of the likelihood is impractical even at \(W \approx 10\). The computation can be shown to be equivalent to the computation of the permanent of a matrix with entries \(a_{ij}=p(x_j)^{\varvec{n}(i)}\). The best known algorithm for exactly computing the permanent of a matrix is Ryser’s algorithm^{17,18} with complexity \(O(W2^W)\). This is computationally intractable for real-world data sets such as text corpora with vocabularies of \(W>1000\). A more in-depth discussion on the computational complexity can be found in the Supplementary Information.
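For very small W the exact likelihood is still computable. The sketch below (our illustration; the function names are not from the paper) evaluates the permanent with Ryser-style inclusion-exclusion, without the Gray-code speed-up of the full algorithm, and maximises the resulting likelihood over a grid for a toy data set:

```python
# Illustrative sketch: exact evaluation of the summed likelihood for tiny W via
# the matrix permanent, using inclusion-exclusion over column subsets.
from itertools import combinations
import numpy as np

def permanent(A):
    """Permanent of a square matrix by inclusion-exclusion over column subsets."""
    n = A.shape[0]
    total = 0.0
    for k in range(1, n + 1):
        for cols in combinations(range(n), k):
            total += (-1) ** k * np.prod(A[:, list(cols)].sum(axis=1))
    return (-1) ** n * total

def likelihood(lam, counts):
    """Likelihood of lam given sorted counts, up to constants, for finite W."""
    W = len(counts)
    p = np.arange(1, W + 1.0) ** -lam
    p /= p.sum()                                   # normalise over the finite event set
    A = p[np.newaxis, :] ** np.asarray(counts)[:, np.newaxis]  # a_ij = p_j^{n_i}
    return permanent(A)

counts = [9, 4, 3, 2, 1]  # toy rank-frequency data, W = 5
lams = np.linspace(0.5, 3.0, 26)
best = lams[np.argmax([likelihood(lam, counts) for lam in lams])]
print(best)
```

The subset enumeration makes the cost visible: each extra event doubles the work, which is why the approach breaks down long before realistic vocabulary sizes.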
Approximate Bayesian computation
Approximate Bayesian computation is a technique for approximating posterior distributions without calculating a likelihood function^{19,20,21}. Instead, we assume a model, \({\mathcal {M}}\), simulate data, \(\varvec{n}_i\), from possible parameters, \(\lambda _i\), and observe how close that simulated data is to the empirical data using a distance measure \(\rho (\varvec{n}_i, \varvec{n}_{obs})\)^{19,21}. The ABC rejection algorithm is based upon the principle that we can approximate the actual posterior by estimating the probability of \(\lambda\) given that the data is within some small tolerance, \(\epsilon\), of the observed empirical data^{19,22}. This assumes that the model, \({\mathcal {M}}\), is a good representation of the actual data generating process.
The ABC rejection algorithm begins by sampling parameter values from the prior. For each of these parameter values, data is then generated from the model and tested on the condition \(\rho (\varvec{n}_i, \varvec{n}_{obs}) < \epsilon\)^{19}. With enough samples, the density of successful parameters will approximate the right hand side of Eq. (12), and hence the posterior distribution^{19}. If we use a uniform prior then this estimate will be proportional to the likelihood.
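A minimal ABC rejection sketch along these lines (our illustration: the uniform prior range, the quantile-based acceptance rule, and comparing sorted count vectors directly with SciPy’s Wasserstein distance are simplifying assumptions, not the exact procedure used in the paper):

```python
# Minimal ABC rejection sketch for the Zipf exponent. Prior range, acceptance
# rule and distance are illustrative choices, not the paper's exact settings.
import numpy as np
from scipy.stats import wasserstein_distance

def rank_frequency(samples):
    _, counts = np.unique(samples, return_counts=True)
    return np.sort(counts)[::-1].astype(float)

def abc_rejection(n_obs, N, n_draws=500, accept_frac=0.1, seed=0):
    """Return approximate posterior samples of the exponent lam."""
    rng = np.random.default_rng(seed)
    lams = rng.uniform(1.05, 2.5, size=n_draws)  # uniform prior over exponents
    dists = np.array([
        wasserstein_distance(rank_frequency(rng.zipf(lam, size=N)), n_obs)
        for lam in lams
    ])
    eps = np.quantile(dists, accept_frac)        # keep only the closest draws
    return lams[dists <= eps]

rng = np.random.default_rng(42)
n_obs = rank_frequency(rng.zipf(1.5, size=5_000))
posterior = abc_rejection(n_obs, N=5_000)
print(posterior.mean())  # approximate posterior mean for the exponent
```

Accepting a fixed fraction of the closest draws, rather than a fixed \(\epsilon\), guarantees a non-empty accepted set; it is one common way to set the tolerance in practice.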
An ideal distance measure, \(\rho (\varvec{n}_i, \varvec{n}_{obs})\), would compare Bayesian sufficient summary statistics of the data^{21}. In practice Bayesian sufficiency usually cannot be achieved^{19,21}, and some information will be lost, so that the approximation of the posterior includes some error^{19}. A common technique is to summarise the data sets with summary statistics, \(\varvec{S}(\varvec{n})\), and define the distance as the difference between those, \(\rho (\varvec{n}_i, \varvec{n}_{obs}) = \varvec{S}(\varvec{n}_i) - \varvec{S}(\varvec{n}_{obs})\)^{15,19,21}. Recently the Wasserstein distance, a metric between distributions, has been shown to work well as a distance measure^{23}. This is a principled approach that avoids the difficult selection of summary statistics^{23}, and it is the measure that we use here.
The ABC rejection algorithm requires a small tolerance in order to find a good estimate for the posterior^{22}. This in turn requires a high density of samples in order to have enough successful parameters to build the posterior approximation. Sampling at a high density across a reasonable parameter space with a uniform prior would be prohibitively computationally expensive. Instead, we use population Monte Carlo to sample from a proposal distribution that focuses on areas of high posterior probability while avoiding areas of negligible probability^{24}. At each time step, the results are weighted using principles from importance sampling to account for the fact that we are sampling from the proposal distribution instead of the prior^{24}. This algorithm, adapted from^{25}, is shown in Algorithm 1 and Fig. 5 (the two-parameter algorithm is equivalent, with the variance replaced by a covariance matrix). The parameters in the algorithm were set by trial and error to balance computation time and accuracy.
We also investigated an alternative approach to approximate Bayesian computation known as ABC regression. Instead of the Wasserstein distance, we used the mean of the log transformed event counts as a summary statistic with this method. Full details are in the Supplementary Information.
ABC results
Approximate Bayesian computation with Zipf distributions
Rank-frequency data was generated (\(N=\)10,000) from an unbounded power law with exponents ranging from 1 to 2. For each generated data set, the exponent was estimated using (a) Clauset et al.’s estimator and (b) ABC-PMC with the Wasserstein distance. This was repeated 100 times to find the mean bias and variance. The ABC method has much lower bias and similar variance to Clauset et al.’s method (Fig. 6).
We also investigated how the bias changes with varying sample size. Rank-frequency data was generated with \(\lambda =1.1\) and sample sizes up to \(N=\)1,000,000. Clauset et al.’s estimator shows positive bias at all values of N, although it decreases with large N. ABC shows much lower bias for all values of N. The variance of ABC is higher for \(N \lessapprox 1000\). Overall the variance is still very low, and is insignificant compared to the positive bias shown by Clauset et al.’s estimator (Fig. 7).
In addition to the results shown here, we explored a variation of the algorithm using ABC rejection with the mean of the logged event counts as a summary statistic. This method achieves similarly low bias and variance. See the Supplementary Information for full details.
Approximate Bayesian computation with Zipf–Mandelbrot model
The Zipf–Mandelbrot law is a modification of Zipf’s law derived by Mandelbrot that accounts for a departure from a strict power law in the head of the rank-frequency distribution^{16}.
We tested the ABC-PMC algorithm with this two-parameter model. The algorithm is of the same form as Algorithm 1, with the variance replaced by a covariance matrix. The algorithm is demonstrated with one generated data set with \(q=\)4, \(\lambda =\)1.2 and \(N=\)100,000. ABC-PMC performs well, with estimates close to the true parameters (see Fig. 8). The approximated likelihood function gives negligible probability for \(q=\)0, suggesting that the algorithm can discriminate between data generated from Zipf’s law and the Zipf–Mandelbrot law.
Analysis of books
Both Clauset et al.’s method and the approximate Bayesian computation method described here assume a Zipfian data generating model. We have demonstrated that ABC-PMC with the Wasserstein distance works well for data generated from a known power law, with much lower bias than Clauset et al.’s method. In the Supplementary Information, we also describe an ABC regression method using the mean log of the word counts that has similarly low bias when applied to data from a power law distribution.
It is reasonable to suggest that natural language is a more complex process than drawing words from a power law probability distribution. Indeed, deep learning language models like GPT-3 use billions of parameters^{26}. As such, estimators that assume a Zipfian data generating model are not necessarily suitable for analysing language. To demonstrate the problem, we analysed books using (a) Clauset et al.’s method, (b) ABC-PMC with the Wasserstein distance and (c) ABC regression with the mean of the log transformed word counts as a summary statistic (Table 1). All of the books were downloaded from Project Gutenberg^{27}. Each text sample was first “cleaned” by removing all punctuation, replacing numbers with a \(\#\) symbol, and converting all text to lowercase. The word frequencies were then counted.
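The cleaning and counting steps above can be sketched as follows (an illustrative sketch; the exact rules in the authors’ repository may differ):

```python
# Illustrative sketch of the text cleaning pipeline described in the text.
import re
from collections import Counter

def clean_and_count(text):
    """Lowercase, replace numbers with '#', strip punctuation, count words."""
    text = text.lower()
    text = re.sub(r"\d+", "#", text)        # numbers -> '#'
    text = re.sub(r"[^\w\s#]", "", text)    # drop punctuation
    counts = Counter(text.split())
    return sorted(counts.values(), reverse=True)  # rank-frequency counts

print(clean_and_count("The cat, the dog and the 3 mice."))  # -> [3, 1, 1, 1, 1, 1]
```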
The two forms of ABC give different results, which bracket the results of the Clauset et al. estimator. This does not imply that the Clauset et al. estimator is the best, as we show above that it is biased upwards. What these results indicate is that there is no correct “ground truth”, because the assumed underlying models are wrong.
Discussion
We have demonstrated that the prevailing Zipf’s law maximum likelihood estimators for rank-frequency data are biased due to an inappropriate likelihood function. This bias is particularly strong in the range relevant to natural language, with exponents close to 1. The correct likelihood function is intractable. We have presented one approach to overcoming this bias using a likelihood-free method of approximate Bayesian computation. The ABC method is shown to work well with data generated from actual power law distributions, with lower bias than Clauset et al.’s estimator.
ABC works well in an idealised situation where the true model is known. However, when applied to analysing books, the two ABC approaches that we explored give very different estimates for the Zipf exponents. The Zipfian approaches we investigate all assume a simple bag-of-words probability model, whereas our results on books indicate that natural language generation is a more complex process; otherwise the two ABC methods would converge. The ABC algorithms search a parameter space for the closest model based on the distance measure. This works well when the parameter space includes the true data generating process. But with natural language the assumed simple Zipf model is wrong, so there is no “correct” location in the parameter space (or the “correct” location is outside the parameter space). Different distance measures will privilege different aspects of the observed data and so arrive at different estimates. This bias is arbitrary in nature and there seems to be no reasonable way to decide which distance measure is “correct”. The error lies in the assumption of an incorrect data generating model. This problem applies to ABC and Clauset et al.’s estimator, and seems to be inherent in applying maximum likelihood estimation with simple models to describe power laws in natural language.
Zipf’s law for word types is an empirical relationship between frequencies of words and ranks in that frequency distribution. The difficulty arises when a probabilistic model is used to describe the mechanism that is generating this relationship, when the actual mechanism is more complex. The main aim of this publication is to clearly show that Clauset et al.’s estimator is strongly biased for rank-frequency data. The correct likelihood function provides an unbiased framework that works well when the underlying data generating process is known. This does not appear to be the case for natural language. Graphical methods may therefore be more suitable to study Zipf’s law when investigating the empirical relationship between ranks and frequencies (Eq. 1) and not the probability distribution (Eq. 2). All Zipf estimators have some bias and the best choice will depend on the specific application.
The scripts and data used here are available at the repository https://github.com/chasmani/PUBLIC_bias_in_zipfs_law_estimators. That repository includes the approximate Bayesian computation algorithm as well as implementations of other estimators from the literature.
References
Zipf, G. K. Human Behavior and the Principle of Least Effort (Addison-Wesley Press, 1949).
Piantadosi, S. T. Zipf’s word frequency law in natural language: A critical review and future directions. Psychon. Bull. Rev. 21, 1112–1130. https://doi.org/10.3758/s13423-014-0585-6 (2014).
Ferrer i Cancho, R. The variation of Zipf’s law in human language. Eur. Phys. J. B 44, 249–257. https://doi.org/10.1140/epjb/e2005-00121-8 (2005).
Moreno-Sánchez, I., Font-Clos, F. & Corral, Á. Large-scale analysis of Zipf’s law in English texts. PLoS ONE 11, e0147073. https://doi.org/10.1371/journal.pone.0147073 (2016).
Montemurro, M. A. & Zanette, D. H. New perspectives on Zipf’s law in linguistics: From single texts to large corpora. Glottometrics 4, 87–99 (2002).
Shannon, C. E. Prediction and entropy of printed English. Bell Syst. Tech. J. 30, 50–64 (1951).
Newman, M. E. Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46, 323–351 (2005).
Clauset, A., Shalizi, C. R. & Newman, M. E. Power-law distributions in empirical data. SIAM Rev. 51, 661–703 (2009).
Corral, A., Serra, I. & Ferrer i Cancho, R. The distinct flavors of Zipf’s law in the rank-size and in the size-distribution representations, and its maximum-likelihood fitting. arXiv preprint arXiv:1908.01398 (2019).
Hanel, R., Corominas-Murtra, B., Liu, B. & Thurner, S. Fitting power-laws in empirical data with estimators that work for all exponents. PLoS ONE 12, e0170920. https://doi.org/10.1371/journal.pone.0170920 (2017).
Goldstein, M. L., Morris, S. A. & Yen, G. G. Problems with fitting to the power-law distribution. Eur. Phys. J. B https://doi.org/10.1140/epjb/e2004-00316-5 (2004).
Bauke, H. Parameter estimation for power-law distributions by maximum likelihood methods. Eur. Phys. J. B 58, 167–173. https://doi.org/10.1140/epjb/e2007-00219-y (2007).
Seal, H. The maximum likelihood fitting of the discrete Pareto law. J. Inst. Actuar. 78, 115–121 (1952).
Heaps, H. S. Information Retrieval, Computational and Theoretical Aspects (Academic Press, 1978).
Beaumont, M. A. Approximate Bayesian computation in evolution and ecology. Annu. Rev. Ecol. Evol. Syst. 41, 379–406. https://doi.org/10.1146/annurev-ecolsys-102209-144621 (2010).
Mandelbrot, B. An informational theory of the statistical structure of language. Commun. Theory 84, 486–502 (1953).
Ryser, H. J. Combinatorial Mathematics, vol. 14 (American Mathematical Soc., 1963).
Glynn, D. G. The permanent of a square matrix. Eur. J. Comb. 31, 1887–1891. https://doi.org/10.1016/j.ejc.2010.01.010 (2010).
Sunnåker, M. et al. Approximate Bayesian computation. PLoS Comput. Biol. 9, e1002803. https://doi.org/10.1371/journal.pcbi.1002803 (2013).
Beaumont, M. A., Zhang, W. & Balding, D. J. Approximate Bayesian computation in population genetics. Genetics 162, 2025–2035 (2002).
Csilléry, K., Blum, M. G. B., Gaggiotti, O. E. & François, O. Approximate Bayesian computation (ABC) in practice. Trends Ecol. Evol. 25, 410–418. https://doi.org/10.1016/j.tree.2010.04.001 (2010).
Sisson, S. A., Fan, Y. & Tanaka, M. M. Sequential Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. U.S.A. 104, 1760–1765. https://doi.org/10.1073/pnas.0607208104 (2007).
Bernton, E., Jacob, P. E., Gerber, M. & Robert, C. P. Approximate Bayesian computation with the Wasserstein distance. arXiv preprint arXiv:1905.03747 (2019).
Cappé, O., Guillin, A., Marin, J. M. & Robert, C. P. Population Monte Carlo. J. Comput. Graph. Stat. 13, 907–929. https://doi.org/10.1198/106186004X12803 (2004).
Beaumont, M. A., Cornuet, J.M., Marin, J.M. & Robert, C. P. Adaptive approximate Bayesian computation. Biometrika 96, 983–990. https://doi.org/10.1093/biomet/asp052 (2009).
Brown, T. B. et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
Project Gutenberg (2020). [Online; accessed 16. Jul. 2020].
Acknowledgements
The study was funded by the EPSRC grant for the Mathematics for RealWorld Systems CDT at Warwick (Grant No. EP/L015374/1). T.T.H. was supported on this work by the Royal Society Wolfson Research Merit Award (WM160074) and a Fellowship from the Alan Turing Institute.
Author information
Authors and Affiliations
Contributions
C.P. conceived of the presented idea and carried out the analyses. T.T.H. supervised C.P. and offered guidance, suggestions and support throughout. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Pilgrim, C., Hills, T.T. Bias in Zipf’s law estimators. Sci Rep 11, 17309 (2021). https://doi.org/10.1038/s41598-021-96214-w