Coherent oscillations in word-use data from 1700 to 2008

In written language, the choice of specific words is constrained by both grammatical requirements and the specific semantic context of the message to be transmitted. To a significant degree, the semantic context is in turn affected by a broad cultural and historical environment, which also influences matters of style and manners. Over time, those environmental factors leave an imprint in the statistics of language use, with some words becoming more common and other words being preferred less. Here we characterize the patterns of language use over time based on word statistics extracted from more than 4.5 million books written over a period of 308 years. We find evidence of novel systematic oscillatory patterns in word use with a consistent period narrowly distributed around 14 years. The specific phase relationships between different words show structure at two independent levels: first, there is a weak global phase modulation that is primarily linked to overall shifts in the vocabulary across time; and second, a stronger component dependent on well defined semantic relationships between words. In particular, complex network analysis reveals that semantically related words show strong phase coherence. Ultimately, these previously unknown patterns in the statistics of language may be a consequence of changes in the cultural framework that influences the thematic focus of writers.


Introduction
The application of quantitative methods to the analysis of language has disclosed layers of non-trivial statistical structure ranging from word frequencies (Ferrer-i-Cancho and Solé, 2001a; Montemurro, 2001), to correlations between hundreds or thousands of words (Montemurro and Pury, 2002; Alvarez-Lacalle et al., 2006), and to the complex network organization of the whole lexicon (Ferrer-i-Cancho and Solé, 2001b; Sigman and Cecchi, 2002). Likewise, the statistical structure of large collections of texts has revealed novel quantitative linguistic universals that depend on the ordering of words (Montemurro and Zanette, 2011). Far from being static, language is a dynamic entity showing patterns of change over time that range from the scale of a few years, as in a wave of fashion, to episodes of birth and death of language families that may take place over many thousands of years. The study of these processes suggested that language can be understood as an evolutionary system (Nowak et al., 2002) bearing strong similarities to the mechanisms underlying the evolution of biological species (Pagel, 2009), a parallel originally recognized by Charles Darwin (Darwin, 1871). The application of phylogenetic methods originally devised to characterize the evolution of biological organisms has yielded insight into the transformation of languages at the macro-scale of thousands of years (Gray and Atkinson, 2003), allowing inferences about language evolution that reach back to the Upper Palaeolithic period (Pagel et al., 2013). On shorter time scales, the analysis of grammatical and morphological changes has shed light on the dynamics of language over the course of half a millennium (Lieberman et al., 2007).
The availability of large amounts of linguistic data after the expansion of the Internet opened a new era for quantitative studies in language dynamics. The analysis of large volumes of digitized text provided insights into the statistics of word use in social communication groups (Altmann et al., 2011), changes in literary style over a historical corpus of literary works (Hughes et al., 2012), and sentiment analysis in Twitter (Twenge et al., 2012), among others.
In a broader context, language usage can be regarded as an important component among the factors that contribute to complex interactions within social systems (Castellano et al., 2009). As such, language is both an input and an output within and across groups in human societies. Whereas notable contingencies like important historical events can exert a strong influence on subsets of the vocabulary used at a particular time, slower processes of a purely linguistic nature or of cultural change can manifest themselves as overall trends in the use of certain words. Until recently, the study of this type of pattern in language evolution had to rely on scattered and sparse evidence from a small number of written sources. This situation suddenly changed in 2010 when Google Inc. made available the Google Ngram database, consisting of word frequency counts from nearly 5 million digitized books covering a range of more than 500 years. The initial studies carried out on this large database have allowed scholars to address previously inaccessible questions about language usage. In particular, for the first time it was possible to study quantitatively aspects of cultural change as reflected in language (Michel et al., 2011; Greenfield, 2013), and to rigorously assess overall vocabulary drift over a time span of two centuries (Bochkarev et al., 2014). Moreover, methods inspired by the statistical mechanics of complex systems were used to study the dynamics of word birth and death (Petersen et al., 2012a), long-range fractal correlations in word frequencies over centuries (Gao et al., 2012), and the scaling behaviour of word frequencies over time as represented by Zipf's (1949) and Heaps' (1978) laws (Petersen et al., 2012b; Gerlach and Altmann, 2013).
In the present study we employed data from the Google Ngram database to analyse the temporal evolution in the use of words over the span of more than three centuries, from 1700 to 2008. In particular, we focused on systematic patterns of change in the relative prevalence of common nouns over time. The noun class was chosen because we expected it to be the most semantically relevant group of words and, as such, to bear more information on patterns of cultural change over the time scales considered. However, we also obtained results for verbs which, although affected to a lesser degree, also showed systematic variation over time.
As a natural hierarchical measure of the prevalence of each individual word i within a given set of words at time t, we propose to consider the word's rank $r_i(t)$ in a list where the set is ordered by decreasing frequency. This measure makes it possible to quantify the relative changes in importance, or popularity, among words over time (Cocho et al., 2015). An additional reason for focusing on the rank is that the instantaneous frequencies of words are, in fact, strongly dependent on corpus size which, in our case, grows by orders of magnitude from 1700 to 2008. For instance, the English corpus of 1-grams (that is, single words) grows from a total of 3.7 million words in 1700 to more than 19 billion in 2008. The rank, on the other hand, is limited to positive integers up to the lexicon size, and is therefore expected to provide a more robust measure of word relevance (see Supplementary Information for a detailed analysis of this point).
Our analysis reveals the presence of persistent oscillations in word use, shared by all the words studied and with a well defined period of around 14 years. Moreover, these oscillations present a rich relationship in their phases, in the form of two largely independent features: one consists of a global word-independent phase modulation related to overall shifts in the vocabulary at specific times, while the other is given by a word-dependent modulation that induces coherent phase behaviour for semantically related groups of words.
To quantify the relative change in the hierarchical position of a word, we introduce the following quantity:

$$\rho_i(t) = \log r_i(t+1) - \log r_i(t), \qquad (1)$$

where the time t is measured in years. This quantity gives the logarithmic rank variation per year, and is closely related to the log frequency return used by Petersen et al. (2012a). For any given word i, $\rho_i(t)$ represents a time series quantifying changes in relative word prevalence. Our results concern the ensemble information present in the temporal behaviour given by Equation (1) for all i's. The details of our analysis are presented for the English language, but evidence of similar behaviour is provided for French, German, Italian, Russian and Spanish. The specific data for our study come from the 2012 version of the Google Ngram database. We focus on 1-gram data, which consist of single-word frequency counts for every year independently. The 2012 version of the database is annotated with parts-of-speech tags (Lin et al., 2012), which allow the extraction of particular word classes. In all cases, we considered common nouns, defined as those that appeared at least 50,000 times over the whole period considered. The rank for each noun in each year was defined from the subset of extracted common nouns. Capitalization was ignored and, in the case of English, all nouns were converted to singular form in order to improve the statistics. Finally, Equation (1) requires that the rank be defined for all years considered. Thus, we only kept a core vocabulary of 5,360 common nouns that were used in every year from 1700 to 2008.
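The rank-based measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: natural logarithms and a forward difference between consecutive years are assumed (the text specifies only a logarithmic rank variation per year), ties are broken alphabetically, and the function names and toy counts are hypothetical.

```python
import math

def ranks_by_year(counts):
    """counts: dict word -> list of yearly counts (same length for every word).
    Returns dict word -> list of ranks per year (rank 1 = most frequent)."""
    words = list(counts)
    n_years = len(counts[words[0]])
    ranks = {w: [] for w in words}
    for t in range(n_years):
        # Order by decreasing count; alphabetical tie-break is an assumption.
        ordered = sorted(words, key=lambda w: (-counts[w][t], w))
        for position, w in enumerate(ordered, start=1):
            ranks[w].append(position)
    return ranks

def log_rank_variation(rank_series):
    """Assumed form: rho(t) = log r(t+1) - log r(t), one value per pair of years."""
    return [math.log(b) - math.log(a)
            for a, b in zip(rank_series, rank_series[1:])]

# Hypothetical toy counts for three words over four years.
toy_counts = {
    "king":    [90, 80, 40, 70],
    "food":    [50, 85, 60, 60],
    "chicken": [10, 20, 65, 30],
}
toy_ranks = ranks_by_year(toy_counts)
toy_rho = {w: log_rank_variation(r) for w, r in toy_ranks.items()}
```

With this sign convention, a positive value means that the word moved to a higher rank number, that is, it became relatively less frequent within the set.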

Methods
Data preparation. We used the 2012 version of the Google Ngram database for 1-grams, consisting of word frequency counts per year in the interval 1520-2008 (Google Inc., 2013). For English, our starting database was the 1-gram counts from 1700 because, although the available data start at 1520, the data for the first 200 years are rather sparse. Using the parts-of-speech tags included in the database (Lin et al., 2012), we extracted the 1-gram information for the noun and verb classes. In order to improve the statistics, all nouns were converted to singular form and verbs to their infinitive forms. From these words we only kept those that had an accumulated occurrence of at least 50,000 times in the interval 1700-2008 and that had been used in every single year in that interval. A further restriction was to use only words written entirely with letter characters, thus avoiding numbers and other special characters. This procedure finally left a core vocabulary of 5,360 nouns and 2,342 verbs. For other languages, because of higher data sparseness in the 18th century, we carried out the same procedure starting from 1800 (see Supplementary Table S1).
Wavelet analysis and pseudo-period. The wavelet transform of a real function g(t) is defined as

$$w(u, s) = \frac{1}{\sqrt{s}} \int_{-\infty}^{\infty} g(t)\, \psi\!\left(\frac{t-u}{s}\right) dt, \qquad (2)$$

where u is the time shift and s is the wavelet scale. In our analysis we use the Mexican Hat Wavelet, given by

$$\psi(t) = \frac{2}{\sqrt{3\sigma}\,\pi^{1/4}} \left(1 - \frac{t^2}{\sigma^2}\right) \exp\!\left(-\frac{t^2}{2\sigma^2}\right), \qquad (3)$$

with σ = 1. A natural way to determine a period T corresponding to the scale s is to find the extrema over s of the wavelet transform w(u, s) of a periodic function g(t) = cos(2πt/T). By computing the transform using the kernel defined in Equation (3), the period that maximizes the absolute value of the wavelet coefficients at scale s is given by $T = (8\pi^2/5)^{1/2}\, s$.
Clustering and network structure. In order to apply the clustering algorithm, every word i was represented by the time series $\rho_i(t)$, and the correlation between any two time series was used to quantify similarity between the corresponding words. The correlation between $\rho_i(t)$ and $\rho_j(t)$ is defined as

$$C(i, j) = \frac{\langle \rho_i \rho_j \rangle - \langle \rho_i \rangle \langle \rho_j \rangle}{\sigma_i \sigma_j}, \qquad (4)$$

where the averages $\langle \cdot \rangle$ are taken over time, and $\sigma_i$ and $\sigma_j$ are the respective standard deviations. Then, a distance can be defined between words i and j as D(i, j) = 1 − C(i, j). The clustering algorithm proceeds by progressive agglomeration. It starts with as many initial clusters as words, and then groups the pair of words separated by the smallest distance. It then proceeds iteratively, merging the closest clusters into larger ones. In our particular implementation, the distance between two clusters was taken as the average of all the distances between the elements belonging to the two clusters.
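A minimal Python sketch of this agglomerative procedure is given below. It assumes the Pearson correlation for C(i, j) and average linkage between clusters, as described in the text; the implementation is a naive one meant for small examples only, and the function names and toy series are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

def average_linkage(series):
    """series: dict word -> time series.
    Returns the list of merges as (cluster_a, cluster_b, distance),
    where clusters are tuples of words and distance is the average of
    D(i, j) = 1 - C(i, j) over all cross-cluster pairs."""
    words = sorted(series)
    dist = {(a, b): 1.0 - pearson(series[a], series[b])
            for a in words for b in words if a != b}
    clusters = [(w,) for w in words]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = clusters[i], clusters[j]
                d = sum(dist[(a, b)] for a in ci for b in cj) / (len(ci) * len(cj))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical toy series: two in-phase pairs oscillating in opposite phase.
toy_series = {
    "king":    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    "queen":   [0.9, -1.1, 1.2, -0.8, 1.0, -1.0],
    "food":    [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0],
    "chicken": [-0.8, 1.1, -1.0, 0.9, -1.2, 1.0],
}
toy_merges = average_linkage(toy_series)
```

In this toy case the two in-phase pairs are merged first, and the final merge joins the two anti-phase groups at a distance close to 2.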
The word network is built by establishing links between words whose correlation equals or exceeds a given threshold θ. Thus, the corresponding adjacency matrix is defined as

$$A_{ij} = \begin{cases} 1 & \text{if } i \neq j \text{ and } C(i, j) \geq \theta, \\ 0 & \text{otherwise.} \end{cases} \qquad (5)$$

The division of the networks into communities is based on the maximization of network modularity (Clauset et al., 2004; Newman, 2006), which is defined as

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j), \qquad (6)$$

with m the total number of links in the network. In this expression, the sum runs over all the pairs of nodes {i, j}, and $k_i$ and $k_j$ are the respective degrees. The Kronecker symbol $\delta(c_i, c_j)$ equals 1 if i and j belong to the same community, and 0 otherwise.
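The thresholding step and the modularity of a given partition can be illustrated with the short Python sketch below. It evaluates the standard Newman modularity for a fixed assignment of nodes to communities and does not perform the optimization itself; the function names and the toy correlation matrix are hypothetical.

```python
def adjacency_from_correlation(corr, theta):
    """A[i][j] = 1 if i != j and corr[i][j] >= theta, else 0."""
    n = len(corr)
    return [[1 if i != j and corr[i][j] >= theta else 0 for j in range(n)]
            for i in range(n)]

def modularity(adj, community):
    """Standard modularity Q of an undirected network for a fixed partition.
    community[i] is the community label of node i."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)  # twice the number of links
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m

# Hypothetical toy correlations: two tightly correlated triplets of words
# (nodes 0-2 and 3-5) bridged by a single strong correlation between 2 and 3.
strong_pairs = {(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)}
toy_corr = [[1.0 if i == j
             else 0.9 if (min(i, j), max(i, j)) in strong_pairs
             else 0.1
             for j in range(6)] for i in range(6)]

toy_adj = adjacency_from_correlation(toy_corr, theta=0.65)
q_split = modularity(toy_adj, [0, 0, 0, 1, 1, 1])   # two communities
q_single = modularity(toy_adj, [0, 0, 0, 0, 0, 0])  # everything together
```

For this toy network, splitting the nodes into the two triplets gives Q = 5/14, whereas placing all nodes in a single community gives Q = 0, so the split is the preferred partition.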

Results
As is the case for frequency, the ranks of words change from year to year, reflecting their relative prevalence in usage (Cocho et al., 2015). Fig. 1a and c show the evolution of the rank as a function of time for two groups of semantically related words. The selected words occupy a wide range of rank positions. While, for instance, the word king fluctuates over relatively low ranks (corresponding to high frequencies), duchess always occupies a rank higher than 1,000. A similar situation is found for the food-related terms, where the ranks of food and chicken differ from each other by some thousands. As for their time variations, although there is some correspondence in the positions of local maxima and minima within each group, correlations over longer time spans are generally weak. However, as shown in Fig. 1b and d, a coherent oscillatory pattern is revealed when we look at the logarithmic rank variation $\rho_i(t)$, with the different curves in each group following closely similar behaviour. All the words in each group show a remarkably consistent common pattern, in which they systematically increase their popularity over certain intervals and decline in the intervening years. In the following, we quantify this observation at levels ranging from individual words to semantically related groups.
Periods and phases of oscillations. To characterize the periodicity of the oscillations we first estimated the periods of the individual words by means of a wavelet analysis of their respective $\rho_i$ time series. Figure 2 illustrates the procedure we used to obtain the periods for every word in the core vocabulary. Briefly, we computed a scalogram and then obtained the set of local extrema. Each of these extrema can be associated with a pseudo-period, from which the histogram of periods is then computed (see Methods for details). Figure 3 shows the resulting distribution of periods, for the whole time range (Fig. 3a), and broken down by century (Fig. 3b to d). Fig. 3a shows a narrow peak at around 14 years with a small kink close to 50 years. In the figures corresponding to the individual centuries, it is apparent that oscillatory modes with longer periods increase in importance from century to century. In particular, the kink around 50 years is clearly a contribution of the 20th century. The effect can also be noticed in the individual time series of Fig. 1b and d as a tendency of the oscillations to slow down towards the present.
In addition to the period of signals, another aspect related to the specific timing structure of the oscillations is given by their phase. While the data in Fig. 1b and d show that the two groups of words exhibit similar oscillation periods, the phases between them are different. For instance, towards the year 1900 the first group is changing downwards while the second group follows an opposite trend.
The study of phase relationships across the whole core vocabulary reveals two independent modulations affecting the phase of oscillations. Figure 4a shows the time evolution of all 5,360 nouns arranged in a matrix-like structure following a random order. Yellow and blue shades respectively indicate high (positive) and low (negative) values of $\rho_i(t)$. As evidenced by the vertical strips of either yellowish or bluish tones, a global modulation in the phase, affecting all words more or less uniformly, is apparent. There are specific time ranges in which the words in the core vocabulary tend to move towards higher ranks, while in other intervals they tend to move down. These events signal major shifts in the overall lexicon: the fact that all nouns in the core vocabulary move towards higher ranks means that other nouns, which are not part of the core, become temporarily more important. It is striking that these events occur repeatedly with effects all across the core nouns. The curve on top of the matrix is the average of $\rho_i(t)$ over the core vocabulary, representing the mean modulation of variations in its usage.
Clusters and networks of nouns. To analyse relationships in the time evolution that are more dependent on specific words, the mean modulation over the core vocabulary was subtracted from the time evolution of all words. The resulting time series were then grouped by means of a hierarchical clustering algorithm (see Methods for details), leading to a hierarchical tree structure, where closer topological distance across the dendrogram means greater similarity in the time series of the words. The first levels of the resulting tree are depicted in Fig. 4b together with the corresponding reordering of the time series. The most remarkable feature in the reordered dataset is the presence of numerous word groups that share specific phase relationship patterns over time. For many of the words, the time evolution is similar to the general trend observed in Fig. 4a, while others exhibit very different behaviours. By inspecting the sequence of words in the ordering given by the clustering, it is apparent that most of the structure in the dataset is given by groups of semantically related nouns. Table 1 shows a few examples of contiguous groups extracted from the ordering produced by clustering (Fig. 4b). The clear semantic relationship between the words in each group emphasizes the close connection between meaning and changes of relative prevalence in the vocabulary.
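The subtraction of the mean modulation can be sketched as follows. The snippet assumes that the mean is a plain unweighted average over all words at each time step; the function name and toy values are hypothetical.

```python
def remove_global_modulation(rho):
    """rho: dict word -> list of log rank variations (same length for all words).
    Returns (mean_series, residuals), where mean_series[t] is the unweighted
    average over words at time t and residuals[w][t] = rho[w][t] - mean_series[t]."""
    words = list(rho)
    n_times = len(rho[words[0]])
    mean_series = [sum(rho[w][t] for w in words) / len(words)
                   for t in range(n_times)]
    residuals = {w: [rho[w][t] - mean_series[t] for t in range(n_times)]
                 for w in words}
    return mean_series, residuals

# Hypothetical toy values for two words over three time steps.
toy_rho = {
    "senate": [0.3, -0.1, 0.2],
    "planet": [0.1, -0.3, 0.4],
}
global_mean, residual = remove_global_modulation(toy_rho)
```

By construction, the residual series average to zero across words at every time step, so any remaining correlation between two words reflects word-specific rather than global behaviour.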
While clustering highlights the existence of well defined groups of words with similar time behaviour (and, additionally, with close semantic relationships), capturing more complex structure within each group requires characterizing detailed relations between word pairs. In Fig. 5a we show the correlation matrix for the time evolution of all the words in the core vocabulary (see Methods). The ordering of indices in the matrix is the same as for the result of clustering. As a consequence, most positive correlations (yellow shades) are distributed along the diagonal. However, the still very significant off-diagonal structure suggests that a description in terms of network topology would be more appropriate to represent the relationships between words. To extract a network from the correlation matrix we introduce a threshold θ. Two words i and j are connected in the network if their correlation C(i, j) is larger than or equal to the threshold.
For a given value of θ the resulting network is generally not connected, but instead consists of a number of mutually disconnected components. Figure 5b shows the fraction of nodes (that is, words) in the largest component as a function of the threshold. As expected, for sufficiently small values of θ the largest component is comparable in size to the total network, containing a fraction of nodes equal to or very close to one. As the threshold grows, however, there is a narrow range around θ ≈ 0.65 where the fraction of nodes in the largest component drops rather abruptly, indicating that the network splits into a large number of small components. We verified that at the critical threshold θ* = 0.65 the degree distribution of the network, shown in Fig. 5c, approximately follows a power law, suggesting scale-free structure (Barabási and Albert, 1999). Figure 5d shows a diagram of the largest connected component at θ*, comprising 2,670 nodes.
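The dependence of the largest component on the threshold can be illustrated with a short Python sketch. It uses a breadth-first search over the thresholded correlation matrix, which is one straightforward way to find connected components and not necessarily the procedure used for the analysis; the function name and the toy matrix are hypothetical.

```python
from collections import deque

def largest_component_fraction(corr, theta):
    """Fraction of nodes in the largest connected component of the network
    in which i and j are linked when corr[i][j] >= theta (i != j)."""
    n = len(corr)
    seen = [False] * n
    largest = 0
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search from an unvisited node.
        seen[start] = True
        queue = deque([start])
        size = 0
        while queue:
            i = queue.popleft()
            size += 1
            for j in range(n):
                if not seen[j] and j != i and corr[i][j] >= theta:
                    seen[j] = True
                    queue.append(j)
        largest = max(largest, size)
    return largest / n

# Hypothetical toy correlations between four words forming a chain
# with progressively weaker links: 0-1 (0.9), 1-2 (0.7), 2-3 (0.5).
toy_corr = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.7, 0.1],
    [0.1, 0.7, 1.0, 0.5],
    [0.1, 0.1, 0.5, 1.0],
]
fractions = {theta: largest_component_fraction(toy_corr, theta)
             for theta in (0.4, 0.65, 0.8, 0.95)}
```

As the threshold increases, links are removed and the largest component shrinks step by step, from the whole network down to isolated nodes.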
The largest component of the noun network also presents small-world features (Watts and Strogatz, 1998). At the critical threshold, its mean topological distance (diameter) is 6.77, while its mean clustering coefficient is 0.29. These values have to be compared with those obtained for a random graph. We have found that, in the noun network, the largest component has a diameter only 50% larger than the corresponding random graph while, on the other hand, the clustering coefficient is 100 times that of the random counterpart. These values indicate a small-world structure, with both locally strong connectivity and long-range connections joining distant parts of the network.
Networks that share these features may have a structure where different parts of the network are naturally segregated into communities. Within each community, nodes are preferentially connected to other members of the same community, with only a few links directed towards other communities. To test this possibility on the noun network, we extracted its communities using a modularity optimization algorithm (Clauset et al., 2004; Newman, 2006) (see Methods) as implemented in Mathematica (Wolfram Research, 2016). Figures 6 and 7 show two examples of the communities obtained from the noun network. To reveal more structure within each community, we proceeded to further divide them into sub-communities, by simply iterating the same algorithm once more. The community shown in Fig. 6 has a clear thematic link, consisting almost exclusively of nouns relating to Ancient Rome. Interestingly, the further subdivision into sub-communities shows another layer of structure implicit in the correlations between words. In particular, the sub-community in which nodes are represented by red disks has a clear pre-Imperial flavour, while the sub-network that consists of yellow disks is strongly Imperial. As the panels including the time series for the members of those two sub-communities show, oscillations have distinct temporal features, similar for words within the same sub-community and different across sub-communities. The community shown in Fig. 7, in turn, is thematically linked with astronomy. As noted above, further subdivision into sub-communities clearly shows that time evolution is strongly correlated for words with a strong semantic relationship.
Figure 7 | Example of a community in the noun network. As in Fig. 6, for words related to astronomy.
The focus of our analysis has been on the noun class of words, since it represents the most semantically informative group of words. However, we have also checked that similar oscillations are found among verbs, albeit with typically smaller amplitudes compared to those found for nouns. Figure 8 shows histograms obtained from pooling all the values of $\rho_i(t)$ for all nouns and verbs in each dataset.
We verified that oscillatory patterns similar to those found for English are also observed for nouns in French, German, Italian, Russian and Spanish. For these languages, however, the analysis was limited to the last two centuries because of their scarce representation in the Google database during the 1700s. The respective distributions of oscillation periods, in particular, are strikingly coincident with each other (see Supplementary Information).

Discussion
Using data extracted from the Google Ngram database, we have disclosed a systematic oscillatory pattern in the use of common nouns over the last three centuries. This regularity, which has been quantified in terms of the relative prevalence of different words in the vocabulary, was consistently confirmed to occur in a set of several thousand nouns, and was also observed for verbs. Characterization of the oscillations revealed a well-defined period of around 14 years, with a tendency to become longer and more spread towards the 20th century. The phase of oscillations, on the other hand, can broadly vary from word to word, but high correlation between phases was a typical signature of semantic affinity between the respective words. This trend made it possible, on the basis of comparing the oscillations of different words, to build a network whose communities contain nouns related by their meaning.
A preliminary analysis of other languages of the Indo-European family revealed oscillations with similar characteristics. Specifically, the distributions of oscillation periods over the last two centuries were closely similar to that found for English.
At present, we do not have an explanation for the oscillatory behaviour of word prevalence. As advanced in the Introduction, however, we expect that this behaviour is related to changes in the cultural environment that, in turn, steer the thematic focus of the writers represented in the Google database. Oscillatory dynamics, moreover, have been demonstrated in other areas of the social sciences, such as economics (Morgan, 1990), where the quantification of cyclic behaviour is more direct than for cultural changes.
On the other hand, the inference of features of cultural evolution from the analysis of Google Ngram time series has recently been criticised, mainly on the basis that a database built from book digitization may be strongly biased towards certain thematic areas, or by a handful of influential writers (Pechenick et al., 2015). However, although this warning is conceptually well grounded, objective evidence that the biases observed in the Google database do not respond to genuine trends in cultural focus has not yet been produced. Leaving aside certain variations ascribable to linguistic or stylistic evolution, reported biases correspond either to localized (inter-decade) frequency changes of a few significant words, or to long-range, quasi-monotonic thematic drifts, in particular towards science and technology. None of these biases, nor their putative causes, point to the possibility of oscillatory behaviour in word usage throughout the database. In contrast, our observation of sustained oscillations in word prevalence, and the fact that they consistently occur over vocabularies comprising thousands of words in different languages, confer statistical significance on our results, beyond presumptive distortions in the three-century-long Google selection of books.