The dynamics of correlated novelties

Novelties are a familiar part of daily life. They are also fundamental to the evolution of biological systems, human society, and technology. By opening new possibilities, one novelty can pave the way for others in a process that Kauffman has called “expanding the adjacent possible”. The dynamics of correlated novelties, however, have yet to be quantified empirically or modeled mathematically. Here we propose a simple mathematical model that mimics the process of exploring a physical, biological, or conceptual space that enlarges whenever a novelty occurs. The model, a generalization of Polya's urn, predicts statistical laws for the rate at which novelties happen (Heaps' law) and for the probability distribution on the space explored (Zipf's law), as well as signatures of the process by which one novelty sets the stage for another. We test these predictions on four data sets of human activity: the edit events of Wikipedia pages, the emergence of tags in annotation systems, the sequence of words in texts, and listening to new songs in online music catalogues. By quantifying the dynamics of correlated novelties, our results provide a starting point for a deeper understanding of the adjacent possible and its role in biological, cultural, and technological evolution.

1 Urn model with triggering

1.1 Model definition
In the main text we introduced the urn model with triggering. Briefly, an ordered sequence S was constructed by picking elements (or balls) from a reservoir (or urn) U initially containing N_0 distinct elements. Both the reservoir and the sequence increased their size according to the following procedure. At each time step: (i) an element is randomly extracted from U with uniform probability and added to S; (ii) the extracted element is put back into U together with ρ copies of it; (iii) if the extracted element has never been used before in S (it is a new element in this respect), then ν + 1 brand new distinct elements are added to U.
Note that the number of elements N of S, i.e. the length |S| of the sequence, equals the number of times t we repeated the above procedure. If we let D denote the number of distinct elements that appear in S, then the total number of elements in the reservoir after t steps is |U|_t = N_0 + (ν + 1)D + ρt.
In the following, we shall also consider a second and slightly different version of the model, in which the reinforcement does not act when an element is chosen for the first time. Hence, point (ii) of the previous rules will be changed into: (ii.a) the extracted element is put back in U together with ρ copies of it only if it is not new in the sequence.
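The two versions of the procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the results: the function name `simulate_urn`, the integer element ids and the flag `reinforce_new` are assumptions introduced here.

```python
import random


def simulate_urn(n0, rho, nu, steps, reinforce_new=True, seed=0):
    """Simulate the urn model with triggering.

    reinforce_new=True follows rule (ii); reinforce_new=False follows
    rule (ii.a), where reinforcement is skipped for new elements.
    Returns the sequence S and the final urn U (lists of element ids).
    """
    rng = random.Random(seed)
    urn = list(range(n0))   # U: one ball for each initial distinct element
    next_id = n0            # id assigned to the next brand new element
    seen = set()            # elements that have already appeared in S
    sequence = []
    for _ in range(steps):
        ball = rng.choice(urn)            # (i) uniform draw; the ball stays in U
        sequence.append(ball)
        is_new = ball not in seen
        if reinforce_new or not is_new:   # (ii) or (ii.a): add rho copies
            urn.extend([ball] * rho)
        if is_new:                        # (iii) add nu + 1 brand new elements
            seen.add(ball)
            urn.extend(range(next_id, next_id + nu + 1))
            next_id += nu + 1
    return sequence, urn
```

For the first version the final urn size should equal N_0 + (ν + 1)D + ρt, which gives a quick consistency check on any run.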

Computation of the asymptotic Heaps' and Zipf's laws
We discuss here the asymptotic behaviour of both the number of distinct elements D(t) appearing in the sequence and the frequency-rank distribution f(R) of the elements in the sequence S. We will show that both versions of the urn model above predict a Heaps' law for D(t) and a frequency-rank distribution f(R) with a fat-tail behavior. Our calculations yield simple formulas for the Heaps' law exponent and the exponent of the asymptotic power-law behavior of the frequency-rank distribution in terms of the model parameters ρ and ν.
Strictly speaking, Zipf's law requires an inverse proportionality between the frequency and rank of the considered quantities [1]. In the following, however, we shall always refer instead to a generalized version of Zipf's law, in which the dependence of the frequency on the rank is power-law-like in the tail of the distribution, i.e. at large ranks.

Heaps' law
In the first version of the model, the time dependence of the number D of different elements in the sequence S obeys the following differential equation:

$$\frac{dD}{dt} = \frac{U_D(t)}{U(t)} = \frac{N_0 + \nu D}{N_0 + (\nu + 1) D + \rho t}, \tag{1}$$

where U_D(t) is the number of elements in the reservoir that at time t have not yet appeared in S, and U(t) = |U|_t is the total number of elements in the reservoir at time t. The term νD in the numerator of the rightmost expression comes from the fact that each time a new element is introduced in the sequence, U_D(t) is increased by ν elements (since ν + 1 brand new elements are added to U, while the chosen element is no longer new). Due to the inherently discrete character of D and t, Eq. (1) is valid asymptotically for large values of D and t.
To analyze both versions of the model simultaneously, it is convenient to define a parameter a ≡ ν + 1 for the first version and a ≡ ν + 1 − ρ for the second version.
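In order to obtain an analytically solvable equation, and since we are interested in the behaviour at large times t ≫ N_0, we approximate equation (1) by

$$\frac{dD}{dt} = \frac{\nu D}{a D + \rho t}. \tag{2}$$

By introducing the auxiliary variable z = D/t and performing some straightforward algebra we obtain the asymptotic behaviour of D(t) for large t:

$$D \sim \begin{cases} (\rho - \nu)^{\nu/\rho}\, t^{\nu/\rho} & \text{if } \nu < \rho, \\[4pt] \dfrac{\nu}{a}\, \dfrac{t}{\log t} & \text{if } \nu = \rho, \\[4pt] \dfrac{\nu - \rho}{a}\, t & \text{if } \nu > \rho. \end{cases} \tag{3}$$

For completeness, we note that both versions of the model can be regarded as the coarse-grained equivalent of a two-color asymmetric Polya urn model [2]. In particular, within that finer framework the substitution matrices (denoted M_1 for the first version of the model and M_2 for the second) would be:

$$M_1 = \begin{pmatrix} \rho & 0 \\ \rho + 1 & \nu \end{pmatrix}, \qquad M_2 = \begin{pmatrix} \rho & 0 \\ 1 & \nu \end{pmatrix}.$$

In this interpretation, the elements that have already appeared in S are represented by balls of one color, while those that have not appeared yet correspond to balls of the other color. Each row lists the net change in the number of balls of the first (already appeared) and second (not yet appeared) color when a ball of the first or second color, respectively, is drawn.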
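The algebra behind the asymptotic form of D(t) can be sketched as follows. This is an illustrative derivation, assuming that the approximated equation reads dD/dt = νD/(aD + ρt); prefactors that depend on the initial condition are not tracked.

```latex
% Assumption: the approximated equation is dD/dt = \nu D / (a D + \rho t).
\begin{align*}
D = z\,t \;\Rightarrow\; \frac{dD}{dt} &= z + t\,\frac{dz}{dt} = \frac{\nu z}{a z + \rho}, \\
t\,\frac{dz}{dt} &= \frac{z\,(\nu - \rho - a z)}{a z + \rho}. \\[4pt]
\nu > \rho:&\quad z \to z^{*} = \frac{\nu - \rho}{a}
  \;\Rightarrow\; D \simeq \frac{\nu - \rho}{a}\, t, \\
\nu < \rho:&\quad z \to 0,\; t\,\frac{dz}{dt} \simeq -\frac{\rho - \nu}{\rho}\, z
  \;\Rightarrow\; z \propto t^{\nu/\rho - 1},\; D \propto t^{\nu/\rho}, \\
\nu = \rho:&\quad t\,\frac{dz}{dt} \simeq -\frac{a}{\rho}\, z^{2}
  \;\Rightarrow\; \frac{1}{z} \simeq \frac{a}{\rho} \log t,\;
  D \simeq \frac{\nu}{a}\, \frac{t}{\log t}.
\end{align*}
```

The three regimes correspond to the fixed point of z being positive (ν > ρ), or zero with a power-law (ν < ρ) or logarithmic (ν = ρ) approach.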
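Zipf's law
Making the same approximations as above, the continuous dynamical equation for the number of occurrences n_i of an element i in the sequence S can be written as

$$\frac{dn_i}{dt} = \frac{\rho\, n_i}{N_0 + a D + \rho t}, \tag{4}$$

since an element that has occurred n_i times has ρ n_i copies in the reservoir, up to an additive constant. Two cases can be distinguished:

1. ν ≤ ρ, when lim_{t→+∞} D/t = 0. By considering only the leading term for t → +∞, one has

$$\frac{dn_i}{dt} \simeq \frac{n_i}{t}.$$

Let t_i denote the time at which the element i occurred for the first time in the sequence. Then the solution for n_i(t) starting from the initial condition n_i(t_i) = 1 is given by

$$n_i(t) = \frac{t}{t_i}. \tag{5}$$

Now consider the cumulative distribution P(n_i ≤ n). From Eq. (5), we can write P(n_i ≤ n) = P(t_i ≥ t/n) = 1 − P(t_i < t/n). Since the fraction of distinct elements that entered the sequence before time t/n is D(t/n)/D(t), this leads to the estimate:

$$P(n_i \le n) \simeq 1 - \frac{D(t/n)}{D(t)} \simeq 1 - n^{-\nu/\rho}. \tag{6}$$

2. ν > ρ, when D ≃ ((ν − ρ)/a) t. Again considering t ≫ N_0, we write:

$$\frac{dn_i}{dt} \simeq \frac{\rho}{\nu}\, \frac{n_i}{t}, \tag{7}$$

which yields the solution

$$n_i(t) = \left(\frac{t}{t_i}\right)^{\rho/\nu}. \tag{8}$$

Proceeding as in the previous case, we find P(n_i ≤ n) = P(t_i ≥ t n^{−ν/ρ}) = 1 − P(t_i < t n^{−ν/ρ}), and thus

$$P(n_i \le n) \simeq 1 - n^{-\nu/\rho}, \tag{9}$$

obtaining the same functional expression of the asymptotic power-law behavior of the frequency-rank distribution as in the previous case.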
The probability density function of the occurrences of the elements in the sequence is therefore P(n) = ∂P(n_i ≤ n)/∂n ∼ n^{−(1+ν/ρ)}, which corresponds to a frequency-rank distribution f(R) ∼ R^{−ρ/ν}. Note that the estimates in equations (6) and (9) have been derived under the assumption that t/n ≫ 1, i.e. in the tail of the frequency-rank distribution. In this respect, it is important to recognize that Zipf's and Heaps' laws are not trivially and automatically related, as is sometimes claimed. We certainly agree that Heaps' law can be derived from Zipf's law by the following random-sampling argument: if one assumes a strict power-law behaviour of the frequency-rank distribution f(R) ∼ R^{−α} and constructs a sequence by randomly sampling from this Zipf distribution f(R), one recovers Heaps' law with the functional form D(t) ∼ t^β with β = 1/α [3,4]. But the assumption of random sampling is strong and sometimes unrealistic. If one relaxes the hypothesis of random sampling from a power-law distribution, the relationship between Zipf's and Heaps' law becomes far from trivial. In our model, and in work by others [4], the relationship β = 1/α holds only asymptotically, i.e. only for large times, with α measured on the tail of the frequency-rank distribution.
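The exponents derived above can be collected in a small helper for quick arithmetic checks. This is only a sketch: the function name `predicted_exponents` is hypothetical, and the logarithmic correction at ν = ρ is not represented.

```python
def predicted_exponents(rho, nu):
    """Asymptotic exponents predicted for the urn model with triggering.

    Returns (beta, alpha), where D(t) ~ t**beta is the Heaps' law and
    f(R) ~ R**(-alpha) is the tail of the frequency-rank distribution.
    """
    if rho <= 0 or nu <= 0:
        raise ValueError("rho and nu must be positive")
    heaps_beta = min(nu / rho, 1.0)  # sublinear growth only when nu < rho
    zipf_alpha = rho / nu            # tail exponent of f(R)
    return heaps_beta, zipf_alpha
```

For example, ρ = 8 and ν = 5 give β = 0.625 and α = 1.6, so β = 1/α; for ν > ρ the growth is linear (β = 1) while α = ρ/ν < 1.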
In the main text we presented numerical results confirming the above analytical predictions for the first version of our model. Here we report numerical results for the second version of the model (employing the definition (ii.a)), summarized in the top-left panels of Fig. S0 and Fig. S1. The robustness of the results with respect to fluctuations of the model parameters ν and ρ was checked as follows. At each time step both ρ and ν were sampled from a uniform distribution (top-right), an exponential distribution (bottom-left) and a fat-tailed distribution with diverging variance (bottom-right), all with the same mean values ρ̄ = 8 and ν̄ = 5. For the uniform distribution, ρ and ν were sampled from the intervals [0, 2ρ̄] and [0, 2ν̄], while for the fat-tailed distribution, the chosen exponents were α_ρ = (2ρ̄ − 1)/(ρ̄ − 1) and α_ν = (2ν̄ − 1)/(ν̄ − 1), which ensured the desired average values by choosing 1 as the minimum value.
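The three sampling schemes can be illustrated as below. This is a sketch under stated assumptions: the fat-tailed distribution is taken to be a Pareto density p(x) ∝ x^(−α) on x ≥ 1, the names `pareto_exponent` and `sample_parameter` are hypothetical, and the text does not say how real-valued samples are turned into integer numbers of balls, so no rounding is applied here.

```python
import random


def pareto_exponent(mean):
    """Exponent alpha of p(x) ~ x**(-alpha), x >= 1, giving this mean (> 1)."""
    return (2 * mean - 1) / (mean - 1)


def sample_parameter(mean, kind, rng):
    """Draw one fluctuating parameter value with the requested mean."""
    if kind == "uniform":
        return rng.uniform(0, 2 * mean)       # interval [0, 2 * mean]
    if kind == "exponential":
        return rng.expovariate(1 / mean)      # rate = 1 / mean
    if kind == "fat-tailed":
        alpha = pareto_exponent(mean)
        # paretovariate(k) has density k * x**(-k - 1) on x >= 1,
        # so the shape k corresponds to alpha - 1.
        return rng.paretovariate(alpha - 1)
    raise ValueError(f"unknown distribution: {kind}")
```

With mean 8 the exponent is 15/7 ≈ 2.14 and with mean 5 it is 9/4 = 2.25; both are below 3, which is why the variance diverges.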
In the case ρ < ν we recover the results of the well-known Yule-Simon Model (YSM) [5], originally proposed in the context of linguistics. In YSM, new words are added to a text (more generally a stream) with constant probability p at each time step, while with complementary probability (1 − p), a word that has already occurred is chosen uniformly from within the text (or stream) generated so far. YSM leads to a Zipf's law with an exponent −(1 − p) compatible with a linear growth in time of the number of different words. In the framework of our urn model with triggering we recover the same Zipf's exponents as well as the linear growth of D(t) if p = 1 − ρ/ν, with ρ < ν (see footnote 1). The YSM is a paradigmatic example of a model that generates a fat-tail frequency-rank distribution f(R) ∼ R^{−α} by using a rich-gets-richer mechanism. But it has the drawback that it does not reproduce both an f(R) obeying a power-law behavior and a sublinear Heaps' exponent at the same time. Moreover, the YSM cannot reproduce values of α larger than 1 (which are found empirically in the frequency-rank distribution of words in certain texts). These problems were at the basis of the famous Simon-Mandelbrot dispute [6,7,8,9,10]. In our model the introduction of the parameter ν (describing the expansion of the adjacent possible) heals these problems by confining the phenomenology of the YSM to the special case ρ < ν.

Footnote 1: We note that if ν ≫ 1 when a = ν + 1 (first version of the model), or ν ≫ ρ and ν ≫ 1 when a = ν + 1 − ρ (second version of the model), our model also reproduces the same prefactor of the linear growth of D(t) as in the YSM. This is evident by setting a = ν in Eq. (2).
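For comparison, the Yule-Simon process described above can be sketched as follows. The function name `simulate_yule_simon` and the choice of starting the stream with a single word are assumptions made for the illustration.

```python
import random


def simulate_yule_simon(p, steps, seed=0):
    """Generate a stream of integer word ids from the Yule-Simon model."""
    rng = random.Random(seed)
    sequence = [0]    # assumption: the stream starts with one word
    next_word = 1     # id of the next brand new word
    for _ in range(steps - 1):
        if rng.random() < p:
            sequence.append(next_word)   # new word with probability p
            next_word += 1
        else:
            # Copy a uniformly chosen past position of the stream, so a
            # word is picked with probability proportional to its count.
            sequence.append(rng.choice(sequence))
    return sequence
```

Choosing p = 1 − ρ/ν, for instance p = 0.375 for ρ = 5 and ν = 8, gives a number of distinct words that grows linearly with slope close to p.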

Heaps' and Zipf's laws for the urn model with semantic triggering
We turn now to the counterparts of Heaps' and Zipf's laws for the urn model with semantic triggering. For the sake of completeness we recall the model's definition. One starts with an urn U with N_0 distinct elements, divided into N_0/(ν + 1) groups, the elements in the same group sharing a common label. After choosing the first element at random, the sequence S is constructed according to the following scheme: (i) a weight 1 is given to: (a) each element in U with the same label, say A, as s_{t−1}, (b) the element that triggered the entry into the urn of the elements with label A, and (c) the elements triggered by s_{t−1}; a weight η ≤ 1 is given to any other element in U; (ii) an element s_t is chosen from U with a probability proportional to its weight and appended to the sequence; (iii) the element s_t is put back into U along with ρ additional copies of it; (iv) if the chosen element s_t is new (i.e., it appears for the first time in the sequence S), ν + 1 brand new distinct elements, all with a common brand new label, are added to U. These ν + 1 new elements are given a weight 1 at the next time step t + 1 and each time the same mother element s_t is picked.
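The scheme above can be sketched in Python as follows. This is an illustrative, unoptimized implementation and not the code behind the figures: the name `simulate_semantic_urn`, the dictionaries used for bookkeeping, and the choice to apply reinforcement and triggering to the first (uniformly drawn) element as well are assumptions.

```python
import random


def simulate_semantic_urn(n_groups, rho, nu, eta, steps, seed=0):
    """Urn with semantic triggering; the urn starts with n_groups * (nu + 1) balls."""
    rng = random.Random(seed)
    copies, label = {}, {}   # copies of each element in U; label of each element
    mother = {}              # label -> element that triggered it (None initially)
    child = {}               # used element -> label of the group it triggered

    def add_group(parent):
        new_label = len(mother)
        for _ in range(nu + 1):
            element = len(copies)
            copies[element] = 1
            label[element] = new_label
        mother[new_label] = parent
        return new_label

    for _ in range(n_groups):
        add_group(None)

    def weight(element, prev):
        if prev is None:                       # first draw: uniform over U
            return 1.0
        a = label[prev]
        related = (label[element] == a         # (a) same label as s_{t-1}
                   or element == mother[a]     # (b) mother of label A
                   or label[element] == child[prev])  # (c) triggered by s_{t-1}
        return 1.0 if related else eta

    sequence, prev = [], None
    for _ in range(steps):
        elements = list(copies)
        weights = [copies[e] * weight(e, prev) for e in elements]
        s = rng.choices(elements, weights=weights)[0]   # (ii) weighted draw
        sequence.append(s)
        copies[s] += rho                                # (iii) reinforcement
        if s not in child:                              # (iv) s is new
            child[s] = add_group(s)
        prev = s
    return sequence, copies, label
```

Each draw scans the whole urn, so the sketch is only suitable for short sequences; setting `eta=1` makes all weights equal and reduces it to the urn model with triggering.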
Note that if η = 1 this model corresponds to the simple urn model with triggering introduced earlier. Figures S2 and S3 report numerical results for the Heaps' and Zipf's laws respectively, for some values of the parameters of the model ν, ρ and η. For this modified model with semantic triggering, the relation α = 1/β between the exponent β of the Heaps' law and the exponent α of the Zipf's law continues to hold asymptotically, i.e. for large times, with α measured on the tail of the frequency-rank distribution. In particular, the time at which the above relation starts to hold depends on the exponent β of the Heaps' law: larger times are needed for smaller β. The existence of a pre-asymptotic regime for the Zipf's law is also observed in real datasets, both for aggregated (see Fig. 1 of the main text) and for non-aggregated data (see the corresponding Section below). It is worth pointing out that this feature is captured only by the model with semantic triggering. This suggests that taking correlations into account is crucial to explain the appearance of different regimes in the statistics of real datasets.
We now outline the analysis leading to an estimate for the Heaps' exponent as a function of the model parameters ν, ρ and η. Observe that if we know the label of the last added element to the sequence S, say s, we can write for the number of distinct elements D(t) appearing in the sequence S:

$$\frac{dD}{dt} = \frac{N^{s}_{D}(t) + \eta\, N^{\bar{s}}_{D}(t)}{N^{s}(t) + \eta\, N^{\bar{s}}(t)}, \tag{10}$$

where N^{s}(t), N^{s}_{D}(t), N^{\bar{s}}(t) and N^{\bar{s}}_{D}(t) denote respectively the number of elements with label s, the number of new (never used in the sequence S) elements with label s, the number of elements with label different from s, and the number of new elements with label different from s, that are present in the reservoir U at time t.
The following relations hold:

$$N^{\bar{s}}(t) = U(t) - N^{s}(t), \qquad N^{\bar{s}}_{D}(t) = U_D(t) - N^{s}_{D}(t), \tag{11}$$

where U(t) is the total number of elements in the reservoir and U_D(t) is the number of those that have not yet appeared in S. It is worth remarking that if η = 1 one recovers Eq. (1). We now drop the hypothesis of knowing the label of the last added element, and write a general equation for D(t) of the form:

$$\frac{dD}{dt} = \sum_{k} P(k)\, \frac{(1 - \eta)\, N^{k}_{D}(t) + \eta\, U_D(t)}{(1 - \eta)\, N^{k}(t) + \eta\, U(t)}, \tag{12}$$

where the sum is over all the labels k present at time t in the reservoir U and P(k) is the probability that the last added element to the sequence S at time t had the label k.
In order to close the equation (12), we should estimate N^{k}(t) and N^{k}_{D}(t) for a generic label k. Let us start by observing that N^{k}_{D}(t) ≤ ν + 1, and this term can be neglected in the large t limit with respect to D(t).
We now leave the more complex problem of estimating N^{k}(t) and we consider instead the probability P(n) that N^{k}(t) ≡ n, substituting the sum over k in equation (12) with the sum over the labels with the same number of occurrences n in the reservoir. Using U_D(t) ≃ νD, we can thus write (asymptotically):

$$\frac{dD}{dt} \simeq \sum_{n} P(n)\, \frac{\eta\, \nu D}{(1 - \eta)\, n + \eta\, U(t)}. \tag{13}$$

We do not explicitly compute P(n), but we consider two opposite limits:
1. We retain in the sum of equation (13) only the terms n ∼ U(t). This approximation is sufficiently good when the frequency-rank distribution for the elements in S is sufficiently steep, corresponding to a high Zipf's exponent. Solving the equation (13) within this approximation, we obtain the result for the Heaps' exponent β = min(νη/ρ, 1).
2. When the probability P(n) is large only for n ≪ U(t), we can neglect in the sum of equation (13) the term n(1 − η) with respect to ηU(t). Solving the equation (13) within this approximation, we obtain: β ≃ min(ν/ρ, 1).
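The two limits can be worked out explicitly as below. This is a sketch that assumes the closed equation has the form shown on the first line and that U(t) ≃ aD + ρt at large times, so that each limit reduces to an equation of the same type as the one solved for the urn model without semantics.

```latex
% Assumptions: the closed equation below, and U(t) \simeq a D + \rho t at large t.
\begin{align*}
\frac{dD}{dt} &\simeq \sum_{n} P(n)\, \frac{\eta\,\nu D}{(1-\eta)\,n + \eta\,U(t)}, \\
n \sim U(t):&\quad (1-\eta)\,n + \eta\,U \simeq U
  \;\Rightarrow\; \frac{dD}{dt} \simeq \frac{\eta\,\nu D}{a D + \rho t}
  \;\Rightarrow\; \beta = \min\!\left(\frac{\nu\eta}{\rho},\, 1\right), \\
n \ll U(t):&\quad (1-\eta)\,n + \eta\,U \simeq \eta\,U
  \;\Rightarrow\; \frac{dD}{dt} \simeq \frac{\nu D}{a D + \rho t}
  \;\Rightarrow\; \beta \simeq \min\!\left(\frac{\nu}{\rho},\, 1\right).
\end{align*}
```

In the first limit ν is effectively replaced by ην, while in the second the dependence on η drops out.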

The random walk model for the dynamics of novelties
Our urn model with triggering, both with and without semantics, can be mapped into the framework of the exploration of an evolving graph G by a random walker (RW). In particular, the RW dynamics can be constructed as follows (see also figure S5). We start with a graph G of N_0 nodes, divided into N_0/(ν + 1) cliques, each node in the same clique sharing a common label. We then draw a link between each pair of nodes belonging to different cliques with probability η ≤ 1. Starting with the RW in a random position, and with a weight w_j = 1 for each node j, at each time step: (i) move the RW to a neighbour node or keep it on the present node (self-loops allowed) with a weight-dependent probability; (ii) reinforce the selected node weight, w_i → w_i + ρ; (iii) if the node visited is new (i.e., it is visited for the first time) add a clique with ν + 1 new nodes connected to the just visited node, each node in the new clique sharing a common label, different from all the preexisting ones. In addition draw a link between each node in the newly added clique and all the preexisting nodes of the network with probability η.

[Figure caption, beginning not recovered: ... figure S6). In each realization the sequence S has length N = 10^7. Right: Results for the time intervals distribution for the same data as for the entropy. The color code is red for the actual sequence, green for the global reshuffle of the sequence S, and blue for the local reshuffle (see text). In the inset a zoom of the first intervals' lengths is shown.]
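The random-walk rules (i) to (iii) can be sketched as below. Several details are assumptions made for the illustration: the jump probability is taken to be proportional to the weight of the target node among the current node and its neighbours, the starting node is not recorded in the sequence, and the name `simulate_random_walk` is hypothetical.

```python
import random


def simulate_random_walk(n_cliques, rho, nu, eta, steps, seed=0):
    """Random walk with reinforcement on a graph that grows by cliques."""
    rng = random.Random(seed)
    neighbours, weight, label = {}, {}, {}

    def add_clique(parent, new_label):
        old_nodes = list(neighbours)
        start = len(neighbours)
        clique = list(range(start, start + nu + 1))
        for node in clique:
            neighbours[node] = set(clique)   # clique links, self-loop included
            weight[node] = 1
            label[node] = new_label
        for node in clique:
            for other in old_nodes:
                # Always link to the triggering node, otherwise with prob. eta.
                if other == parent or rng.random() < eta:
                    neighbours[node].add(other)
                    neighbours[other].add(node)

    for k in range(n_cliques):
        add_clique(None, k)
    n_labels = n_cliques

    position = rng.randrange(len(neighbours))   # random starting node
    visited, sequence = set(), []
    for _ in range(steps):
        options = sorted(neighbours[position])
        # (i) move with probability proportional to the node weights
        position = rng.choices(options, weights=[weight[j] for j in options])[0]
        sequence.append(position)
        weight[position] += rho                 # (ii) reinforcement
        if position not in visited:             # (iii) first visit: new clique
            visited.add(position)
            add_clique(position, n_labels)
            n_labels += 1
    return sequence, neighbours, weight, label
```

With `eta=1.0` every node is linked to every other node, so the walker chooses among all nodes with probability proportional to their weights, as in the urn model with triggering.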
If η = 1 this model maps one-to-one to the urn model with triggering introduced in the main text. When η < 1 the correspondence with the urn model with semantic triggering is not one-to-one: in the case of the graph the connections between two nodes are fixed (or quenched), i.e. either they are there or they are not, whereas the possibility of going from one element to each of the others in the urn model is always probabilistic (one can imagine that this corresponds to an annealed version of the graph model, where links are continuously re-drawn according to a fixed probability). Despite this difference, the statistical properties of the two models turn out to be equivalent from a qualitative point of view also in the case η < 1.

[Figure: schematics of the two models. Panel labels: Random Walker, η, Clique, S; Reinforcement with labels; Adjacent possible with labels. Caption, beginning not recovered: ... In this case one adds this element to S (depicted at the center of the figure) and, at the same time, puts ρ additional gray balls into U, all with the same label A of the parent gray ball. On the right panel we illustrate a generic adjacent possible step of the dynamics. Here, upon drawing a new ball (red) from U, ν + 1 brand new balls are added to U, all sharing a brand new label C, along with the ρ red balls of the reinforcement step that takes place at each time step. Bottom: scheme of the random walk (RW) based model for the dynamics of novelties. Whenever a RW visits an already visited node (gray node on the left panel) one adds a gray element to S and reinforces the node's weight according to the formula w_i → w_i + ρ. Whenever the RW visits a node i for the first time (red node in the right panel), a new clique (representing the newly created adjacent possible) with ν + 1 nodes is added to the graph, all the nodes sharing a brand new label C. Each node of the clique is connected to the red node, and with a probability η to the other already existing nodes. At the same time one adds the red element to S, always reinforcing the node's weight according to the formula w_i → w_i + ρ.]
In figure S6 we report some examples of the Heaps' and Zipf's laws for the RW model, for different values of the parameters ν, ρ and η, while in figure S4 we give an example of the triggering events as measured by the entropy S associated to the labels and the distribution f(l) of triggering time intervals between two successive appearances in the sequence S of the same label (see Section Methods in the main text).
As a final remark, we note that the RW modeling scheme allows one to more naturally extend the structure of the semantic relations between the different elements. The semantic relations are in fact encoded in the growing graph topology, and one can imagine different ways of linking the new nodes, corresponding to more complex and realistic semantic structures.

3 Details of the datasets used

3.1 Gutenberg Corpus
The corpus of English texts used in the analysis was collected by a crawl of the material available at the Gutenberg Project ebook collection [11]. The crawl was carried out in February 2007 and resulted in a set of about 7500 non-copyrighted ebooks in plain ASCII format. After a filtering procedure used to remove all non-English texts from the analysis, we were left with ca. 4600 texts, dealing with diverse subjects and including both prose and poetry. In total, the corpus consisted of about 2.8 × 10^8 words, with about 5.5 × 10^5 different words. In the analysis we ignored capitalization. Words sharing the same lexical root were considered as different, i.e., the word tree was considered different from trees. Homonyms, as for example the past tense saw of the verb to see and the noun saw, were treated as the same word. The aggregated analysis is performed by putting all the books in a random order one after the other in a single text. The texts used in the non-aggregated analysis are listed in Table S1.
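The word-identification conventions described above can be illustrated with a short sketch. The tokenization rule is an assumption (the text does not specify how words were split); here a word is a run of letters, optionally with one internal apostrophe, and the function name `count_words` is hypothetical.

```python
import re
from collections import Counter


def count_words(text):
    """Count word occurrences, ignoring capitalization and without stemming."""
    # Assumed tokenization: runs of letters, optionally with one apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words)


counts = count_words("The tree saw trees. I saw a SAW.")
# "tree" and "trees" stay distinct, while the verb and the noun "saw",
# in any capitalization, are counted as one word.
```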

Delicious
Delicious [12] is an online social bookmarking and annotation platform where users associate keywords (tags) to web resources (URLs) in a post, in order to ease the process of their retrieval. The dataset used for the present analysis [13] consists of approximately 5 × 10^6 posts, comprising about 650,000 users, 1.9 × 10^6 resources and 2.5 × 10^6 distinct tags (for a total of about 1.4 × 10^8 tags), and covering almost 3 years of user activity, from early 2004 up to November 2006. Since Delicious is case-preserving but not case sensitive, we ignored capitalization in tag comparison, and counted all different capitalizations of a given tag as instances of the same lower-case tag. The time stamp of each post was used to establish post ordering and determine the temporal evolution of the system. In the non-aggregated analysis we extracted from the Delicious dataset the posts of the three most active users (RangerRick, hidekii, PeterPeter) and two random ones (Vitelot, AndreaB).

Table S1: Texts from the Gutenberg site used in the non-aggregated analysis. For each text we report the total number of words, the total number of distinct words and the estimated values of (minus) the Zipf's exponent and of the Heaps' exponent. Note that 1/α > β since the single texts are not sufficiently long to allow the asymptotic regime to be visible, and the frequency-rank distribution curve has not yet gone through the crossover visible around 10^4 ∼ 10^5 in the analogous curve of the whole Gutenberg dataset, shown in the main article.

Last.fm
Last.fm [14] is a music website equipped with a music recommender system. Last.fm builds a detailed profile of each user's musical taste by recording details of the songs the user listens to, either from Internet radio stations, or the user's computer or many portable music devices. The data set we used [15,16] contains the complete listening histories of 1000 users until May 5th, 2009, recorded in plain text form. It contains about 1.9 × 10^7 listened tracks with information on user, time stamp, artist, track-id and track name.
For the non-aggregated analysis we consider only the data of the five most active listeners.

English Wikipedia
The English Wikipedia database we analyzed consists of 323 compressed files summing up to a total of 48 GB of disk space. The uncompressed overall size is around 20 TB. The Wikipedia database we collected [17] dates back to March 7th, 2012. Owing to the huge size of the database, we had to develop a special procedure to extract the information we needed. The computer we used to process the database is a multi-core machine mounting 8 Intel(R) Xeon(R) X3470 CPU, with a 2.93 GHz working clock frequency and 16 GB of RAM. The database contains a copy of all pages with all their edits in plain text, organized in an XML structure.
In order to perform the analysis related to the detection of triggering events, we extracted from the database the following information. First of all, we identified for each newborn page, say B, the page, say A, that internally linked to the newborn page for the first time. We call the page A the mother page of B and we identify for each edit its mother page as its label (note that several edits can have the same mother page, i.e., the same label). We then follow the steps below: (1) to each edit event we associate: (i) the unique identification number of the Wikipedia page (PID), (ii) the user (Wikipedia contributor) ID (UID), (iii) the edit ID (EID), (iv) its time stamp (TS), (v) the PID of its mother page; (2) from the list of all edits endowed with the information discussed in (1), we removed the multiple edits of the same page done by the same user, retaining his/her first edit; (3) we sorted the list (2) according to increasing time stamp.
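Steps (2) and (3) amount to a deduplication followed by a sort, which can be sketched as below. The record type `Edit`, its field names and the function `preprocess` are hypothetical and only mirror the five quantities listed in step (1); extracting them from the XML dump is not shown.

```python
from typing import NamedTuple


class Edit(NamedTuple):
    pid: int         # page ID
    uid: int         # user ID
    eid: int         # edit ID
    ts: int          # time stamp
    mother_pid: int  # PID of the mother page (the label)


def preprocess(edits):
    """Keep each user's first edit of each page, sorted by time stamp."""
    first = {}
    for edit in edits:
        key = (edit.uid, edit.pid)
        if key not in first or edit.ts < first[key].ts:
            first[key] = edit
    return sorted(first.values(), key=lambda edit: edit.ts)
```

The resulting list is the sequence S for Wikipedia, with `mother_pid` playing the role of the label of each element.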
For the non-aggregated analysis we focused on seven randomly chosen editors. Special care was needed to understand whether a selected user was human. In fact, the most active editors of Wikipedia are robots performing minor changes routinely.

Results for non-aggregated data
The analysis performed in the main text, involving the previously described datasets as a whole, is here repeated for some of their selected records. In case of the Gutenberg dataset, we chose texts; in Wikipedia, Last.fm and Delicious, we chose editors, listeners and tagging users respectively.

Heaps' and Zipf's law
The analysis of Heaps' law is displayed in Fig. S7 and shows an asymptotic sublinear power-law behaviour in the case of texts (see Table S1) and a possible linear behavior for Wikipedia editors (see Table S2). In the case of Last.fm and Delicious, the sublinear behavior can still be spotted but the dictionary curves are less smooth than those of Wikipedia and Gutenberg. The reason is that in both Last.fm and Delicious, users may import large blocks of music tracks and website bookmarks from their local storage, thus introducing a sort of discontinuity in time. This discontinuity is obviously less appreciable in figure S8, where we show the frequency-rank distribution of words in selected texts, lyrics for selected listeners using Last.fm, wiki-articles for selected editors in Wikipedia and tags for selected users of Delicious. In fact, the frequency-rank distribution is insensitive to the temporal ordering of the elements, being a global statistical property of the sample. Note how the more inflected ancient Greek language results in a smaller Zipf's exponent than that of English texts and correspondingly in a larger Heaps' exponent (see Table S1). It is also worth noting that the measured exponent β of the Heaps' law in the selected texts does not happen to be the reciprocal of the measured Zipf's exponent α. In the main text we have shown that the frequency-rank curve of the whole Gutenberg corpus displayed two main behaviors with different exponents (an analogous observation was shown in Ref. [18]) so that, when inferring α from texts containing 10^4 ∼ 10^5 distinct words, one tends to underestimate it. The Heaps' law, instead, is already sufficiently sensitive to sample the tail of the distribution so that the measured α and β are such that 1/α > β.

Figure S7: Growth of the number of distinct elements (Heaps' law). Top-left: Selected masterpieces from the Gutenberg dataset (words as elements); Top-right: most active users in Last.fm (lyrics as elements); Bottom-left: selected (human) random editors of Wikipedia with appreciable activity (wiki-articles as elements); Bottom-right: Selected users of Delicious (tags as elements). The linear growth is indicated by the straight line. The discontinuities in both right panels can be ascribed to a data import from other sources (local playlists to Last.fm, local bookmarks to Delicious).
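Both quantities discussed here can be computed from any sequence of elements with a few lines of code. The sketch below is illustrative and the function names `heaps_curve` and `frequency_rank` are hypothetical; it also makes explicit why the frequency-rank distribution does not depend on the ordering of the sequence, while the dictionary growth does.

```python
from collections import Counter


def heaps_curve(sequence):
    """Return D(t), the number of distinct elements among the first t items."""
    seen, curve = set(), []
    for element in sequence:
        seen.add(element)
        curve.append(len(seen))
    return curve


def frequency_rank(sequence):
    """Return the occurrence counts sorted from rank 1 (most frequent) down."""
    return sorted(Counter(sequence).values(), reverse=True)


growth = heaps_curve("abacab")    # one value of D per position
ranked = frequency_rank("abacab")  # counts of a, b and c in rank order
```

Reordering the sequence (for example a block import that places many new elements together) changes `heaps_curve` but leaves `frequency_rank` unchanged.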
It is interesting to observe that the asymptotic validity of the relation between the Zipf's and Heaps' exponents is also captured by our model with semantic triggering. Figs. S1 and S6 display the asymptotic correspondence β = 1/α along with the existence of at least another regime at lower ranks, whose extension depends on the combination of parameters ν, ρ and η.
Another feature is worth mentioning. By looking at Fig. S7 we find that the growth of the number of distinct articles edited in Wikipedia by users is linear. Our Polya's urn model accounts for this possibility as well, by predicting a connection between the Zipf's exponent and the slope of the linear dictionary growth.

Figure S8: Frequency-rank distribution (Zipf's law). Top-left: Selected masterpieces from the Gutenberg dataset (words as elements); Top-right: most active users in Last.fm (lyrics as elements); Bottom-left: selected (human) random editors of Wikipedia with appreciable activity (wiki-articles as elements); Bottom-right: Selected users of Delicious (tags as elements). The straight line shows the strict Zipf's law with α = 1 as a guide for the eye.

Triggering events
To detect whether in a sequence there is a triggering mechanism in play, we make use of the definition of entropy (see Eq. 2 of the main text) and look at the distribution of time intervals between elements of the same class (see Section Methods in the main text).
For example, when listening to a certain lyric of a given artist, we could be tempted to listen to other lyrics of hers. In that case, the occurrences of the lyrics' artist will be more clustered in the sequence than in an uncorrelated Poissonian process. At the same time, we expect that the distribution of time intervals between the lyrics of the same artist will be more biased toward small time intervals than in a Poissonian process. In the case of lyrics, the class of elements is given by their artist, in Wikipedia by the wiki-article (mother page) that first linked to a new wiki-page, while in texts we considered each word as bearing its own class, lacking a satisfactory classification of words in semantic areas.
In order to distinguish sequences ruled by a random Poissonian process from sequences featuring triggering events, we show in figures S9 and S10 the entropy and interval distribution curves of selected Last.fm listeners and wiki editors (red dots), together with the corresponding randomly shuffled sequences (blue dots) and the locally shuffled sequences (green dots); the corresponding results for Gutenberg texts were already reported in the main text. The locally shuffled sequences are obtained by shuffling the subsequence that goes from the element following the first occurrence of a given element to the end. These figures confirm that at the user level one obtains the same results as for the whole datasets. In particular, the drop of the entropy around the value of 10 in the three selected Last.fm listeners can be a consequence of the typical number of songs in an album: a user who listens to one song of an album tends to browse all of it, so that a dozen songs by the same artist appear heavily clustered at short times, thus lowering the associated entropy value.
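The interval statistics and the two reshuffling procedures can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the sequence is given as a list of class labels, and the local reshuffle is written for a single chosen label as described above; the entropy of Eq. 2 of the main text is not reproduced here.

```python
import random
from collections import Counter


def label_intervals(labels):
    """Histogram of gaps between successive occurrences of the same label."""
    last_seen, histogram = {}, Counter()
    for t, lab in enumerate(labels):
        if lab in last_seen:
            histogram[t - last_seen[lab]] += 1
        last_seen[lab] = t
    return histogram


def global_shuffle(labels, rng):
    """Shuffle the whole sequence, destroying all temporal correlations."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def local_shuffle(labels, target, rng):
    """Shuffle only the part that follows the first occurrence of target."""
    labels = list(labels)
    first = labels.index(target)
    tail = labels[first + 1:]
    rng.shuffle(tail)
    return labels[:first + 1] + tail
```

Comparing `label_intervals` on the actual sequence with the same histogram on the two reshuffled versions shows whether short intervals are over-represented.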
The interest of looking at triggering events on single books (we already reported about individual texts of the Gutenberg corpus in the main text), or considering a single contributor of Wikipedia or a single Last.fm user is to investigate the nature of the correlations observed in the whole databases. In particular, the question is whether the statistical signatures we detected emerge as an effect of a collective process or are present also at the single user level. The results reported in figures S9 and S10 show that the adjacent possible mechanism plays a role also on the individual level, and its effect is enhanced in collective processes.