Introduction

Innovation pervades a wide variety of human activities and natural processes, from artistic and technological production to the emergence of new behaviours or genomic variants. At the same time, the encounter with novelty permeates our daily lives more extensively than we typically realise. We continuously meet new people, learn and incorporate new words into our lexicon, listen to new songs, and embrace new technologies. Although innovation and novelties (i.e., new elements at the individual or local level) operate at different scales, we can describe their emergence within the same framework, at least in certain respects1. Shared statistical features, including the well-known Heaps’2, Taylor’s3,4,5,6 and Zipf’s7,8 laws, suggest a common underlying principle governing their emergence. In this respect, an intriguing concept is the expansion into the adjacent possible9. The adjacent possible refers to the set of all the potential innovations or novelties attainable at any given time. When one of these possibilities is realised, the space of the actual enlarges, making additional possibilities achievable and thus expanding the adjacent possible. The processes introduced in ref. 1 provide a mathematical formalisation of these concepts, extending Polya’s urn model10 to accommodate infinitely many colours. They generate sequences of items exhibiting Heaps’, Zipf’s, and Taylor’s laws. The most general formulation of the modelling scheme proposed in ref. 1, the urn model with semantic triggering, also captures correlations in the occurrences of novelties, as observed in real-world systems. Further generalisations have been explored to capture the empirical phenomenology in diverse contexts: network growth and evolution11, the varied destinies of different innovations12, and mutually influencing events13. Additionally, the proposed modelling scheme can be cast within the framework of random walks on graphs, offering further intriguing perspectives and broadening its scope of applications14,15,16,17.

We now want to address the question of whether these generative models can also be successfully used in inference problems. This question is further motivated by the precise connection that has been established5,6 between the urn models in ref. 1 and seminal processes in Bayesian nonparametrics. The latter is a powerful tool for inference and prediction in innovation systems, where possible states or realisations are not predefined and fixed once and for all. Nonparametric Bayesian inference enables us to assign probabilities to unseen events and to deal with an ever-increasing number of new possibilities. Various applications have been proposed in diverse fields, including (but not limited to) the estimation of diversity18,19,20,21,22, classification problems23,24, and Bayesian modelling of complex networks25,26, and these techniques play a considerable role in Natural Language Processing27,28.

The simplest model described in ref. 1, the urn model with triggering (UMT), reproduces, with a specific parameter setting, the conditional probabilities that define the two-parameter Poisson-Dirichlet process29, referred to as PD hereafter, which generalises the Dirichlet process30. The PD and Dirichlet processes have gained special relevance as priors in Bayesian nonparametrics due to their generality and tractability31; moreover, the PD process reproduces Heaps’, Zipf’s and Taylor’s laws, which makes it particularly convenient in linguistically motivated problems.

Here, we aim to explore the potential of the outlined connection between urn models for innovation and priors for Bayesian nonparametric inference. As a sample application, we address the authorship attribution task32.

The PD and Dirichlet processes have already been considered as underlying models for natural language processing and for authorship attribution purposes. The proposed procedures interpret the outputs of PD (or Dirichlet) processes as sequences of identifiers for distributions over words (i.e., topics)33 and measure similarity among texts or authors based on the similarity of their topics34,35. We briefly discuss topic models in the Methods section. It is worth stressing here that these approaches have led to hierarchical formulations that require efficient sampling algorithms to compute posterior probabilities28,33,36,37. Moreover, these methods strongly rely on exchangeability, mainly because of the conditional independence it implies, through the de Finetti and Kingman theorems38,39, and because it guarantees the feasibility of the Gibbs sampling procedure27,28. Exchangeability refers to the property that the joint probability of a sequence of random variables is invariant under permutations of its elements. Notwithstanding the powerful tools it provides, this assumption is often unrealistic when modelling real-world processes.

We take a different perspective by interpreting the outputs of the underlying stochastic processes directly as sequences of words in texts or, more generally, tokens. Language serves as a paradigmatic example where novelty enters at different scales, ranging from true innovation (the creation and diffusion of new words or meanings) to what we denote as novelties (the first time an individual adopts or encounters, or an author uses in their production, a word or expression). We thus borrow from information theory40,41 the conceptualisation of a text as an instance of a stochastic process and consider urn models for innovation processes as underlying generative models. Specifically, here we consider the UMT model in its exchangeable version, which is equivalent to the PD process. We opt out of a fully Bayesian approach and use a heuristic method to determine the base distribution of the process, that is, the prior distribution of the items expected to appear in the sequence.

The overall change in perspective we adopt allows us to avoid the Monte Carlo sampling required in hierarchical methods. Moreover, while we consider here an exchangeable model, exchangeability is not crucial in our approach, paving the way for an urn-based inferential method that considers time-dependent correlations among items.

When comparing our method to various approaches used in authorship attribution tasks, we find promising results across different datasets (ranging from literary texts to blogs and emails), demonstrating that the method can scale to large, imbalanced datasets and remains robust to language variation.

Results

The authorship attribution task

To demonstrate a possible application of the UMT generative model to an inference problem, we use the probabilities of token sequences derived from the process to infer the authorship of texts. In the authorship attribution task, one is presented with a set of texts with known attribution – the reference corpus – along with a text T from an unknown author. The goal is to attribute T to one of the authors represented in the corpus (closed attribution task) or, more generally, to recognise the author as one of those represented in the corpus or possibly as a new, unidentified author (open attribution task)42. Here we explicitly consider the closed attribution task, although several strategies can be adopted to apply the method to open attribution problems as well.

Following the framework of Information Theory40,41, we can think of an author as a stochastic source generating sequences of characters. In particular, a written text is regarded as a sequence of symbols, which can be dictionary words or, more generally, short strings of characters (e.g., n-grams if such strings have a fixed length n), with each symbol appearing multiple times throughout the sequence. Each symbol constitutes a novelty the first time it is introduced.

We evaluate the similarity between two symbolic sequences by computing the probability that they are part of a single realisation from the same source. More explicitly, let \({x}_{1}^{n}\) and \({x}_{2}^{m}\) be two symbolic sequences of length n and m, respectively. Given their generative process (their source), we can compute the conditional probability \(P({x}_{1}^{n}| {x}_{2}^{m})\), that is, the probability that \({x}_{1}^{n}\) is the continuation of \({x}_{2}^{m}\). In the authorship attribution task, the anonymous text T is represented by a symbolic sequence xT, while an author A is represented by the symbolic sequence xA obtained by concatenating the texts of A in the reference corpus. It is worth noting that an author A affects the probability of T both by defining the source and through the sequence xA. We will use the notation P(T|A) ≡ PA(xT|xA) for the conditional probability of T continuing the production of A. The anonymous text T is attributed to the author \(\tilde{A}\) that maximises this conditional probability: \(\tilde{A}={{\rm{argmax}}}_{A}\,P(T| A)\). We thus need to specify the processes generating the texts and the elements xi of the symbolic sequences, i.e., the tokens.

The tokens

We can make several choices for defining the variables, or tokens, xi. In what follows, we consider two alternatives. First, we consider Overlapping Space-Free N-grams43 (OSF): strings of characters of fixed length N that include spaces only as their first or last characters, thereby discarding words shorter than N−2. This choice has often yielded the best results. Second, we explore a hybrid approach that exploits the structures captured by the Lempel–Ziv compression algorithm (LZ77)44. We define LZ77 sequence tokens as the repeated sequences extracted through a modified version of the Lempel–Ziv algorithm, which has previously been used for attribution purposes45. For each dataset, we select the token specification that provides the best performance. In the Supplementary Results, we compare the accuracy achieved when using the token definitions discussed above as well as when using simple dictionary words as tokens.
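As an illustration, one way to extract such tokens is sketched below in Python (the function name and the example are ours, not the reference OSF implementation43): sliding character n-grams are kept only when they contain no internal spaces.

```python
def osf_ngrams(text, n):
    """Sketch of Overlapping Space-Free n-gram extraction: sliding character
    n-grams that may contain a space only as their first or last character,
    so that words shorter than n-2 characters never produce a token."""
    tokens = []
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if " " not in gram[1:-1]:  # discard grams with internal spaces
            tokens.append(gram)
    return tokens
```

With n = 4, for instance, this sketch maps "the cat sat" to the tokens "the ", " cat", "cat " and " sat".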

The generative process and the posterior probabilities

We consider the UMT model in its exchangeable version, which provides an urn representation of the PD process. The latter is defined by the conditional probabilities of drawing at time t + 1 an old (already seen) element y and a new one (not seen until time t). They are given, respectively, by:

$$P({x}_{t+1}=y\,|\,{x}^{t})=\frac{{n}_{y,t}-\alpha }{\theta +t}\,,\qquad {{\mbox{if}}}\ {n}_{y,t} > 0\\ P({x}_{t+1}=y\,|\,{x}^{t})=\frac{\theta +\alpha {D}_{t}}{\theta +t}\,{P}_{0}(y)\,,\qquad {{\mbox{if}}}\ {n}_{y,t}=0$$
(1)

where ny,t is the number of elements of type y at time t and Dt is the total number of distinct types appearing in xt; 0 < α < 1 and θ > − α are two real-valued parameters and P0() is a given distribution on the variables’ space, called the base distribution. The UMT model does not explicitly define the prior probability for the items’ identity, i.e., the base distribution P0. The latter can be independently defined on top of the process, in the same way as for the Chinese restaurant representation of the Dirichlet or PD processes46 (please refer to section UMT and PD processes in the Methods for a thorough discussion on the urn models for innovation and their relation with the PD process).
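For concreteness, the conditional probabilities in Eq. (1) can be simulated sequentially. The sketch below (in Python; names are ours, and `draw_from_base` stands in for a draw from a non-atomic P0, returning a fresh value each time) generates a sequence from the process:

```python
import random

def sample_pd_sequence(t_max, alpha, theta, draw_from_base, seed=0):
    """Sequentially sample a sequence from the conditional probabilities of
    Eq. (1), assuming a non-atomic base distribution P0."""
    rng = random.Random(seed)
    sequence, counts = [], {}
    for t in range(t_max):
        d_t = len(counts)  # number of distinct types seen so far
        if rng.random() < (theta + alpha * d_t) / (theta + t):
            y = draw_from_base()  # novelty, prob. (theta + alpha*D_t)/(theta + t)
        else:
            # old type y, each with prob. (n_{y,t} - alpha)/(theta + t)
            types = list(counts)
            y = rng.choices(types, weights=[counts[x] - alpha for x in types])[0]
        counts[y] = counts.get(y, 0) + 1
        sequence.append(y)
    return sequence
```

For example, passing `itertools.count().__next__` as `draw_from_base` labels each novelty with a fresh integer.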

Crucially, Eqs. (1) are only valid when P0 is non-atomic, which implies that each new token can be drawn from P0 at most once with probability one. Conversely, when P0 is a discrete probability distribution (it has atoms), an already seen value y can be drawn again from it, and the conditional probabilities no longer have the simple form shown in Eq. (1) (as detailed in the Methods). In language processing problems, the tokens are naturally embedded in a discrete space, which has led to the development of hierarchical formulations of the PD process47,48. In these approaches, P0 is the (almost surely) discrete outcome of another PD process with a non-atomic base distribution. Here we follow a different approach. We regard P0 as a prior probability on the space of new possibilities. In this view, the tokens take values from an uncountable set, and thus the probability of drawing the same token y from P0 more than once is null. As a consequence, we can use the simple form of Eq. (1), at the cost of some arbitrary choices in the actual definition of the base distribution. In the following, we identify P0(y) with the frequency of y in each dataset, while still treating P0 as a non-atomic distribution by ensuring that each item can be drawn at most once from it. However, this raises a normalisation issue that depends strongly on the dataset and arbitrarily modulates the relative weight of innovations and repetitions. We address this problem heuristically by introducing an additional parameter δ > 0 that multiplies P0: it suppresses (δ < 1) or enhances (δ > 1) the probability of introducing a novelty in T. In addition, we consider an author-dependent base distribution by discounting the vocabulary already appearing in A (details are given in the section The strategy for P0 in the Methods). To summarise, the conditional probabilities P(T|A) are derived from Eqs. (1), with the base distribution P0(y) defined as discussed above. Different values of α and θ characterise the specific distribution associated with each author. We fix αA and θA for each author A to the values that maximise that author’s likelihood (refer to the Supplementary Methods for details). We denote by DK (with K = A, T) the number of types (i.e., distinct tokens) in A and T, and by DT∪A − DA the number of types in T that do not appear in A. The conditional probability of a text T being the continuation of the production of an author A reads:

$$P(T\,|\,A)=\frac{{({\theta }_{A}+{\alpha }_{A}{D}_{A}\,|\,{\alpha }_{A})}_{{D}_{T\cup A}-{D}_{A}}}{{({\theta }_{A}+m)}_{n}}\,\mathop{\prod }_{j=1}^{{D}_{T}}{Q}_{j}\,,\\ {Q}_{j}\equiv \left\{\begin{array}{ll}{(1-{\alpha }_{A})}_{{n}_{j}^{T}-1}\,{P}_{0}({y}_{j})\quad &{{\mbox{if}}}\ {y}_{j}\notin A\\ {({n}_{j}^{A}-{\alpha }_{A})}_{{n}_{j}^{T}}\quad &{{\mbox{otherwise}}}\end{array}\right.$$
(2)

where \({n}_{j}^{K}\) is the number of occurrences of yj in K (with K = A, T), such that \({\sum }_{j}{n}_{j}^{A}=m\) and \({\sum }_{j}{n}_{j}^{T}=n\). The Pochhammer symbol and the Pochhammer symbol with increment k are defined respectively by (z)n ≡ z(z + 1)…(z + n − 1) = Γ(z + n)/Γ(z) and \({(z| k)}_{n}\equiv z(z+k)\ldots \left(z+(n-1)k\right)\).
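In practice, Eq. (2) is conveniently evaluated in logarithmic form via log-Gamma functions. The sketch below (in Python; function and argument names are ours, and the optional δ rescaling of P0 follows the heuristic described above) computes log P(T|A) from the token multisets of T and A:

```python
import math
from collections import Counter

def log_pochhammer(z, n):
    """log (z)_n = log Gamma(z + n) - log Gamma(z)."""
    return math.lgamma(z + n) - math.lgamma(z)

def log_pochhammer_incr(z, k, n):
    """log (z|k)_n = sum_{i=0}^{n-1} log(z + i*k)."""
    return sum(math.log(z + i * k) for i in range(n))

def log_prob_continuation(tokens_T, tokens_A, alpha, theta, base_prob, delta=1.0):
    """Log of Eq. (2): probability that text T continues author A's production.
    `base_prob(y)` is the (possibly author-discounted) base probability P0(y);
    `delta` rescales P0 as described in the text."""
    counts_A, counts_T = Counter(tokens_A), Counter(tokens_T)
    m, n = sum(counts_A.values()), sum(counts_T.values())
    new_types = [y for y in counts_T if y not in counts_A]  # D_{T u A} - D_A of them
    logp = log_pochhammer_incr(theta + alpha * len(counts_A), alpha, len(new_types))
    logp -= log_pochhammer(theta + m, n)
    for y, n_T in counts_T.items():
        if y in counts_A:
            logp += log_pochhammer(counts_A[y] - alpha, n_T)
        else:
            logp += log_pochhammer(1.0 - alpha, n_T - 1) + math.log(delta * base_prob(y))
    return logp
```

Comparing this quantity across candidate authors then yields the attribution \(\tilde{A}\).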

In practice, when attributing the unknown text, we divide it into fragments and evaluate their conditional probabilities separately. The entire document is then attributed either to the author that maximises the probability of the largest number of fragments or to the author that maximises the whole-document probability, computed as a joint distribution over independent fragments (i.e., as the product of the probabilities of its fragments). We optimise this choice for each specific dataset, as described in the Supplementary Methods.
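The two attribution criteria can be sketched as follows (in Python; names are illustrative, and `log_prob(fragment, author_tokens)` stands for a routine such as the one sketched above):

```python
from math import fsum

def attribute_document(fragments, authors, log_prob, mode="product"):
    """Closed-set attribution of a document split into fragments.
    `authors` maps an author label to their concatenated token sequence."""
    if mode == "product":
        # independent fragments: maximise the sum of log-probabilities
        scores = {a: fsum(log_prob(f, toks) for f in fragments)
                  for a, toks in authors.items()}
        return max(scores, key=scores.get)
    # otherwise, attribute each fragment separately and take a majority vote
    votes = [max(authors, key=lambda a: log_prob(f, authors[a])) for f in fragments]
    return max(set(votes), key=votes.count)
```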

Results

We test our approach on literary and informal corpora. To probe the robustness of our method to language variation49, we consider three corpora of literary texts in three different languages, English, Italian, and Polish, which belong to distinct Indo-European families and exhibit different degrees of inflection (refer to the Supplementary Note 1 for details). We further consider informal corpora, mainly composed of English texts. These are particularly challenging for the attribution task because of the strong imbalance in the number of samples per author and in the texts’ lengths (refer to Fig. 1, panel a). In particular, we consider an email corpus and a blog corpus. The first is part of the Enron Email corpus proposed during the PAN’11 contest50. It is still used as a valuable benchmark, and we compare the accuracy of our method with those reported in refs. 34,35. The Blog corpus is one of the largest datasets used to test methods for authorship attribution51: a collection of 678,161 blog posts by 19,320 authors taken from ref. 52. Additionally, in line with refs. 53,54, we test our method on the subset of the 1000 most prolific authors of this corpus. For more details on the corpora, please refer to the Supplementary Note 1.

Fig. 1: Corpora sizes and the impact of model parameters on attribution accuracy.

In panel (a) we offer a pictorial view of various characteristics related to the size of the considered corpora. The size of each triangle is proportional to the logarithm of the corpus size, measured as the number of documents. On the x and y axes we represent, for each corpus, the distributions of the number of texts (x axis) and of the number of characters (y axis) per author. Specifically, the continuous bars represent the interquartile range of the distributions, and the dotted lines show the 95% interval, to highlight their long tails. Panels (b–f) report the attribution accuracy as the length of the fragments and the value of δ vary. The colour scale refers to the difference relative to the maximum attribution accuracy obtained in each dataset. In the upper band, the fragment length is a single token. In the lower band, the text is not partitioned into fragments (full text).

In Fig. 1b–f, we illustrate how the attribution accuracy depends on two free parameters of our model, namely the normalisation δ and the length of the fragments into which we partition the text to be attributed. In particular, we report the accuracy achieved on each dataset in a leave-one-out experiment, where we select each text in turn and attribute it by training the model on the rest of the corpus (refer to the Supplementary Methods for more details). We note that, although simply setting δ = 1 often gives the most or nearly the most accurate results, in a few datasets a different value of δ significantly improves the accuracy. Indeed, one effect of δ is to compensate for a non-optimal choice of fragment length, as is evident in the English literary dataset. When attributing an anonymous text, we optimise these two parameters (as well as the selection of P0, the definition of the tokens, and the strategy to attribute the whole document from the likelihood of single fragments) on the training and validation sets, as detailed in the Supplementary Methods.

In the case of informal corpora, we compare our method with state-of-the-art methods in the family of topic models33. Topic models are among the most established applications of nonparametric Bayesian techniques in natural language processing, and several authorship attribution methods rely on this approach. The underlying idea is to consider each document as a mixture of topics and to compute the similarity between two documents in terms of a measure of overlap between their topic distributions (as detailed in the Methods section and the Supplementary Methods). These methods were proposed to address challenging situations, particularly informal corpora with many reference authors and typically short texts. Moreover, they share common ground with the method we propose. We consider the Latent Dirichlet Allocation plus Hellinger distance (LDA-H)53, the Disjoint Author-Document Topic model in its Probabilistic version (DADT-P)34, and the Topic Drift Model (TDM)35, since their performance on the informal corpora is available. LDA-H is a straightforward application of topic models to the authorship attribution task. The DADT-P algorithm is a generalisation of LDA-H that characterises both the topics associated with texts and those associated with authors. TDM merges topic models with machine learning methods55,56 to account for dynamical correlations between words.
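For reference, the overlap measure used by LDA-H is the Hellinger distance between topic distributions; a minimal sketch (assuming NumPy arrays of topic probabilities that sum to one) is:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions, e.g. the
    inferred topic mixtures of an unknown text and of a candidate author."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```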

For the literary corpora, no direct comparison is available in the literature. In the family of topic models, we considered the LDA-H approach, whose implementation is available and required only minor intervention (please refer to the Supplementary Methods for details on our implementation). In addition, we consider a cross-entropy (CE) approach57,58 in the implementation used in previous research45. Compression-based methods are general and powerful tools for assessing similarity between symbolic sequences and have long been at the forefront of authorship attribution59.

When comparing the aforementioned methods and ours, we optimise the free parameters of our model (i.e., δ, the length of fragments, the attribution criterion, the type of tokens, and P0) on the training set, as detailed in the Supplementary Methods. The email corpus already provides training, validation, and test sets. For the remaining corpora, we use ten-fold stratified cross-validation34,53,54: in turn, one-tenth of the dataset is treated as the test set and the other nine-tenths as training, with the number of samples per author kept constant across the different folds. In Fig. 2, we report the accuracy obtained on each of the ten partitions, as well as the average value over them. We show the results obtained by either switching off the parameter δ (that is, fixing it to 1) or optimising it on each specific corpus. The first scenario is denoted by CP2D (Constrained Probability 2-parameters Poisson-Dirichlet), the latter by δ-CP2D. The second procedure yields better performance in all the datasets except the Polish literary dataset, where the number of texts per author is too low to prevent overfitting in this simple training setting. In the literary corpora, the attribution accuracy is overall high, and that of our method is consistently higher than that of the other techniques. In the informal corpora, our method achieves an accuracy slightly lower than the best-performing algorithm on the email corpus, while it is the most accurate on the blog corpus. The latter corpus presents a very large number of candidate authors, and our approach proved more robust in these extreme conditions. In Table 1, we present the numerical values of the average accuracy over the ten partitions shown in Fig. 2 (additional evaluation metrics can be found in the Supplementary Results). We also report the attribution accuracy on the training set. Among the literary corpora, only in the Polish dataset is the accuracy on the test set significantly lower than that on the training set, pointing to overfitting, as discussed above. For the informal corpora, we conversely notice an increase in attribution rate from the training to the test corpora. For the email corpus, other methods also exhibit a similar behaviour34,54, which is probably related to the particular partition considered. For the Blog corpus, the attribution accuracy on the test set is not available for the other methods. Our method features slightly higher accuracy on the test set than on the training set, suggesting that, on the one hand, the corpus is sufficiently large to prevent overfitting and that, on the other hand, the method gains accuracy as the reference authors’ sequences grow longer.

Fig. 2: Attribution accuracy.

For each of the considered datasets and attribution methods, thick lines show the average accuracy in the ten-fold stratified cross-validation experiment, while shaded circles refer to the attribution accuracy on each of the ten test sets separately. An exception is the E-mail dataset, where a unique test set is considered (see main text). We compare the accuracy achieved by our method (the Constrained Probability 2-parameters Poisson-Dirichlet, in its two versions without and with the parameter δ: CP2D and δ-CP2D) with the Cross-Entropy based approach (CE), the Latent Dirichlet Allocation plus Hellinger distance (LDA-H), the Disjoint Author-Document Topic model in its Probabilistic formulation (DADT-P), and the Topic Drift Model (TDM). On the literary corpora, the LDA-H accuracy is computed using our implementation; please refer to the Supplementary Methods for details. For the informal corpora, the results are available from a previous study [ref. 54, Table 1]. Results for the DADT-P and the TDM algorithms were available in the works by Seroussi et al.34 [Tables 4 and 5] and Yang et al.35 [Table 1], respectively.

Table 1 Attribution results

Conclusion

We present a method for authorship attribution based on urn models for innovation processes. We interpret texts as instances of stochastic processes, where the generative stochastic process represents the author. The attribution relies on the posterior probability of the anonymous text being generated by a particular author and continuing their production. We consider the UMT model1 in its exchangeable version5,6, which is equivalent to the two-parameter Poisson-Dirichlet process. While the latter process is widely used in Bayesian nonparametric inference, it is often employed in a hierarchical formulation. In the case of attribution tasks, this approach has led to topic models, where the output of the stochastic process is a sequence of topics, i.e., distributions over words. Here, we follow a more direct approach, in which the stochastic process directly generates words. By relying on a heuristic approach, we can explicitly write posterior probabilities that can be computed exactly. Besides its computational convenience, the method we propose is easily adaptable to incorporate more realistic models for innovation processes.

For instance, one avenue we intend to explore in future research is leveraging the urn model with semantic triggering1.

We evaluate the performance of our approach by employing the simple UMT exchangeable model against various related approaches in the field. Specifically, we compare it with information theory-based methods45,57,58 and probabilistic methods based on topic models34,35. Our method achieves overall better or comparable performance in datasets with diverse characteristics, ranging from literary texts in different languages to informal texts.

We acknowledge that our method may not compete with deep learning-based models (DL) when large pre-training datasets are available60,61. Nonetheless, it exhibits robustness in challenging situations for DL, for example, when only a few texts are available for many authors61 or in languages where pre-training is less extensive62. A deeper comparison with deep learning-based approaches, perhaps by concurrently exploring more sophisticated urn models in our approach, is in order but beyond the scope of the present work (refer to the Supplementary Results for a more detailed discussion and a preliminary analysis).

As a final remark, we also note that we have here considered so-called closed-set attribution32, where the training set contains part of the production of the author of the anonymous text. In open-set attribution63,64, the anonymous text may be by an author for whom no other samples are available in the dataset. Despite the conceptual differences and nuances between the two tasks, approaches based on closed-set attribution64 are sometimes also used in open-set problems, for instance by assigning the text to an unknown author if a measure of confidence falls below a given threshold. Similar strategies can be employed with our method by leveraging the conditional probabilities of documents.

We finally note that the method presented here is highly general and can be valuable beyond authorship attribution tasks. Although we expect it to be particularly suitable when elements take values from an open set and follow an empirical distribution close to that produced by the model, it can be applied to assess the similarity between any class of symbolic sequences.

Methods

UMT and PD processes

In ref. 1, a family of urn models with infinitely many colours was proposed to reproduce shared statistical properties observed in real-world systems featuring innovations. In this context, a realisation of the process is a sequence xt = x1, …, xt of extractions of coloured balls, where xt is the colour of the element drawn at time t, and the space of colours available at a given time – the urn – represents the adjacent possible space. The urn model with triggering (UMT)1 (treated in a more general setting in refs. 5,6) operates as follows: the system evolves by drawing items from an urn initially containing a finite number N0 of balls of distinct colours. At each time step t, a ball is randomly selected from the urn, its colour is registered in the sequence, and the ball is returned to the urn. If the colour of the drawn ball is not in the sequence \({x}^{t}\), then \(\tilde{\rho }\) balls of the same colour and ν + 1 balls of entirely new colours, i.e., not yet present in the urn, are added to the urn. Thus, the occurrence of new events facilitates further ones by enlarging the set of potential novelties. Conversely, if the colour of the drawn ball already appears in xt, ρ balls of the same colour are added to the urn. Given the history of extractions xt, the probabilities bt and qc,t that the drawing at time t results in a new colour or yields a colour c already present in xt are easily specified for this model:

$${b}_{t} = \frac{{N}_{0}+\nu {D}_{t}}{{N}_{0}+\rho t+a{D}_{t}}\\ {q}_{c,t} = \frac{\rho {n}_{c,t}+a-\nu }{{N}_{0}+\rho t+a{D}_{t}}$$
(3)

where Dt and nc,t are, respectively, the number of distinct colours and the number of extractions of colour c in the sequence xt, and \(a=\tilde{\rho }-\rho +\nu +1\). Different choices of the parameters \((\rho ,\tilde{\rho },\nu )\) lead to different scenarios, enabling the UMT model to capture the empirical properties summarised by Heaps’, Zipf’s and Taylor’s laws. In the original formulation1, only two values of the parameter \(\tilde{\rho }\) were discussed: \(\tilde{\rho }=\rho \) or \(\tilde{\rho }=0\); the special setting \(\tilde{\rho }=\rho -(\nu +1)\), which makes the model exchangeable, was pointed out later5. We recall that exchangeability refers to the property that the probability of drawing any sequence xt ≡ x1, …, xt of any finite length t does not depend on the order in which the elements occur: P(x1, …, xt) = P(xπ(1), …, xπ(t)) for each permutation π and each sequence length t. In this case, upon a proper redefinition of the parameters, namely ν/ρ ≡ α and N0/ρ ≡ θ, the UMT model reproduces the conditional probabilities associated with the PD process (expressed in Eqs. (1)). We note here that such probabilities include the Dirichlet process as a special case, where α = 0 and Dt grows logarithmically with t. In the framework of urn models, the Dirichlet process finds its counterpart in the Hoppe model65 and in the exchangeable version of the UMT model with the additional choice ν = 0. The PD process is defined for 0 < α < 1 and predicts the asymptotic behaviour \({D}_{t}\sim {t}^{\alpha }\)46. We note that the probabilities in Eqs. (3) coincide, upon renaming the parameters as stated above, with those in Eq. (1).
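A direct simulation of the urn dynamics described above can be sketched as follows (in Python; colour labels, default parameter values and names are illustrative):

```python
import random

def simulate_umt(steps, n0=10, rho=2, rho_tilde=2, nu=1, seed=0):
    """Minimal simulation of the urn model with triggering (UMT).
    Colours are integer labels; drawing is with replacement, so the drawn
    ball is effectively returned to the urn."""
    rng = random.Random(seed)
    urn = list(range(n0))          # N0 balls of distinct colours
    next_colour = n0               # counter used to mint entirely new colours
    seen, sequence = set(), []
    for _ in range(steps):
        colour = rng.choice(urn)
        sequence.append(colour)
        if colour not in seen:     # novelty: reinforce and trigger new colours
            seen.add(colour)
            urn.extend([colour] * rho_tilde)
            urn.extend(range(next_colour, next_colour + nu + 1))
            next_colour += nu + 1
        else:                      # repetition: reinforce only
            urn.extend([colour] * rho)
    return sequence
```

Setting rho_tilde = ρ − (ν + 1) in such a simulation corresponds to the exchangeable case discussed above.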

The strategy for P0

When P0 is a discrete probability distribution (it has atoms), an already seen value y can be drawn again from it, and the conditional probabilities no longer have the simple form of Eq. (1). In this case, the conditional probabilities depend not only on the sequence xt of observable values but also on latent variables indicating, for each element in xt, whether it was drawn from P0 or arose from the reinforcement process66. In particular, we can define, for each type yi (i = 1, …, Dt) in xt, a latent variable λi,t that counts the number of times yi is drawn from the base distribution P0. The probabilities conditioned on the observable sequence xt and on the latent-variable sequence \({\lambda }^{{D}_{t}}\) read:

$$P({x}_{t+1}=y\,|\,{x}^{t},{\lambda }^{{D}_{t}})=\frac{{n}_{y,t}-{\lambda }_{i,t}\,\alpha }{\theta +t}+\frac{\theta +\alpha {\Lambda }_{t}}{\theta +t}\,{P}_{0}(y)\,,\qquad {{\mbox{if}}}\ {n}_{y,t} > 0\\ P({x}_{t+1}=y\,|\,{x}^{t},{\lambda }^{{D}_{t}})=\frac{\theta +\alpha {\Lambda }_{t}}{\theta +t}\,{P}_{0}(y)\,,\qquad {{\mbox{if}}}\ {n}_{y,t}=0$$
(4)

where \({\Lambda }_{t}\equiv {\sum }_{i=1}^{{D}_{t}}{\lambda }_{i,t}\) is the total number of extractions from P0 up to time t. To compute the probabilities conditioned on the observable sequence xt alone, we must integrate out the latent variables. This is an exponentially hard problem, and efficient sampling algorithms33,36,37 have been developed for its approximate solution.

By taking the perspective of the urn model, we investigate the possibility of bypassing the problem by imposing that each element can be extracted only once from P0(), which is equivalent to fixing all the latent variables λi,t = 1 and setting to zero the last term in the first line of Eq. (4).

The latter procedure effectively replaces P0(y) with a history-dependent probability, normalised at each time over all the elements y that have not yet appeared in xt. It reads:

$${P}_{0}^{t}(y)\equiv {P}_{0}(y\,|\,y\notin {x}^{t})=\left\{\begin{array}{ll}\dfrac{{P}_{0}(y)}{1-{\sum }_{\tilde{y}\in {x}^{t}}{P}_{0}(\tilde{y})}\quad &{{\mbox{if}}}\ y\notin {x}^{t}\\ 0\quad &{{\mbox{otherwise}}}\end{array}\right.$$
(5)

where the sum runs over all the elements already drawn at time t. Note that this choice breaks the exchangeability of the process with respect to the order in which novel elements are introduced. In the implementation of our algorithm, we follow an even simpler and faster procedure, which yielded equivalent results. We simply introduce an author-dependent base distribution by considering, for each author A, the frequency of the tokens that do not appear in A. This procedure amounts to replacing P0(y) with \({P}_{0}^{(A)}(y)=\frac{{P}_{0}(y)}{P({A}^{{\mathcal{C}}})}\), where \({A}^{{\mathcal{C}}}\) denotes the set of all distinct tokens that do not appear in A. This author-dependent base distribution proved preferable to simply using the original frequency, especially in datasets with short texts and few samples per author.
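A minimal sketch of this author-dependent base distribution (in Python; names are ours, and the returned function is only ever evaluated on tokens outside A's vocabulary, as in Eq. (2)) is:

```python
def author_base_distribution(global_freq, author_tokens):
    """Global token frequencies renormalised over the tokens that do not
    appear in author A's production: P0^(A)(y) = P0(y) / P(A^C)."""
    vocab_a = set(author_tokens)
    mass_outside = sum(p for tok, p in global_freq.items() if tok not in vocab_a)
    return lambda y: global_freq.get(y, 0.0) / mass_outside
```

Such a function could serve, for instance, as the base probability passed to the likelihood sketch of Eq. (2).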

LDA and topic models

LDA is a generative probabilistic model67 that generates corpora of documents. A document is a finite sequence of words w1, w2, . . . , wN and is represented as a random mixture over latent topics. Each topic corresponds to a categorical probability distribution over the set of all possible words, and topics can be shared by different documents. The total number k of topics is fixed a priori, and to each topic i in each document d is associated a probability θi,d, extracted independently for each document from a k-dimensional Dirichlet distribution D(α1, …, αk). Each document d is generated as follows: first, its length Nd is extracted from a Poisson distribution with a given mean. Then, the document is populated with words using the following procedure: a topic i is extracted with probability θi,d, and a word w is extracted from topic i with the probability associated with it in that topic. The probabilities pi(w) of the words w in topic i are in turn extracted independently from a W-dimensional Dirichlet distribution D(β1, …, βW), where W is the total number of distinct words in the corpus.
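The generative process just described can be sketched as follows (in Python with NumPy; parameter names are illustrative, and symmetric Dirichlet priors are assumed for brevity):

```python
import numpy as np

def generate_lda_corpus(n_docs, k, vocab_size, alpha, beta, mean_len, seed=0):
    """Sketch of the LDA generative process: per-topic word distributions,
    then per-document topic mixtures, lengths, latent topics and words."""
    rng = np.random.default_rng(seed)
    topics = rng.dirichlet(np.full(vocab_size, beta), size=k)  # p_i(w) for each topic i
    corpus = []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(k, alpha))               # topic mixture of the document
        n_d = rng.poisson(mean_len)                            # document length
        z = rng.choice(k, size=n_d, p=theta)                   # latent topic of each word
        words = [int(rng.choice(vocab_size, p=topics[j])) for j in z]
        corpus.append(words)
    return corpus
```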

As in Eq. (4), we can introduce latent variables67, now with a different meaning. To each word wi,d in document d, i = 1, …, Nd, we associate a latent variable λi,d identifying the topic from which wi,d is extracted. The joint distribution of the sequence of words \({w}^{{N}_{d}}\equiv {w}_{1,d},\ldots ,{w}_{{N}_{d},d}\) and latent variables \({\lambda }^{{N}_{d}}\equiv {\lambda }_{1,d},\ldots ,{\lambda }_{{N}_{d},d}\) in a document d thus reads:

$$P({w}^{{N}_{d}},{\lambda }^{{N}_{d}})=\mathop{\prod }_{i=1}^{{N}_{d}}p({w}_{i,d}\,|\,{\lambda }_{i,d})\,p({\lambda }_{i,d})$$
(6)

where \(p({\lambda }_{i,d}=j)\equiv {\theta }_{j,d}\). To compute the posterior probability of the observable sequence \({w}^{{N}_{d}}\), we must integrate out the latent variables. This is an exponentially hard problem, solved with numerical approximation methods such as Markov Chain Monte Carlo algorithms. A more flexible approach is to use the Dirichlet or PD processes instead of the Dirichlet distributions over topics, which allows the number of topics k to remain unspecified a priori.

The probabilities θi,d are the elements of a sequence generated by a Dirichlet or PD process, for each document d. The processes characterising each document share the same discrete base distribution, which is, in turn, generated by a Dirichlet or PD process with a non-atomic P0. Again, efficient sampling algorithms for computing the posterior distributions33,36,37 have been developed in this framework.

In the framework of authorship attribution, methods relying on LDA are more widely adopted than those based on the Dirichlet or PD processes, primarily due to their simplicity and comparable accuracy34.

The procedure followed by the LDA-H algorithm to address the author attribution task is described in the Supplementary Methods.