Introduction

Past records of the frequency of a character, i.e., an allele or a phenotype, up to the present time of observation are often the only source of information from which to infer the strength of selection on that character. Time series of ancient DNA, in particular, are becoming available thanks to modern advances in preparation and sequencing methods1,2. These past records deliver the fluctuating frequency of an allele over time. The nature of these fluctuations is determined by the combined effect of various mechanisms, the simplest of which are natural selection and genetic drift, on which we focus here. While natural selection drives the frequency towards fixation or stabilization, genetic drift caused by a small effective population size works towards the elimination of genetic diversity and, thus, towards fixation of one of the characters or alleles3. As such, if the population size is known, genetic drift is a source of noise that changes the frequency of the alleles and masks the effect of selection.

Natural selection can be theoretically described with relatively simple population genetics models, such as the Moran and the Wright-Fisher models3,4. At the basis of these models, the effect of natural selection is often crystallized in one single parameter per locus, called the selection coefficient. One of the tasks ahead of the analysis of DNA time-series is thus the extrapolation of the underlying selection coefficient. Indeed, the selective advantage of a certain character is quite impossible to determine from first principles, e.g. from an evaluation of metabolic costs and benefits, with the exception perhaps of a few experimentally controlled cases in bacterial populations. But even in bacteria, the advantage of a certain gene compared to another is determined indirectly, mostly by competition experiments and growth rate measurements5.

Various methods, mostly based on maximum likelihood techniques, have been developed to duly take both genetic drift and sampling errors into account6,7,8,9. Several studies have considered the limiting case of determining the selection coefficient in the absence of genetic drift, i.e., with large populations, thus in fact taking a deterministic approach5,10,11,12. The limiting cases that we consider here are a haploid character with two competing alleles and a one-locus two-allele model with selection and codominance. We consider a finite population with perfect sampling. These conditions allow an analytic and precise treatment of the effect of genetic drift.

Setting aside those cases where the population size is too large for genetic drift to play any role, in the general case it is possible that the less advantageous character or allele is present at a larger frequency than a competing but more advantageous character. Nevertheless, we may ask if and when a given time-series of the frequencies is informative of the relative selection strength of the two competing characters. Simple models of population genetics, albeit sometimes not completely realistic, provide a clear platform to derive analytical results that are easy to interpret and generalize. The aim of this work is to introduce a new likelihood function that works for any strength of the selection coefficient and for any value of the frequency, i.e., also for frequencies close to the fixation boundary. Accordingly, in order to understand the potential and the limits of such an analysis we will first work with the Moran model of population genetics, which is one of the most intensively studied and successful metaphors of evolution under selection and drift4,13,14. We will then study the same problem with the one-locus two-allele Wright-Fisher model, which is definitely a more complex and more realistic metaphor of natural selection and drift6.

As we shall see, extracting the selection coefficient even in such a simple set-up is tricky. If one uses the wrong likelihood, apparently meaningful, self-consistent but otherwise incorrect conclusions are produced. The key point will be to understand that single time-series of processes that are per se non-stationary need to be treated as stochastic processes conditioned both in the initial and the final condition.

Results

The models that we consider have two types of alleles, A and B. In the Moran model we will have haploid individuals carrying the alleles of type A and B. In the Wright-Fisher model we will have diploid individuals carrying a pair of alleles of types A and B in one autosomal locus. Although these two models differ in structure and complexity, it is still possible to provide a common description of the underlying process of selection and drift. We start by considering a population of N alleles. To allow for a common treatment of both models we will assume that N is an even number. At any point in time, NA and NB are the number of alleles of type A and B in the population, respectively and at each time point NA + NB = N holds. We will say that NA and NB are the frequencies of alleles A and B, respectively. Throughout the whole manuscript, we assume that these frequencies can be measured exactly (no sampling errors).

We will follow the fate of the number of alleles of type B whose dynamics will be described as a Markov chain in discrete time with two absorbing states in NB = 0 and NB = N. These two absorbing states correspond to the fixation of allele A and B, respectively.

A single historical trajectory of T + 2 measurements of the frequency NB can be used to estimate the selection coefficient. The trajectory has an initial condition NB(0) = i, followed by T intermediate measurements from strictly consecutive updates and one additional, final measurement at TF. In what follows, while NB(0) is the same for all cases studied here, we consider various options for the timing TF of the final measurement and for the value of the frequency of the alleles of type B at TF, NB(TF). We will also assume that time is measured in generations, even if, strictly speaking, the generations are overlapping in the Moran model and non-overlapping in the Wright-Fisher model. We consider a total of four different limiting cases (Fig. 1). On the one hand, the first two cases are when TF is just one generation after the Tth measurement, i.e., TF = T + 1. Ideally, these first two cases correspond to time series of consecutive generations finishing at present time. Case I is when NB(TF) is at an intermediate frequency, i.e., NB(TF) ≠ 0, N. Case II is when NB(TF) = N, namely when the allele of type B has reached fixation before or at present time. On the other hand, the second two cases are when generation TF is long after generation T, i.e., TF ≫ T. Ideally, this corresponds to trajectories where the initial time t = 0 of the temporal observation is so far back in the past that, even after T generations, the time-series of available data is still far back in the past. Here, generation TF is at present time and NB(TF) is known but the values of NB at times between generations T and TF are missing. We then distinguish between case III, when the present frequency NB(TF) is at any intermediate frequency, i.e., NB(TF) ≠ 0, N, and case IV, where the present frequency is at fixation for the allele of type B, i.e., NB(TF) = N. Obviously, cases III and IV reduce to cases I and/or II when the frequency at present time is ignored. As will become clear later, these cases are definitely different depending on the assumption that one makes for the present state. One can also recognize that case III is the most studied one in the literature so far2,6,15. Since in cases II and IV fixation can occur at any generation including generations t < T, with the Wright-Fisher model we have also considered a variant of these two cases in which NB(TF) = N − 1, i.e., very close to fixation but not yet fixed. These variants do not present substantial differences in the results and are further discussed below.

Figure 1. Four kinds of time series.

Schematic representation of the four cases considered here. The green bars represent available data for T consecutive generations, whereas the blue dashed lines represent non available data. The time arrow goes from left to right with the present time called generation TF. The data includes an initial condition at generation zero. We follow the trajectory of the allele B, whose frequency at present time is known in all cases. In cases I and II, TF is just one generation after the measurement T, i.e., TF = T + 1, so that the available data concern the recent history of the allele. In cases III and IV the measurement TF is made a long time after the measurement T. We can think of the cases III and IV as time-series where both the ancient history and the present frequency are known in detail but data in between are missing. Within the Wright-Fisher model we consider a variant of cases II and IV, in which B is very close to fixation but not yet fixed.

For each one of these four cases we generate 100 independent time-series while keeping the selection coefficient fixed to S = 2, whose meaning is explained below for each of the two models separately. We generate such trajectories via stochastic simulations and then analyze them with the likelihood developed below to test whether we are able to reliably extract the selection coefficient. Within each of the four cases I to IV, all trajectories share the same initial and final conditions NB(0) and NB(TF), respectively, but are otherwise completely independent.

Each trajectory is fully described by the index functional δij(t) ∈ {0, 1} such that

$$\delta_{ij}(t) = \begin{cases} 1 & \text{if } N_B(t) = i \text{ and } N_B(t+1) = j, \\ 0 & \text{otherwise,} \end{cases} \tag{1}$$

namely δij(t) = 1 when a transition from frequency i to frequency j of the number of B alleles occurs at time step t. The index t runs over the measurements, t = 0, 1, 2, …, T. Thanks to this functional, the selection coefficient can be estimated by means of the conditioned likelihood

$$L_c(s) = \prod_{t=0}^{T} \prod_{i,j} \left[ P_{ij|k}(s) \right]^{\delta_{ij}(t)} \tag{2}$$

where lowercase s refers to the estimated value of S and the conditional transition probability is defined as

$$P_{ij|k}(s) = \Pr\{ N_B(t+1) = j \mid N_B(t) = i,\; N_B(T_F) = k \}. \tag{3}$$

The selection coefficient s enters into this definition through the explicit form of the model as will be discussed in detail below and in the Methods section.

Conditioning on the final frequency NB(TF) = k at the end of the trajectory allows one to write explicitly the relationship between the conditioned and the non-conditioned transition probabilities by exploiting the Markov property of the chains, as16

$$\Pr\{N_B(t+1)=j \mid N_B(t)=i,\; N_B(T_F)=k\} = \Pr\{N_B(t+1)=j \mid N_B(t)=i\}\, \frac{\Pr\{N_B(T_F)=k \mid N_B(t+1)=j\}}{\Pr\{N_B(T_F)=k \mid N_B(t)=i\}} \tag{4}$$

which in a shorthand we write as

$$P_{ij|k}(s) = P_{ij}(s)\, \phi_{ij|k}(s) \tag{5}$$

where Pij(s) are the non-conditioned transition probabilities as defined by the model. The functions ϕij|k are instead complex functionals, determined by the Doob’s h-transform, that depend on s, i, j and TF − t (Methods).
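The relationship between the conditioned and the non-conditioned transition probabilities follows in three short steps. The sketch below writes X_t as a shorthand for NB(t); this shorthand is an assumption of the sketch and is not notation used elsewhere in the text.

```latex
\begin{align*}
\Pr\{X_{t+1}=j \mid X_t=i,\, X_{T_F}=k\}
 &= \frac{\Pr\{X_{t+1}=j,\, X_{T_F}=k \mid X_t=i\}}{\Pr\{X_{T_F}=k \mid X_t=i\}} \\
 &= \frac{\Pr\{X_{T_F}=k \mid X_{t+1}=j,\, X_t=i\}\;\Pr\{X_{t+1}=j \mid X_t=i\}}{\Pr\{X_{T_F}=k \mid X_t=i\}} \\
 &= P_{ij}\;\frac{\Pr\{X_{T_F}=k \mid X_{t+1}=j\}}{\Pr\{X_{T_F}=k \mid X_t=i\}} .
\end{align*}
```

The first line is the definition of a conditional probability, the second is the product rule, and the third uses the Markov property to drop the dependence on X_t once X_{t+1} is known. The remaining ratio is the factor ϕij|k. Summing the last line over j gives one by the Chapman-Kolmogorov relation, so the conditioned quantities are again proper transition probabilities.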

If we could ideally access a large number of trajectories collected under the same initial condition but free to cover the available fitness landscape, only the initial condition would matter and the condition on the final state would no longer be necessary. This is the case one encounters in experimental evolution. The estimation of the selection coefficient in those cases should be made by means of the unconditioned likelihood17

$$L(s) = \prod_{i,j} \left[ P_{ij}(s) \right]^{n(\{i,j\})} \tag{6}$$

where n({i, j}) is the number of transitions between each pair of frequencies i and j in the ensemble of trajectories. The number of transitions can be computed through n({i, j}) = Σt δij(t), summed over all trajectories in the ensemble. The likelihood function L(s), and variations thereof that take sampling errors into account, is the most commonly used function to estimate the selection coefficient2. For a correct interpretation of the results presented below it is relevant to note that Lc(s) and L(s) are related through

$$L_c(s) = L(s)\, \Phi(s) \tag{7}$$

where Φ(s) is a complex functional depending on the ϕij|k and on the specific trajectory described by δij(t).

In the following we present a comparison of the estimated value of the selection coefficient from the likelihood Lc(s) and from the likelihood L(s), for the Moran and the Wright-Fisher models. Applied to each single trajectory, both likelihoods allow one to derive the most likely value of s. The variation of the maximum likelihood estimates across the set of 100 time-series for each of the four cases introduced earlier (Fig. 1) gives a distribution from which the average and the 95% confidence interval can be estimated.
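The comparison of the two likelihoods on a single trajectory can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the figures. It assumes the Moran transition probabilities as derived from the dynamics described in the Methods, computes the conditioning factors from the probability of reaching the final state k in the remaining number of steps, and maximizes over a plain grid. The function names, the grid and the short example path are all hypothetical.

```python
import math

def moran_matrix(N, s):
    """Moran transition matrix for the number of B alleles, with s = W_A / W_B."""
    P = [[0.0] * (N + 1) for _ in range(N + 1)]
    P[0][0] = P[N][N] = 1.0                      # absorbing states
    for i in range(1, N):
        pool = s * (N - i) + i                   # offspring pool, in units of W_B
        up = (i / pool) * ((N - i) / N)          # a B offspring replaces an A
        down = (s * (N - i) / pool) * (i / N)    # an A offspring replaces a B
        P[i][i - 1], P[i][i], P[i][i + 1] = down, 1.0 - up - down, up
    return P

def reach_probs(P, k, steps):
    """h[n][j] = probability of being in state k after n more steps, starting in j."""
    size = len(P)
    h = [[1.0 if j == k else 0.0 for j in range(size)]]
    for _ in range(steps):
        prev = h[-1]
        h.append([sum(P[j][m] * prev[m] for m in range(size)) for j in range(size)])
    return h

def log_L(path, P):
    """Unconditioned log-likelihood: sum of log P_ij over the observed transitions."""
    return sum(math.log(P[a][b]) for a, b in zip(path, path[1:]))

def log_Lc(path, P, k, T_F):
    """Conditioned log-likelihood: each transition is weighted by phi_ij|k."""
    h = reach_probs(P, k, T_F)
    total = 0.0
    for t, (a, b) in enumerate(zip(path, path[1:])):
        phi = h[T_F - t - 1][b] / h[T_F - t][a]  # ratio of reaching probabilities
        total += math.log(P[a][b] * phi)
    return total

def argmax_on_grid(f, grid):
    """Return the grid value that maximizes f."""
    return max(grid, key=f)

# Hypothetical example: consecutive observations ending at T_F (cases I and II).
N = 6
path = [3, 3, 4, 4, 5, 4, 4]
T_F, k = len(path) - 1, path[-1]
grid = [0.1 * m for m in range(1, 51)]
s_hat_L = argmax_on_grid(lambda s: log_L(path, moran_matrix(N, s)), grid)
s_hat_Lc = argmax_on_grid(lambda s: log_Lc(path, moran_matrix(N, s), k, T_F), grid)
```

For trajectories of cases III and IV the same functions can be called with a value of T_F larger than the number of observed transitions. A grid search is used here only for transparency; any one-dimensional optimizer could take its place.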

The Moran model

We consider the simplest version of the Moran model4,13,18, which consists of a population of N individuals split into NA individuals carrying the character A and NB individuals carrying the character B. Except for the characters A and B, the individuals are identical. Individuals of type A have fitness WA and individuals of type B have fitness WB. The selection coefficient is S = WA/WB. In the Moran model the generations are overlapping and the dynamics runs as follows. At each time point t, one of the existing individuals reproduces with a probability proportional to its fitness. The resulting offspring is identical to the parent individual and replaces one of the existing individuals chosen at random with uniform probability. At each time step, thus, the number NB of B individuals can increase or decrease by one, or stay the same with probabilities that depend on NB, N and S (Methods). The Moran model is thus a random walk on a line for the number NB, with two absorbing states in 0 and N corresponding to the fixation of the character A and B, respectively.

For this model, the 100 trajectories of type B frequencies for each of the four types (Fig. 1) have a duration of T = 500 generations. The trajectories have been generated by standard methods for conditioned processes16,19 and then both the conditioned likelihood Lc(s) and the non-conditioned likelihood L(s) have been numerically derived as described. Surprisingly, only in case III, i.e., time-series far back in the past with the character not yet fixed at present time, does L(s) also deliver a selection coefficient close to the true one (Fig. 2, grey squares). In all other cases, I, II and IV, L(s) delivers selection coefficients that are quite different from the true one.
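One possible way to generate trajectories with prescribed initial and final states is to sample each step from the conditioned transition probabilities directly. The sketch below is an illustration under that assumption and is not necessarily the procedure of refs 16,19. It uses the Moran probabilities as derived from the dynamics described in the Methods; the function names and the small parameter values are hypothetical.

```python
import random

def moran_matrix(N, s):
    """Moran transition matrix for the number of B alleles, with s = W_A / W_B."""
    P = [[0.0] * (N + 1) for _ in range(N + 1)]
    P[0][0] = P[N][N] = 1.0                      # absorbing states
    for i in range(1, N):
        pool = s * (N - i) + i
        up = (i / pool) * ((N - i) / N)
        down = (s * (N - i) / pool) * (i / N)
        P[i][i - 1], P[i][i], P[i][i + 1] = down, 1.0 - up - down, up
    return P

def reach_probs(P, k, steps):
    """h[n][j] = probability of being in state k after n more steps, starting in j."""
    size = len(P)
    h = [[1.0 if j == k else 0.0 for j in range(size)]]
    for _ in range(steps):
        prev = h[-1]
        h.append([sum(P[j][m] * prev[m] for m in range(size)) for j in range(size)])
    return h

def sample_bridge(P, start, k, T_F, rng):
    """Sample N_B(0..T_F) with N_B(0) = start, conditioned on N_B(T_F) = k."""
    h = reach_probs(P, k, T_F)
    if h[T_F][start] == 0.0:
        raise ValueError("final state cannot be reached from the start in T_F steps")
    path = [start]
    for t in range(T_F):
        i, remaining = path[-1], T_F - t - 1
        # Unnormalized conditioned probabilities P_ij * h_remaining(j);
        # states with zero weight are dropped before sampling.
        cands = [j for j in range(len(P)) if P[i][j] * h[remaining][j] > 0.0]
        weights = [P[i][j] * h[remaining][j] for j in cands]
        path.append(rng.choices(cands, weights=weights)[0])
    return path

# Hypothetical small example: B is disadvantaged (s = 2) yet forced to fix.
rng = random.Random(1)
example = sample_bridge(moran_matrix(8, 2.0), start=4, k=8, T_F=40, rng=rng)
```

Every sampled path ends in k by construction, because at the last step only the state k carries a non-zero weight.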

Figure 2. Selection coefficient for the Moran model.

For each of the four cases, we have generated 100 independent trajectories with S = 2 (dashed horizontal line). For each such trajectory we have constructed the likelihoods Lc(s) and L(s) and found the two values of s that maximize each of them separately. From the distribution of these two sets of maximizing s we obtain the mode and the 95% confidence interval (CI) shown here. The conditioned likelihood Lc(s) always provides a good estimate of the true selection coefficient (red squares). The unconstrained likelihood L(s) delivers a poor estimate of the selection coefficient (grey squares) except for case III, due to the slow dynamics of the Moran model. For each trajectory: T = 500, NB(0) = 27 and N = 54. In cases I and III we have set NB(TF) = 40. In cases II and IV we have set NB(TF) = N.

The Wright-Fisher model for one-locus two alleles

In the Wright-Fisher model we consider an autosomal locus of a diploid organism with two alleles A and B. Reproduction occurs with perfect mixing but population size is fixed to a total of N alleles (corresponding to N/2 individuals). The three possible genotypes have fitness WAA, WBB and WAB. The selection coefficient is S = WAA/WBB with codominance implying WAB/WBB = (1 + S)/2. With these choices and S > 1, in the absence of genetic drift the evolutionary trajectory would deterministically lead to the fixation of the allele A. For finite populations instead, the zygotes of the next generation are sampled from the gametes of the previous generation, in which the frequencies of the alleles A and B have been determined through the evolutionary dynamics. The number NB of alleles of type B in a finite adult population thus changes randomly from one generation to the next as a result of selection and drift (Methods). Also here the number of alleles NB is described as a Markov chain with two absorbing states in 0 and N, corresponding to the fixation of allele A and B, respectively.

This model is numerically more challenging than the Moran model. In particular, the time scale to fixation is shorter than for the Moran model because here the generations are non-overlapping. Here, thus, each trajectory has a duration of T = 100 generations. As for the Moran model, we have generated 100 independent time-series for each of the four types (Fig. 1). Using the transition probabilities of this model and the δij(t), we have numerically derived the conditional likelihood Lc(s) and the unconditioned likelihood L(s). The results are qualitatively similar to the ones for the Moran model (Fig. 3). For type III trajectories, however, the two likelihoods perform differently, with Lc(s) providing a good estimate of the selection coefficient and L(s) a poor estimate. As discussed below, this has to do with the very rapid time scales of the Wright-Fisher model. If T is set to 10 generations instead of 100, the estimate from L(s) becomes closer to the true value. Because of these rapid time scales, it was also convenient to set NB(TF) = N − 1 in cases II and IV, i.e., very close to fixation, in order to have relatively long trajectories.

Figure 3. Selection coefficient for the Wright-Fisher model.

For each of the four cases, we have generated 100 independent trajectories with S = 2 (dashed horizontal line). For each such trajectory we have constructed the likelihoods Lc(s) and L(s) and found the two values of s that maximize each of them separately. From the distribution of these two sets of maximizing s we obtain the mode and the 95% confidence interval (CI) shown here. Here, the rapid dynamics of the Wright-Fisher model leads to very short trajectories in cases II and IV, which in turn lead to poor statistics. For this reason, in cases II and IV we have set NB(TF) = N − 1 (very close to fixation) instead of N. The conditioned likelihood Lc(s) always provides a good estimate of the true selection coefficient (red squares). The unconstrained likelihood L(s) delivers a poor estimate of the selection coefficient (grey squares) with a CI smaller than the box size. For each trajectory: T = 100, NB(0) = 27 and N = 54. In cases I and III we have set NB(TF) = 40.

Methods

In both models considered here, the process NB is a Markov chain in discrete time on the finite state space {0, 1, …, N}. These Markov chains are characterized by the one-step transition probability matrix P, whose elements Pij are independent of time and are defined as

$$P_{ij} = \Pr\{N_B(t+1) = j \mid N_B(t) = i\}. \tag{8}$$

The factors ϕij|k that enter into the definition of the conditioned transition probabilities can be explicitly written by exploiting the definition of conditional probabilities and the Markov property of the process16 as

$$\phi_{ij|k} = \frac{\Pr\{N_B(T_F)=k \mid N_B(t+1)=j\}}{\Pr\{N_B(T_F)=k \mid N_B(t)=i\}} = \frac{\left[P^{\,T_F-t-1}\right]_{jk}}{\left[P^{\,T_F-t}\right]_{ik}} \tag{9}$$

which are non-negative functions dependent explicitly on i, j, k and TF − t for t = 0, 1, 2, …, T. When TF = T + 1, as in cases I and II (Fig. 1), the factors ϕij|k depend explicitly on time and change in such a way as to realize the condition NB(TF). Nevertheless, knowledge of the transition probabilities Pij defined in Eq. (8) allows one to compute all likelihoods through Eq. (9) for any choice of the parameters. When TF ≫ T, as in cases III and IV (Fig. 1), the factors ϕij|k do not depend on time20 and can be computed as the mathematical limit TF → ∞ by exploiting the spectral properties of the transition probability matrix P. When k is a transient state, i.e., k ≠ 0, N, then

$$\phi_{ij|k} = \frac{w_0(j)}{\lambda_0\, w_0(i)} \tag{10}$$

where λ0 is the largest non-trivial eigenvalue of P and w0(i) is the i-th component of the corresponding right eigenvector. When k represents fixation, i.e., k = 0 or N, then16

$$\phi_{ij|k} = \frac{u_{jk}}{u_{ik}} \tag{11}$$

where uik is the probability of absorption in k for a process started in i. Since deciding when TF is sufficiently large to allow using these limit cases may depend on the system15, the definition given in Eq. (9) was used to the limits of numerical precision for large powers.
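The two long-time limits can be motivated by a short spectral argument. The sketch below writes the factor ϕij|k as the ratio of the (n − 1)-step and n-step probabilities of reaching k, with n = TF − t. It assumes that the transition matrix restricted to the transient states has a simple, strictly dominant eigenvalue λ0; the left eigenvector v_0 and the shorthand n are notation introduced here for illustration only.

```latex
\begin{align*}
\left[P^{n}\right]_{jk} &\simeq \lambda_0^{\,n}\, w_0(j)\, v_0(k)
  && (n \to \infty;\; j, k \text{ transient}), \\
\frac{\left[P^{n-1}\right]_{jk}}{\left[P^{n}\right]_{ik}}
  &\longrightarrow \frac{\lambda_0^{\,n-1}\, w_0(j)\, v_0(k)}{\lambda_0^{\,n}\, w_0(i)\, v_0(k)}
   = \frac{w_0(j)}{\lambda_0\, w_0(i)}, \\
\left[P^{n}\right]_{jk} &\longrightarrow u_{jk}
  && (n \to \infty;\; k \in \{0, N\}), \\
\frac{\left[P^{n-1}\right]_{jk}}{\left[P^{n}\right]_{ik}}
  &\longrightarrow \frac{u_{jk}}{u_{ik}} .
\end{align*}
```

In the transient case both the dependence on the final state k and the dependence on time cancel in the ratio, which is why the factors become time-independent when the final measurement lies far in the future.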

For the Moran model, at each generation, each individual of type A produces a number of offspring equal to WA and each individual of type B produces a number of offspring equal to WB. At each generation, just one among the entire pool of NAWA + NBWB offspring is chosen at random. This new individual, then, replaces one randomly chosen individual in the parents’ population. With this dynamics, the population size remains constant but the frequencies NA and NB change with time. Eventually, all individuals will be either of type A or of type B.

We follow the fate of character B. At each generation and before fixation occurs, the frequency NB can increase by one, decrease by one or stay the same. Based on the dynamics described above, the probabilities associated with the changes of NB = i are given by

$$P_i = \Pr\{i \to i+1\} = \frac{i}{S(N-i)+i}\,\frac{N-i}{N}, \qquad Q_i = \Pr\{i \to i-1\} = \frac{S(N-i)}{S(N-i)+i}\,\frac{i}{N}, \qquad R_i = 1 - P_i - Q_i \tag{12}$$

where the selection coefficient S = WA/WB is non-negative and the transition probabilities are independent of time. When 0 ≤ S < 1 the individuals of type B have a selective advantage with respect to individuals of type A (i.e., Pi > Qi) and vice versa when S > 1. The borderline case S = 1 corresponds to neutral evolution. The probabilities Pi, Qi and Ri form the elements of the transition probability matrix

$$P = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ Q_1 & R_1 & P_1 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & Q_{N-1} & R_{N-1} & P_{N-1} \\ 0 & \cdots & 0 & 0 & 1 \end{pmatrix}. \tag{13}$$

The fixation probabilities as functions of the initial frequencies and of the selection coefficient can be computed as absorption probabilities from this matrix4,16,18.
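For the Moran dynamics described above, the ratio between the probability of a decrease and the probability of an increase of NB is equal to S at every interior frequency, which leads to a closed form for the fixation probability of B. The sketch below is an illustration under that observation: it writes the one-step probabilities as derived from the offspring-pool description and checks the closed form against the one-step recursion. The function names are hypothetical.

```python
def moran_steps(N, S, i):
    """Probabilities that N_B = i increases, decreases or stays (S = W_A / W_B)."""
    pool = S * (N - i) + i                  # offspring pool, in units of W_B
    up = (i / pool) * ((N - i) / N)         # a B offspring replaces an A
    down = (S * (N - i) / pool) * (i / N)   # an A offspring replaces a B
    return up, down, 1.0 - up - down

def fixation_prob_B(N, S, i):
    """Probability that B reaches N_B = N starting from N_B = i."""
    if S == 1.0:
        return i / N                        # neutral case
    return (1.0 - S ** i) / (1.0 - S ** N)

def recursion_residual(N, S, i):
    """Residual of u_i = P_i u_{i+1} + Q_i u_{i-1} + R_i u_i for 0 < i < N."""
    up, down, stay = moran_steps(N, S, i)
    u = lambda m: fixation_prob_B(N, S, m)
    return up * u(i + 1) + down * u(i - 1) + stay * u(i) - u(i)

# Parameters of the Moran simulations: S = 2, N = 54, N_B(0) = 27.
u_B = fixation_prob_B(54, 2.0, 27)
```

With these parameters the fixation of B is a rare event, with a probability of roughly 7.5e-9, which illustrates how strong the conditioning on NB(TF) = N is in cases II and IV.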

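For the Wright-Fisher model, let NB(t) = i be the number of alleles of type B in the adult population at generation t. Then, according to the evolutionary dynamics the frequency of the allele B in the successive gamete population is3

$$p'_B(i) = \frac{W_{BB}\, p_B(i)^2 + W_{AB}\, p_A(i)\, p_B(i)}{W_O} \tag{14}$$

where pB(i) = i/N, pA(i) = 1 − pB(i) and WO is the average fitness of the adult population, defined as

$$W_O = W_{AA}\, p_A(i)^2 + 2\, W_{AB}\, p_A(i)\, p_B(i) + W_{BB}\, p_B(i)^2. \tag{15}$$

The frequency of the allele of type B in the new adult population is obtained through random sampling and leads to the transition probabilities

$$P_{ij} = \binom{N}{j} \left[p'_B(i)\right]^{j} \left[1 - p'_B(i)\right]^{N-j} \tag{16}$$

according to the binomial sampling.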
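The construction of the Wright-Fisher transition matrix can be sketched as follows. This is a minimal illustration that assumes the standard selection step on Hardy-Weinberg proportions followed by binomial sampling of N alleles, with fitnesses expressed relative to WBB so that WAA = S and WAB = (1 + S)/2, and S > 0. The function names and this normalization are assumptions of the sketch.

```python
import math

def gamete_freq_B(N, S, i):
    """Frequency of B among gametes produced by adults with N_B = i."""
    w_aa, w_ab, w_bb = S, (1.0 + S) / 2.0, 1.0     # fitnesses relative to W_BB
    p_b = i / N
    p_a = 1.0 - p_b
    w_mean = w_aa * p_a ** 2 + 2.0 * w_ab * p_a * p_b + w_bb * p_b ** 2
    return (w_bb * p_b ** 2 + w_ab * p_a * p_b) / w_mean

def wf_matrix(N, S):
    """Transition matrix P[i][j] for the number of B alleles (binomial sampling)."""
    P = []
    for i in range(N + 1):
        q = gamete_freq_B(N, S, i)
        P.append([math.comb(N, j) * q ** j * (1.0 - q) ** (N - j)
                  for j in range(N + 1)])
    return P

def mean_next(P, i):
    """Expected number of B alleles one generation after N_B = i."""
    return sum(j * p for j, p in enumerate(P[i]))

# Parameters of the Wright-Fisher simulations: S = 2, N = 54.
P_wf = wf_matrix(54, 2.0)
```

Under this form the states 0 and N are absorbing, and for S > 1 the expected number of B alleles decreases from one generation to the next at every interior frequency, in line with the deterministic fixation of A described in the Results.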

Discussion

At first sight, it may seem odd that the correct likelihood should depend on the conditional transition probabilities Pij|k(s). In fact, Lc(s) is computed on one single trajectory of a stochastic process governed by selection and genetic drift. The key point is that single trajectories of a stochastic process should be considered as representative of a bundle of trajectories starting and ending at fixed conditions. Functionals of single trajectories are thus conditioned not only on the initial condition but also on their final condition. When only one realization of ancient DNA variations is available, a special form of conditioning in the future has to be included in order to correctly estimate the selection parameters. Such processes were already studied by Schrödinger21, who recognized the emergence of possible contradictory claims from the observation of diffusion trajectories conditioned in their initial and final positions. In mathematics, this kind of conditioning has been studied in the context of Brownian bridges, namely processes conditioned both at their initial and final value, a precise description of which requires the introduction of the Doob’s h-transform. More recently, the Doob’s h-transform has become an essential theoretical tool to study the statistics of rare events22 and to understand circular arguments in statistical analysis16,23. It was also shown that this transform emerges necessarily when trajectories are selected on the basis of their outcome16.

The likelihood L(s) defined in Eq. (6), based on the transition probabilities Pij given in Eq. (12), is not the one that should be used to extract a parameter like the selection coefficient from one given trajectory. Indeed, L(s) fails in almost all cases to provide a realistic confidence interval. The reason for the failure of this method lies in the fact that a given realization does not reveal whether it is an unlikely event of a process that would otherwise typically behave differently. As a matter of fact, the process behind a given realization is rather more representative of a process conditioned (in probabilistic terms) to end at the frequency observed at its final observation. If one knows, from first principles, the microscopic (molecular) mechanism driving the process under scrutiny, then one can follow the procedure explained in this work, derive the conditional probabilities Pij|k and write the likelihood Lc(s) in terms of these conditioned quantities. This quite obviously provides a good estimate of the selection coefficient. A crucial requirement for the success of this enterprise is the knowledge of the correct model to use.

The use of the unconditioned likelihood L(s) would still give an answer, i.e., a value of s that is apparently consistent with the data. Indeed, case I, which describes a process conditioned on ending at an intermediate state k ≠ 0, N, would lend support to the idea of neutral evolution or balancing selection and in fact, L(s) yields a value of s close to unity. In case II, when k = N instead, the individuals of type B would get fixed in the population and the analysis of such a trajectory by means of the unconditioned likelihood L(s) would lend support to the idea of a selective advantage in favor of type B even if type A individuals have a selective advantage by construction. Moreover, the time dependence of the transition probabilities, due simply to the effect of conditioning as seen in Eq. (9), would deceptively support the idea of changing environmental conditions. We see that these conclusions, albeit logical from the point of view of explaining the observations a posteriori, are determined by conditioning, i.e., by the fact that NB(TF) takes a particular value. Given our a priori knowledge of how we have generated the trajectories, conclusions drawn from the analysis with the likelihood L(s) would therefore be deprived of any foundation. But if we had no such a priori knowledge, there would be no way to confirm or reject the conclusions based on L(s). Case III, with data coming from far back in the past and no fixation, presents some peculiarities. For the Moran model L(s) gives a relatively good estimate of the true selection coefficient whereas for the Wright-Fisher model it does not. The reason lies in the different time scales associated with absorption in each of these models. One step in the Wright-Fisher model corresponds to at least N steps in the Moran model. Thus, when the duration T of the time-series is very long and no absorption takes place at the end or close to the end of the time-series, the analysis performed with L(s) leads to a value of s close to unity, compatible with the apparent neutrality of the evolutionary trajectory. When T is short, instead, also for the Wright-Fisher model L(s) delivers a value of s closer to the true value (a test done with T = 10 confirmed this assertion). Therefore, the effect of conditioning in the future combined with the typical time scale of the process and the length of the measurement T is nontrivial15. Finally, in case IV conditioning can be very strong because the process can enter fixation at any time before the present, including times during the observed time-series. From the point of view of the likelihood L(s), case IV would give type B individuals a selective advantage, whereas Lc(s) correctly predicts that A had the advantage. Furthermore, in the light of the relationship between Lc(s) and L(s), it emerges, especially in trajectories belonging to case II, that Lc(s) is bimodal, with a local maximum governed by L(s) and a second local maximum at larger values of s governed by Φ(s). This explains the larger confidence interval for this case in both models. This suggests that the ratio of the likelihoods R(s) = Lc(s)/L(s), rather than Lc(s) alone, could be considered an even better functional to estimate the true value of the selection coefficient.

It had already been observed in the context of other models of population genetics that the generation of faithful trajectories of allele frequencies under the condition that fixation has occurred requires the introduction of a fictitious selection coefficient19,24. In the context of the Moran model instead, it was shown that under the condition that fixation has not occurred after long time, the transition probabilities require a correction factor20. While both these cases are included and generalized in this manuscript, we should stress here instead that extrapolating the selection coefficients from single historic records without due consideration to the peculiar conditioning associated to single trajectories gives values of the selection coefficients that are often very different from the real values.

Conclusions

A historical time-series is one trajectory whose contingency acts as a condition in the future and thus enters in the form of a bias in the elementary transition probabilities. The existence of such a bias when processes are conditioned in the future is often referred to as the Doob’s h-transform. Extracting the selection coefficient from frequency time-series using the wrong likelihood function has a deceptive effect: the extracted parameters seem to be meaningful and would support models that completely agree with the data used to extract them. Especially when predictions about future outcomes are not possible because of experimental limitations, seeking models solely from past macroscopic data generates a false self-consistency reminiscent of circularity in data analysis16,25,26,27. When the correct model is known, it is possible to derive a likelihood function that takes the Doob’s h-transform into account and to produce reliable estimates of the selection coefficient.

Additional Information

How to cite this article: Valleriani, A. A conditional likelihood is required to estimate the selection coefficient in ancient DNA. Sci. Rep. 6, 31561; doi: 10.1038/srep31561 (2016).