Introduction

How repeatable is evolution? As the metaphor by Stephen J Gould goes ‘if we run the tape of life back from the start how likely is it that we will get the same outcome that we see around us today?’1. The pioneering work of Lenski et al. tackled this question experimentally with microbes. It is now possible to literally play back evolution from a certain starting point and see where it leads2,3,4,5,6.

Such empirical explorations made the until then theoretical concept of fitness landscapes tangible. A fitness landscape rests on the mapping between the genotype and the phenotype of an organism. Since selection acts on the phenotype, or more precisely on the fitness of the phenotype, each genotype can be assigned a fitness. Connecting the genotypes that are one mutational step away from each other yields the fitness landscape7,8. These empirical studies make it clear that predictions cannot be based on simple rules, since complicated phenomena such as epistasis and epigenetics play a major role in the process of evolution6,9,10.

Epistasis is any deviation from the additive effects of alleles at different loci11. Epistasis gives rise to rugged fitness landscapes, which have been found to be quite common in experimental observations in a variety of model systems12,13. In particular, reciprocal sign epistasis is a necessary condition for having a rugged fitness landscape14. While in magnitude epistasis the fitness always increases (or decreases) with every additional mutation in a non-additive manner, in sign epistasis valleys appear in the fitness landscape: a mutation might lower the fitness relative to the previous state although it eventually leads to higher fitness. In such a case not all paths in the fitness landscape might be accessible to the population15. Comparing experimental systems to theoretical predictions made on the basis of the underlying fitness landscape helps elucidate the role of microscopic properties of the system in determining the macroscopic evolutionary trajectory. The details of the process, such as the mutation rate, the fitnesses of individual states and the global population size, act as constraints on the accessibility of paths13. Under the assumption of strong selection and weak mutation (SSWM), the system advances on the fitness landscape in a stepwise fashion. This automatically limits the possible number of adaptive paths10.

Evolutionary predictability and the speed of the dynamics are determined not only by the molecular constraints of fitness and mutation rate but also by population dynamics14. Theoretical explorations often assume a population of fixed size starting at one node of the fitness landscape and track its movement over the course of time. When the population size or the mutation rate increases, we observe the phenomenon of clonal interference15,16. This occurs when a second-step mutant arises in a population before the first-step mutation has fixed. In other words, the SSWM assumption is no longer valid. Clonal interference has been extensively explored experimentally17,18,19 as well as theoretically16,20,21,22,23,24,25. This phenomenon removes the limit on the accessibility of non-adaptive trajectories. If the fitnesses and mutation rates align to particular conditions, i.e. the mutation rates are also subject to epistatic interactions, then such valley crossings might be faster than adaptive trajectories24,26.

Populations in real systems are finite and their size can undergo fluctuations, which can lead to extinction events. Together with the phenomena of clonal interference and epistatic interactions between mutations (correlated rugged fitness landscapes), predicting evolution through a given fitness landscape seems like an impossible task. Herein we develop a general methodology for predicting all path probabilities in a multi-dimensional fitness landscape with epistatic interactions. To reflect a realistic scenario we use a multi-type branching process (e.g. Ref. 27) to drop the assumption of a constant population size. For presentation purposes we limit ourselves to systems without back mutations. The model in its full generality is free of this assumption, although it is unclear how to define pathways when back mutations are allowed (see Supplementary Information for a detailed explanation). To introduce the framework we begin with a simple model in which the wild type can have two independent mutations leading to the fittest type. Then we increase the number of mutational events it takes to get to the corresponding type, leading to a generalization of the methodology. We briefly mention an application of this approach by linking it to a cancer initiation model28, showing how mutational epistasis changes the path probabilities. Finally we provide an outline of how to extend the model to a general system where different mutations need to be acquired to reach the final mutant.

Methods and Results

Probability Generating Function

For our methodology, we make use of extinction probabilities, more specifically the probability for different types to be present or absent. In a branching process this probability can be obtained recursively using probability generating functions (PGFs). Since the relation between PGFs and the probability for a type to be present is our main tool, we devote this subsection to a short overview of this relation, although it is rather technical and well known (e.g. Refs. 27, 30).

The PGF in discrete time for a one-type process is in general defined as

$$f(s) = \sum_{k=0}^{\infty} p_k s^k,$$

where $k$ denotes the number of offspring and $p_k$ represents the probability of having $k$ offspring (the focal individual dies in this context)27. For many biological processes, for example cell multiplication, it makes sense to only consider offspring numbers of 0 (death), 1 (nothing happens) and 2 (cell division). But in other biological systems it makes sense to consider many offspring at once, for example reproduction via numerous seeds in plants. Our analysis is not restricted to any particular offspring distribution. However, for the sake of simplicity, we restrict our example to the so-called binary splitting, i.e. either two or no offspring. The use of the argument $s$ is not obvious at this point. If we set $s$ equal to 0, the probability generating function reduces to $f(0) = p_0$, which is the extinction probability for a population of one individual in one time step. Since all individuals behave independently, $f(0)^N = p_0^N$ is the extinction probability for a population of size $N$ in one time step. Now looking at the extinction probability within two time steps, we note that with probability $p_2$ we would have two individuals in the next time step originating from one individual. Hence, the extinction probability for a single individual within two time steps is

$$f(f(0)) = p_0 + p_2 p_0^2,$$

and that of a population with $N$ individuals is

$$f(f(0))^N = \left(p_0 + p_2 p_0^2\right)^N.$$

Continuing for further time steps, we see that $\left(f^{(t)}(0)\right)^N$, where $f^{(t)}$ denotes the $t$-fold composition of $f$ with itself, is the extinction probability for the system within $t$ time steps.
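The iteration described above can be sketched in a few lines of Python. This is a minimal illustration for binary splitting only; the function names and the values of `p0` and `p2` are chosen here for the example and do not come from the text.

```python
def pgf_binary(s, p0, p2):
    """PGF of binary splitting: no offspring with prob. p0, two with prob. p2."""
    return p0 + p2 * s ** 2


def extinction_within(t, p0, p2, n=1):
    """Probability that n independent founders have no descendants after t steps."""
    s = 0.0                      # start from the argument s = 0
    for _ in range(t):           # t-fold composition f(f(...f(0)))
        s = pgf_binary(s, p0, p2)
    return s ** n                # founders behave independently


p0, p2 = 0.25, 0.75              # illustrative values with p0 + p2 = 1
one_step = extinction_within(1, p0, p2)      # equals p0
two_steps = extinction_within(2, p0, p2)     # equals p0 + p2 * p0**2
long_run = extinction_within(2000, p0, p2)   # approaches p0 / p2 when p2 > p0
print(one_step, two_steps, long_run)
```

Iterating from s = 0 yields a non-decreasing sequence, so the loop converges to the smallest fixed point of the PGF, which is the ultimate extinction probability.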

So far we assumed that individuals reproduce clonally, i.e. give rise to the same type. Now we continue investigating the extinction probability for a two-type process. Let us think of two types A and B, where an A individual can produce any number of A or B individuals, and likewise for B. Then the general PGFs, if the process starts with one type A or one type B individual, are defined as

$$f_A(s_A, s_B) = \sum_{k_A, k_B} p^{A}_{k_A, k_B}\, s_A^{k_A} s_B^{k_B}, \qquad f_B(s_A, s_B) = \sum_{k_A, k_B} p^{B}_{k_A, k_B}\, s_A^{k_A} s_B^{k_B},$$

where $p^{A}_{k_A, k_B}$ ($p^{B}_{k_A, k_B}$) denotes the probability of one A (B) individual producing $k_A$ A and $k_B$ B individuals in the next time step. Let us try to recover the extinction probability as for the one-type process. If we set both $s_A$ and $s_B$ equal to zero and assume that we start with one A individual, we obtain a similar result as above for the total extinction probability

$$f_A(0, 0) = p^{A}_{0,0}.$$

Oftentimes, one is rather interested in the extinction, or non-presence, of just one particular type. Let us for example assume we are only interested in the presence of B individuals. The probability of having no B individuals in time step 1 is the sum over all probabilities where no B offspring is produced, $f_A(1, 0) = \sum_{k_A} p^{A}_{k_A, 0}$ (respectively $f_B(1, 0) = \sum_{k_A} p^{B}_{k_A, 0}$), starting with one A (B) individual. Now looking at the probability of having no B individuals in time step 2, we need to account for the probability of having $k_A$ A and $k_B$ B individuals being produced in the first time step. This leads to

$$f_A\big(f_A(1, 0), f_B(1, 0)\big) = \sum_{k_A, k_B} p^{A}_{k_A, k_B}\, f_A(1, 0)^{k_A} f_B(1, 0)^{k_B}.$$

Continuing this procedure and analogous to the one-type process, the probability of having no B individual at time $t$, starting with one A individual, is $f_A^{(t)}(1, 0)$, where the superscript $(t)$ denotes $t$-fold iteration of the pair $(f_A, f_B)$.
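The two-type recursion can be illustrated with a concrete offspring distribution. The sketch below assumes binary splitting in which each offspring of an A individual is of type B with probability `mu`, and B individuals only produce B offspring. This specific model and all names in the code are assumptions made for the example.

```python
def f_A(s_A, s_B, d_A, mu):
    """PGF for one A individual: dies w.p. d_A, else two offspring, each B w.p. mu."""
    return d_A + (1.0 - d_A) * ((1.0 - mu) * s_A + mu * s_B) ** 2


def f_B(s_A, s_B, d_B):
    """PGF for one B individual: dies w.p. d_B, else two B offspring."""
    return d_B + (1.0 - d_B) * s_B ** 2


def prob_no_B(t, d_A, d_B, mu):
    """Probability that no B individual is present at time t, starting from one A."""
    x_A, x_B = 1.0, 0.0          # s_A = 1 ignores A individuals, s_B = 0 forbids B
    for _ in range(t):
        # both components are updated from the values of the previous step
        x_A, x_B = f_A(x_A, x_B, d_A, mu), f_B(x_A, x_B, d_B)
    return x_A


print(prob_no_B(1, 0.5, 0.4, 0.01))   # d_A + b_A * (1 - mu)**2
print(prob_no_B(50, 0.5, 0.4, 0.01))
```

Setting the argument of a type to 1 removes it from the bookkeeping, while setting it to 0 asks for its absence. The same trick extends to any number of types.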

In a similar fashion this procedure can be extended to a multi-type process with an arbitrary number of types. For further information and detailed insights into extinction of branching processes we refer to Refs. 27, 30.

Two dimensional fitness landscape

We begin with a minimal fitness landscape. Envision a wildtype ab which can mutate at the two loci to A and B, respectively. With both mutations, the system is in the final state of AB. In such a system there are two different paths as illustrated in Figure 1.

Figure 1

Mutational pathways for a system with two loci.

There are two different pathways to reach the final mutant. Fitness is represented by the size of the circles denoting the types. Thus the wildtype ab and Ab have a similar fitness, whereas AB has a significantly greater fitness than the wildtype and aB is much less fit than the wildtype. When all mutation rates are the same, the pathway via aB would not be adaptive, since this type has a low fitness. If the mutation rate along this pathway is large enough (indicated by the thick arrow), the pathway becomes accessible.

Traditionally, epistatic models are discussed in terms of different fitness values, whereas the mutation rates stay the same13,14. As an example, the fitness landscape for a system with sign epistasis is shown in Figure 1. In such a system, if the mutation rates stay the same, i.e. the rate of acquiring a mutation does not depend on which other mutation is already present, it is clear that the path via Ab is the most probable one. However, if the mutation rates change with the genetic background, the path via aB can also become accessible. Changing mutation rates amounts to including epistasis in the mutational landscape in addition to epistasis in the fitness landscape29.

For the four types of the above model, we need to consider four different PGFs, one for each type
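Under the assumption that each of the two offspring of a dividing individual mutates independently at birth, the four PGFs can be sketched as follows. The mutation symbols are labels chosen for this illustration: $\mu_A$ and $\mu_B$ are the per-offspring probabilities that a wild-type offspring acquires mutation A or B, while $\mu_B^{A}$ and $\mu_A^{B}$ are the probabilities of acquiring the second mutation in the Ab and aB background, respectively.

```latex
\begin{align}
f_{ab}(\mathbf{s}) &= d_{ab} + b_{ab}\left[(1-\mu_A-\mu_B)\,s_{ab} + \mu_A\,s_{Ab} + \mu_B\,s_{aB}\right]^2,\\
f_{Ab}(\mathbf{s}) &= d_{Ab} + b_{Ab}\left[(1-\mu_B^{A})\,s_{Ab} + \mu_B^{A}\,s_{AB}\right]^2,\\
f_{aB}(\mathbf{s}) &= d_{aB} + b_{aB}\left[(1-\mu_A^{B})\,s_{aB} + \mu_A^{B}\,s_{AB}\right]^2,\\
f_{AB}(\mathbf{s}) &= d_{AB} + b_{AB}\,s_{AB}^2,
\qquad \mathbf{s} = (s_{ab}, s_{Ab}, s_{aB}, s_{AB}),
\end{align}
```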

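where $b_i$ and $d_i$ are the birth and death probabilities of type $i$. The exponent of 2 arises from a branching process with binary splitting. The arguments $s_{ab}, \ldots, s_{AB}$ correspond to extinction probabilities of the respective type as discussed above. The functions $f_i$ correspond to the extinction probability of the whole process given that the process starts with a single individual of type $i$. The PGF $f_i$ at time $t$ is recursively calculated as

$$f_i^{(t)}(\mathbf{s}) = f_i\!\left(f_{ab}^{(t-1)}(\mathbf{s}),\, f_{Ab}^{(t-1)}(\mathbf{s}),\, f_{aB}^{(t-1)}(\mathbf{s}),\, f_{AB}^{(t-1)}(\mathbf{s})\right), \qquad f_i^{(0)}(\mathbf{s}) = s_i,$$

with $\mathbf{s} = (s_{ab}, s_{Ab}, s_{aB}, s_{AB})$.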

Time Distribution

Using the generating functions we now approach the extinction time distribution of the binary branching process. In particular, starting with one wild-type individual, the probability of having no AB individual at time $t$ is $f_{ab}^{(t)}(1, 1, 1, 0)$, where the superscript $(t)$ denotes $t$-fold iteration of the PGFs; we abbreviate this quantity as $f(t)$. Thus the probability of having at least one AB individual at time $t$ is $1 - f(t)$. The probability that at least one AB individual appears exactly at time $t$ is the probability that there is an AB individual at $t$ minus the probability that there was already one at time $t-1$:

$$\big(1 - f(t)\big) - \big(1 - f(t-1)\big) = f(t-1) - f(t).$$

Starting with $N$ wild-type individuals, the probability that there is no AB individual at time $t$ is $f(t)^N$. This leads to the time distribution

$$f(t-1)^N - f(t)^N.$$

However, the arising AB individual should start a lineage that does not die out. Hence we are interested in the probability of having a successful AB individual. To calculate this we use the known extinction probability of an AB individual in place of $s_{AB}$. The probability of an AB lineage going extinct is its death probability divided by its birth probability, $e_{AB} = d_{AB}/b_{AB}$, which is the smaller root of the fixed-point equation $s = d_{AB} + b_{AB}\,s^2$ when $b_{AB} > d_{AB}$ (Ref. 31). The modified PGFs are obtained by substituting $e_{AB}$ for $s_{AB}$ in the PGFs of the other three types.

Note that the PGF for the final mutant type is not necessary anymore. We can now calculate the time distribution until the first successful mutant appears in the same way as described above. Figure 2 shows the perfect agreement between the recursive solution and 5000 simulations. The parameters, specified in the caption of Figure 2, are chosen arbitrarily to reflect an epistatic fitness landscape as sketched in Figure 1. We chose a very slightly advantageous fitness for type Ab solely to stress that this method holds for any fitness values, not only when some are restricted, for example to being neutral.
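The recursive computation of this time distribution can be sketched as follows. The sketch assumes binary splitting in which each offspring mutates independently at birth. The death probabilities and N are taken from the caption of Figure 2. Only one mutation probability (2e-5) is given there, and assigning it to the step ab → Ab is an assumption, as are the three remaining mutation probabilities, which are hypothetical placeholders. All function and variable names are chosen for this example.

```python
def mixed(q_parent, mutations):
    """Generating function of one offspring: parental type unless it mutates.

    mutations is a list of (probability, value_for_mutant_type) pairs.
    """
    return q_parent + sum(m * (q_mut - q_parent) for m, q_mut in mutations)


def first_success_distribution(d, mu, n_start, t_max):
    """Return (rho, remaining) for the first successful AB individual.

    rho[t - 1] is the probability that it arises exactly at step t, and
    remaining is the probability that none has arisen by step t_max.
    """
    b = {k: 1.0 - v for k, v in d.items()}    # birth = 1 - death
    e_AB = d["AB"] / b["AB"]                  # extinction probability of an AB lineage
    q = {"ab": 1.0, "Ab": 1.0, "aB": 1.0}     # P(no successful AB yet) per founder type
    rho = []
    for _ in range(t_max):
        new = {
            "ab": d["ab"] + b["ab"] * mixed(
                q["ab"], [(mu["ab", "Ab"], q["Ab"]), (mu["ab", "aB"], q["aB"])]) ** 2,
            "Ab": d["Ab"] + b["Ab"] * mixed(q["Ab"], [(mu["Ab", "AB"], e_AB)]) ** 2,
            "aB": d["aB"] + b["aB"] * mixed(q["aB"], [(mu["aB", "AB"], e_AB)]) ** 2,
        }
        # f(t-1)^N - f(t)^N for N independent wild-type founders
        rho.append(q["ab"] ** n_start - new["ab"] ** n_start)
        q = new
    return rho, q["ab"] ** n_start


# Death probabilities and N as in the caption of Figure 2.
d = {"ab": 0.5, "Ab": 0.49995, "aB": 2 / 3, "AB": 0.25}
# ("ab", "Ab") uses the stated 2e-5 (assumed assignment); the rest are placeholders.
mu = {("ab", "Ab"): 2e-5, ("ab", "aB"): 2e-5, ("Ab", "AB"): 2e-5, ("aB", "AB"): 2e-3}
rho, remaining = first_success_distribution(d, mu, n_start=30000, t_max=2000)
print(sum(rho), remaining)
```

Writing the offspring term as `q_parent + mu * (q_mut - q_parent)` rather than `(1 - mu) * q_parent + mu * q_mut` keeps the value at exactly 1 while no mutant can have succeeded yet, which avoids rounding noise being amplified by the large exponent N.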

Figure 2

Time distribution of reaching the final mutant for a four type fitness landscape as in Fig. 1.

The solid line represents the recursive solution and the bars represent 5000 simulations. The parameters are: Death probabilities: $d_{ab} = 0.5$, $d_{Ab} = 0.49995$, $d_{aB} = 2/3$, $d_{AB} = 0.25$. Birth probabilities are 1 minus the corresponding death probability. Mutation probabilities are , $\mu_A = 2 \times 10^{-5}$, . Population size in the beginning: $N = 30000$.

For a three-type continuous-time branching process, the time distribution was computed in Ref. 32. This was done using the analytical solution of the probability generating function for the two-type process33 and the fact that in continuous time mutations follow a Poisson distribution. Adding a second intermediate type, e.g. B2, would also give such a process but immediately results in unwieldy analytical calculations.

Path Probabilities

In the current example there are two possible paths by which the wildtype can reach the final mutant AB, either ab → Ab → AB or ab → aB → AB. Experimental evidence shows that not all paths are equally probable15,34. Beginning with ab, what is the probability of the first AB mutant arising via either path, and how long does it take for the different pathways?

The probability that the first mutant arises exactly at time $t$ via pathway Ab, denoted $\rho_{Ab}(t)$, is derived in the Supporting Information (SI). It involves a quantity that is defined in the SI and is computed in a similar fashion as $f(t)$. The total probability for this path is then the sum of $\rho_{Ab}(t)$ over all time steps,

$$\sum_{t=1}^{\infty} \rho_{Ab}(t).$$

Computationally the sum goes up to a $t_{max}$ determined by a small tolerance, for which machine epsilon is the usual choice. The total extinction probability of a multi-type branching process is determined by the smallest fixed point of the probability generating functions, $f(\mathbf{s}^*) = \mathbf{s}^*$, where the component $s^*_{ab}$ is the extinction probability if the process starts with one ab individual27. Nevertheless, those total extinction probabilities are not suitable for the question of which path the first successful AB mutant takes. The problem lies in the timing: the pathway via Ab could, for example, have a very low extinction probability whereas the pathway via aB might have an extinction probability of 1/2. Intuitively one would expect the path via Ab to be more frequent. However, if the path via aB is much faster (e.g. due to a larger mutation probability along it), one would actually find that each path happens with a probability that approaches 1/2. Therefore, it is important to do the recursive analysis to include the probability that a successful mutant did not arise through any other path beforehand.
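The statement that both paths approach probability 1/2 can be made plausible with a back-of-the-envelope calculation. As a simplifying assumption that is not part of the recursive analysis, treat the two pathways as independent, write $e_{Ab} \approx 0$ and $e_{aB} = 1/2$ for the probabilities that each pathway never produces a successful AB mutant, and suppose that the pathway via aB, whenever it succeeds, does so before the pathway via Ab:

```latex
\begin{align}
P(\text{first via } aB) &\approx 1 - e_{aB} = \tfrac{1}{2},\\
P(\text{first via } Ab) &\approx e_{aB}\,(1 - e_{Ab}) \approx \tfrac{1}{2}.
\end{align}
```

The slow but reliable pathway only gets its chance when the fast one fails, which is why total extinction probabilities alone do not determine the path probabilities.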

Figure 3 shows the probability densities for the different pathways of the minimal model. Interestingly, the pathway via aB dominates early on but is less likely overall. Hence experiments that are stopped after a short time interval might support conclusions that are overturned when the same experiments are examined at a later time point.

Figure 3

Probability distribution for the different pathways.

Orange represents the pathway via aB and blue the pathway via Ab. The bars are the results of simulations, the solid lines depict the computed results. The pie charts illustrate the distribution of the pathways up to 500 time steps (shaded area, left pie chart) and up to 5000 time steps (right pie chart). Stopping after a few lineages have reached the final mutant might lead to a false distribution: the other pathway might just need longer, but have a smaller extinction probability. The parameters are: Death probabilities: $d_{ab} = 0.5$, $d_{aB} = 2/3$, $d_{Ab} = 0.49995$, $d_{AB} = 0.25$. Birth probabilities are 1 minus the corresponding death probability. Mutation probabilities are , $\mu_A = 2 \times 10^{-5}$, . Initial population size is $N = 30000$.

Multiple mutations in two dimensions

In the earlier model the wildtype had two possible mutations, a → A and b → B. It is possible that a → A and b → B are each multi-step processes. We therefore assume that it takes m mutations to go from a to A and n to go from b to B; for m = n = 1 we recover the simple model discussed above. The calculation of the time distribution can be directly transferred from the simple model by including all necessary probability generating functions for all available types. Increasing the length of the dimensions has a direct impact on the number of paths leading from the wildtype to the final mutant. In particular there are $\binom{m+n}{m}$ possible paths. Assuming in general m mutations in the A dimension and n in the B dimension, we enumerate the paths as follows. Path 1 is the path where at first all A mutations and subsequently all B mutations happen. Path 2 is the path where all but one A mutations happen first, then one B, then the last A and finally all other B mutations. Figure 4 shows the different paths for a system with four mutations for type A and one mutation for type B. Thus calculating the path probability for any particular path p now takes the form,

where f(t) is the probability generating function as in Eq. A.2 and is defined analogously to Eq. A.9 in the SI

Figure 4

Exemplary numbering of the different mutational pathways in a system with m = 4 mutations for type A and n = 1 mutation for B.
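The enumeration of paths for general m and n can be sketched in Python. The ordering produced below agrees with the verbal description of paths 1 and 2; whether it matches the remaining numbering used in the figure is an assumption. The function names are chosen for this example.

```python
from itertools import combinations
from math import comb


def enumerate_paths(m, n):
    """All orderings of m A-steps and n B-steps, as strings such as 'AAABA'."""
    paths = []
    for a_slots in combinations(range(m + n), m):   # positions taken by A-steps
        chosen = set(a_slots)
        paths.append("".join("A" if i in chosen else "B" for i in range(m + n)))
    return paths


def type_counts(m, n):
    """Number of types in total, along one path, and off that path."""
    total = (m + 1) * (n + 1)        # every combination of A- and B-levels
    on_path = m + n + 1              # wild type plus one type per mutational step
    return total, on_path, total - on_path   # the last entry equals m * n


paths = enumerate_paths(4, 1)
for number, path in enumerate(paths, start=1):
    print(number, path)
print(len(paths) == comb(4 + 1, 1), type_counts(4, 1))
```

Each path is a sequence of m + n single mutational steps, so counting paths amounts to choosing which m of those steps are A mutations.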

Here, the probability generating functions with a p index belong to types along the regarded path (m + n + 1 in total without back mutations; the subindex starts at 0, which always labels the wild type). Accordingly, probability generating functions with a q index are associated with types that do not belong to the respective path (m × n in total). The probability generating function for the final mutant type is again replaced by the extinction probability of this type. We apply our framework with this extension to the cancer initiation model proposed in Ref. 28. Therein a model with several mutational steps to reach state A and one mutational step for state B is analyzed (cf. Fig. 4). The direct change in fitness for the A mutations is (nearly) zero and the B mutation alone is even deleterious. However, if an individual obtains all A mutations and the B mutation, the fitness is enhanced, which in the model leads to rapid proliferation. Here, we provide an example of how the path probabilities change when epistasis is present not just in the fitness landscape but in the mutational landscape as well. Figure 5 compares the path probability distributions with and without epistasis in the mutational landscape. The fitness values, i.e. the birth and death probabilities, as well as the “nonepistatic” mutation probabilities, are the same as in Ref. 28.

Figure 5

Comparison between the path probability distributions of a minimal Burkitt Lymphoma model.

Top: Time distributions for the model (a) without epistatic effects on mutation probabilities and (b) with mutational epistasis. The probability to obtain an A mutation is 100 times higher if the B mutation is present in that individual. Bottom: In (c) the path probabilities for the model without epistatic effects on mutations are illustrated, whereas in (d) the mutation probability is again increased by a factor of 100 for an A mutation if the B mutation is present. Pathway 1 corresponds to the mutational pathway where first all necessary extra mutations have to be acquired and the B mutation happens last. Pathway 2 denotes the pathway where 3 of 4 extra mutations have been obtained, then the B mutation happens and at last the final extra mutation is acquired. The other pathways are numbered accordingly (cf. Figure 4). The parameters are the same as in Ref. 28: The birth probability for an individual with j passenger mutations and without the B mutation is $b_{0,j} = 0.5(1 + 10^{-5})^j$ and with the B mutation . The mutation probability for the B mutation is $\mu_D = 5 \times 10^{-6}$, for an A mutation without the B mutation being present $\mu_P = 2 \times 10^{-5}$ and with the B mutation being present (only necessary for (b) and (d)) . The population size in the beginning is $N = 500000$.

Multi dimensional fitness landscapes

The cancer landscape discussed above is a two dimensional system. In principle it is possible to extend this approach to higher dimensions. For fitness landscapes of higher orders15,35 it is still possible to write down the system of probability generating functions and apply the approach explained here. The concept remains the same. For each type the probability generating function is needed, except for the final mutant type, for which only the extinction probability is necessary (SI). Finally the probability generating function for the wild type needs to be recursively calculated for the time distribution. For the path probabilities the probability generating functions related to types not along the considered path are again one time step behind, similar to Eq. 16. However, while accurate data elucidating the fitness landscape can be obtained for these experimental systems, the mutational landscape is usually hard to determine.

Discussion

We have presented a theoretical framework to study mutational pathways in epistatic systems. The crucial part is that in our analysis epistasis affects not only fitness (i.e. proliferation and death rates) but also mutation rates. In this way we could show that pathways become accessible that would be very unlikely without mutational epistatic effects (cf. e.g. Figure 5). Our analysis is based on multi-type branching processes and hence does not rely on the assumption of a constant population size.

While we have focused on a fairly simple system with a fitness landscape with a single peak, the approach can be extended to a rugged fitness landscape. Moreover, if back mutations are involved, one can still calculate the time distribution, although pathways are not clearly defined in a system with back mutations anymore (see SI). Furthermore in the current scenario in each time step the individuals could replicate or die. In addition we could have a resting probability where the individuals remain in the same state with a certain probability. Such complicated scenarios can be incorporated in our framework as well (SI). The computations can be precisely represented in analytic terms and need to be solved recursively.

We apply our framework to a cancer model including mutational epistasis28 and show how the path probabilities are altered by it. Mutational epistasis can thus lead to heterogeneity in the density of different mutant types between different age groups as reaching the final mutant early is only possible by one mutational pathway which is not possible at later time points.

As shown here the mutational landscape can undermine the current predictions based solely on fitness landscapes. Just like in long term evolution, experimental as well as theoretical approaches ought to be balanced between studying effects of selection and the strengths of mutations. The theoretical analysis based on the approach explained here helps in understanding the importance of mutational epistasis, even though the computations have to be solved recursively. In particular, it makes analyzing the fitness and mutational landscapes more interactive, since long-lasting simulations are not necessary any more.