The face of evolutionary biology is changing: from reconstructing and analysing the past to predicting future evolutionary processes. Recent developments include prediction of reproducible patterns in parallel evolution experiments, forecasting the future of individual populations using data from their past, and controlled manipulation of evolutionary dynamics. Here we undertake a synthesis of central concepts for evolutionary predictions, based on examples of microbial and viral systems, cancer cell populations, and immune receptor repertoires. These systems have strikingly similar evolutionary dynamics driven by the competition of clades within a population. These dynamics are the basis for models that predict the evolution of clade frequencies, as well as broad genetic and phenotypic changes. Moreover, there are strong links between prediction and control, which are important for interventions such as vaccine or therapy design. All of these are key elements of what may become a predictive theory of evolution.
Chance and necessity of evolution are a classic topic in biology1,
What is predictable in evolution, what may become predictable in the near future, and what will remain unpredictable? These are the central questions of this Perspective. Here we use the term prediction in a specific sense: a testable hypothesis about an evolutionary process that extends into the future. This distinguishes evolutionary predictions from the broader usage of the term prediction (of a model, that can be tested by experiment) and excludes processes with solely metabolic or ecological dynamics. Building on recent progress in microbial and viral evolution, cancer evolution, and the somatic evolution of immune systems, we develop unifying concepts for predictive analysis and identify avenues for future research.
What makes evolution predictable?
A look at evolutionary processes on the molecular scale seems to support scepticism on predictability. Molecular evolution is driven by mutations that arise randomly in an individual's genome and act on complicated, in part unknown cellular machinery. The fate of mutations in an evolving population appears similarly complicated. In Fig. 1a–d, we plot the frequency paths of genetic mutations in systems representative for this article, which include a laboratory population of yeast cells, the human influenza virus A/H3N2, and populations of cancer and immune B cells in a human individual. These systems are examples of Darwinian evolution: genetic variation is continuously produced by mutations and is acted upon by selection. Part of the positively selected changes expands in the entire population and generates increasing divergence from its initial state. A closer look reveals the complexity of the evolutionary dynamics. All of the populations have multiple coexisting clades (that is, groups of genetically related individuals); beneficial mutations in disjoint clades compete for fixation, while mutations in nested clades reinforce one another (Fig. 1e). This evolutionary mode, which is commonly called clonal interference, arises in large asexual populations subject to strong selection17. Here we use the term clades (instead of clones) to highlight that successful clades acquire new genetic diversity on their way to fixation (Fig. 1e). Clonal interference has been observed in laboratory evolution of microbial and viral populations18,19 and probably governs all of the systems shown in Fig. 120,
If selection is to generate predictability, it must prune a highly complex space of evolutionary possibilities to essentially a single likely alternative. Research in recent years has revealed that at different levels of biological organization, the degree to which this takes place varies greatly. In parallel-evolving laboratory populations, the vast majority of single-nucleotide and amino acid changes occur in just a single population26,
The divergence of genome evolution observed in all of these systems is hardly surprising. Even simple units of biological systems have a large number of possible mutational changes that have similar functional effects. For example, loss of genes in regulatory networks is frequently observed in evolution experiments36, and a given gene can be silenced by many different sequence mutations. More generally, changes in gene regulation and in cell metabolism have a large mutational target, given the redundancies in regulatory sequence grammar and metabolic pathways. These redundancies imply that ‘microscopic’ genome evolution is not repeatable. But they also hold a positive message for predictability: in order to forecast functional changes in a population, we do not need to know the exact evolutionary path in sequence space.
At a more coarse-grained level, recent evolution experiments do suggest a route towards predictive analysis. Heritable phenotypes, in particular quantitative traits, often evolve in a more repeatable way than genomic sequences. This is a common feature of microbial populations4,27,37,
The difference in repeatability between genomic and phenotypic evolution reflects the dependence of selection on biological scale. Sequence space contains a staggering number of evolutionary paths. Although negative and positive selection reduce the number of likely paths, sequence evolution remains generically unrepeatable (Fig. 2a). Two main factors generate stochasticity: mutations with similar functional effects have similar fitness effects and similar likelihood; moreover, a fraction of the system's genomic sites evolves under weak selection altogether. For example, sites that are part of quantitative traits with sequence redundancy evolve near neutrality, even if the trait itself is under substantial stabilizing selection. More generally, clonal interference acts as a selective filter: only strongly adaptive mutations are governed by their own selection coefficient and can evolve repeatably; moderately selected beneficial and deleterious mutations acquire near-neutral fixation probabilities47,48 and lose repeatability. This effect has been called emergent neutrality47,48 and can be understood from Fig. 1e: moderately selected mutations are pushed and pulled by the dynamics of clades in their immediate back- and foreground, which is driven by stronger selection. The theoretical expectation of emergent neutrality is in line with the hitchhiking of deleterious mutations observed in evolution experiments and wild populations29,49,
Phenotypic evolution is marked by correlations that can be harvested for the inference of fitness landscapes and for predictive analysis. One source of such correlations is the nonlinearity of phenotypic fitness landscapes, which implies broad fitness interactions (epistasis) between mutations: deleterious changes increase in cost with increasing distance from a ridge; beneficial changes decrease in return with decreasing distance from a peak74. Importantly, these interactions generate evolutionary constraints and increase the predictability of phenotypic processes and outcome. Yeast populations, for example, show a rate of adaptation that is predictable in terms of their initial fitness38. Even in macro-evolutionary processes, phenotypic epistasis can generate a predictable order of evolutionary steps, as has been observed in the evolution of complex functions in prokaryotes by lateral gene transfer75 and of photosynthesis in plants76. The emerging picture of smooth phenotype-fitness maps with ‘macroscopic’ epistasis38 (Fig. 2b) is in sharp contrast to that of rugged fitness landscapes on sequence space. The latter are dominated by ‘microscopic’ epistasis, which decreases the number of accessible evolutionary paths66,72, but the local peaks and valleys in a given system can hardly be captured by a predictive model with few parameters70,77,
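The link between a nonlinear phenotype-fitness map and epistasis can be made concrete with a small numerical sketch. The saturating map, the function names, and the mutation effects below are illustrative assumptions and are not taken from the studies cited above. The sketch assumes that mutations act additively on a single trait and that fitness is a concave function of that trait.

```python
import math


def saturating_fitness(trait):
    """Hypothetical concave phenotype-fitness map approaching a plateau at 1."""
    return 1.0 - math.exp(-trait)


def fitness_gain(fitness, effect, background):
    """Fitness effect of a mutation with trait effect `effect` on a given background."""
    return fitness(background + effect) - fitness(background)


def pairwise_epistasis(fitness, effect_a, effect_b, background=0.0):
    """Deviation of the double mutant from the sum of the single-mutant effects."""
    f0 = fitness(background)
    fa = fitness(background + effect_a)
    fb = fitness(background + effect_b)
    fab = fitness(background + effect_a + effect_b)
    return fab - fa - fb + f0


# The same trait change yields a smaller return closer to the plateau.
gain_far = fitness_gain(saturating_fitness, 1.0, background=0.0)
gain_near = fitness_gain(saturating_fitness, 1.0, background=2.0)

# Two beneficial mutations interact negatively on the concave map,
# whereas a linear map produces no epistasis at all.
eps_concave = pairwise_epistasis(saturating_fitness, 1.0, 1.0)
eps_linear = pairwise_epistasis(lambda z: 2.0 * z, 1.0, 1.0)
```

In this toy setting, the epistasis is a property of the map rather than of individual sequence sites, which is the sense in which it is 'macroscopic'.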
In the densely packed genomes of microbial and viral systems, multiple traits are often encoded in common genetic loci. This property (called pleiotropy) is another source of evolutionary correlations relevant for predictions. Pleiotropy constrains adaptive evolution to characteristic serpentine paths: primary beneficial mutations advance adaptive traits but degrade conserved traits encoded at the same site (because the adaptive allele is, on average, deleterious for other traits). The collateral damage of adaptation is subsequently repaired by compensatory mutations55,
A crucial and largely unexplored determinant of predictability is the variation in initial conditions and environmental factors across populations. Many lab evolution experiments are designed to limit this variation: populations start from a well-defined initial state (often a single clone), and the experiments are conducted under carefully calibrated conditions37,49. In contrast, the evolution of populations in the wild can have different—and often unknown—initial states, and it takes place under variable ecological conditions. These factors can clearly hamper predictability. However, some recent results indicate that more complex evolutionary processes retain repeatable characteristics. First, standing variation maintains or even enhances short-term repeatability, because adaptive mutations may already be present in the initial population state28,90. If selection is sufficiently strong, even de-novo mutations generated from a complex initial state have repeatable features91. Second, heterogeneity across parallel-evolving populations can become subdominant if strong adaptive pressure generates convergent evolution. For example, an adaptation experiment of bacteria in the ecosystem of the mouse gut shows similar early-stage phenotypic changes across different hosts92. Another case in point is the adaptive immune response of humans to an influenza infection or vaccination. Although individuals have different immune repertoire-wide responses93, some antigenic characteristics of their response to related viruses are similar6. Despite these convergent aspects, populations are often shaped by differential response to environmental variation. In many cases, this requires modelling evolution under time-dependent selection, in so-called fitness seascapes94. The resulting challenges for predictive analysis will be discussed below.
Predictive data and models
Recent work has underscored the importance of comprehensive data and quantitative modelling for predictions. With modern sequencing, evolutionary models can be based on copious sequence information. We can track the genetic history of entire populations (Fig. 1), detect low-frequency variants, and resolve the spatio-temporal evolutionary dynamics in extended populations95,
To build a predictive analysis from these data, we need to relate genetic or phenotypic data to fitness differences in a population. Given the complexity of generic fitness landscapes, this seems a daunting task77. Densely sampled sequence data, however, contain copious information on selective effects that can be assembled to infer fitness land- and seascapes. Site-specific amino acid preferences can be inferred from deep sequencing data104 using equilibrium models of molecular evolution105,106; related methods map epistatic interactions between these sites16,107. Alternatively, we can infer selection on genetic clades and build predictive models from the local shape of sequence-based coalescent trees15.
At the level of quantitative traits, biophysical principles provide powerful guidance for building empirical fitness models54,58,108,
A minimal fitness model for pathogen evolution can serve to illustrate key concepts of predictive analysis. The model describes the coupled evolution of an adaptive trait (such as antigenicity or resistivity) and a conserved trait (for example, fold stability), which are encoded in a single protein. The minimal fitness seascape, which contains stabilizing selection on the conserved trait and adaptive pressure on the adaptive trait94, is an explicitly time-dependent version of Fisher's geometrical model115,116 (Fig. 3a) or a similar model with components of mesa form (as described above). This type of model has been applied to the evolution of human influenza13; its fitness trade-off between traits also captures aspects of HIV evolution under host immune pressure17, of drug resistance evolution61, and of cancer evolution63. The time-dependence of selection on the adaptive trait describes variable environments and is a key feature of the model. For example, the cross-immunity interactions affecting a pathogen13 depend on the infection history of its hosts; similarly, movement along a spatial gradient of drug concentration generates time-dependent adaptive pressure117,118.
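A minimal sketch of such a two-trait seascape is given below. The quadratic form of both fitness components, the linearly moving optimum of the adaptive trait, and all parameter names and values are assumptions made for illustration; they are not the parametrization used in the cited models.

```python
def seascape_fitness(adaptive, conserved, t,
                     speed=1.0, c_adaptive=1.0, c_conserved=1.0,
                     conserved_opt=0.0):
    """Hypothetical time-dependent Fisher-type fitness for two traits.

    The adaptive trait is selected towards an optimum that moves with time
    (assumed here to move linearly at rate `speed`); the conserved trait is
    under stabilizing selection around a fixed optimum.
    """
    adaptive_opt = speed * t
    return (-c_adaptive * (adaptive - adaptive_opt) ** 2
            - c_conserved * (conserved - conserved_opt) ** 2)


# At t = 0 the ancestral strain sits at the fitness peak.
ancestor_now = seascape_fitness(0.0, 0.0, t=0.0)

# At t = 1 the optimum has moved. A pleiotropic mutant that follows it
# (adaptive trait +1) but degrades the conserved trait (by 0.5) is still
# fitter than the ancestor, at a cost that compensatory changes could repair.
ancestor_later = seascape_fitness(0.0, 0.0, t=1.0)
mutant_later = seascape_fitness(1.0, 0.5, t=1.0)
compensated_later = seascape_fitness(1.0, 0.0, t=1.0)
```

The explicit dependence on `t` is what turns the landscape into a seascape: the same genotype changes its fitness as the environment moves on.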
How can we gather data to inform such fitness models? We often have at least partial information on the genetic changes underlying the evolution of the relevant phenotypes. In viral pathogens, for example, the antigenic evolution predominantly occurs in specific epitope sites, whereas fold stability has a broader mutational target of amino acid changes throughout the protein. In some cases, the genetic information includes epistatic interactions between specific sequence sites16,107. Predictive analysis can then exploit an approximate genotype–phenotype map of the adaptive process. Alternatively, we can record phenotypic data by experiment. For example, antigenic assays119,
From these examples, we can distil a few general lessons for predictive modelling in evolution. A ‘mechanistic’ fitness model of the evolutionary dynamics, similar to our minimal model, is feasible if the population harbours substantial variation in fitness that can be explained by few key phenotypes. Such models generically contain positive and negative fitness components, which jointly constrain the evolutionary complexity of the system and generate predictability. In adaptive processes, modelling starts with the key adaptive traits of the system, such as antibiotic resistance or immunity against an antigen. Importantly, however, an adaptive trait alone is often an insufficient basis for predictions, because its evolution is generically coupled to other traits. As discussed above, such correlations arise from epistasis or pleiotropy and generate a serpentine pattern of adaptive paths (Fig. 3a). They can reduce the independent components of fitness variation and, thus, reduce the necessary complexity of fitness models. In complex organisms, we need to map the most informative phenotypes and their correlations to determine the normal modes of predictive analysis. This will eventually require a systems-biology approach to evolutionary predictions122.
The above examples also show that understanding the ecology of fast-evolving populations, which includes exposure to drugs and host–pathogen interactions, is often the salient point of predictive analysis. Co-evolutionary fitness models, which have recently been developed for pathogen-immune systems123,124, are a promising step towards predictions in realistic ecological settings. The success of these models will depend on sufficiently dense time-resolved data of the evolving population and its variable environment. Predicting evolution in an ecological context also generates new questions. For example, we often want to predict not only frequencies but absolute population numbers, such as the viral load of an infection, the size of a cancer cell population, or the size of an epidemic125. These numbers depend on absolute fitness values, which in turn respond strongly to ecological determinants of reproduction. Moreover, in heterogeneous populations of fast-evolving systems, fitness differences within a population can be of the same order of magnitude as absolute growth rates, so population size dynamics must be modelled together with the evolution of clade frequencies. This problem is difficult in general, but at least the response of pathogen population size to immune or vaccination pressure can be computed using fitness models of immune interactions13,120,121. Maximizing this response has been exploited as a criterion for influenza vaccine strain selection13.
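The distinction between relative and absolute fitness can be illustrated with a short sketch. It assumes that each clade grows exponentially at a constant rate; the function names and numerical values are hypothetical and serve only as an example.

```python
import math


def grow_clades(sizes, growth_rates, t):
    """Exponential growth per clade: N_i(t) = N_i(0) * exp(f_i * t)."""
    return [n * math.exp(f * t) for n, f in zip(sizes, growth_rates)]


def clade_frequencies(sizes):
    """Relative clade frequencies x_i = N_i / sum_j N_j."""
    total = sum(sizes)
    return [n / total for n in sizes]


initial = [900.0, 100.0]

# Two scenarios with the same fitness difference between clades (0.3)
# but different absolute growth rates, for example without and with a drug.
rates_growing = [0.1, 0.4]
rates_declining = [-0.5, -0.2]

final_growing = grow_clades(initial, rates_growing, 5.0)
final_declining = grow_clades(initial, rates_declining, 5.0)

freq_growing = clade_frequencies(final_growing)
freq_declining = clade_frequencies(final_declining)
```

Under these assumptions, both scenarios give the same clade frequencies, but the total population expands in one case and collapses in the other. Predicting frequencies therefore requires only fitness differences, whereas predicting population numbers requires absolute growth rates.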
An exciting complement to in silico modelling is to use laboratory evolution for predicting an evolutionary process in the wild. This makes sense if we can find a laboratory model that evolves faster than the primary system or can be run in multiple replicates126. For example, massively parallel tumour cell cultures can reveal likely future resistance mutations127. Once these methods can be applied efficiently to individual tumours, they may circumvent the problem of genetic uniqueness and provide patient-specific predictions of tumour response to therapy. Clearly, the increasing knowledge on parallelization and replicability of laboratory evolution will prove very useful for the design of such assays.
Measuring prediction quality
Predictive analysis will be applied to a broadening range of systems, and it will be built on increasingly diverse data and methods. To keep a critical eye on quality, we need an unbiased way to gauge predictive success. Intuitively, we have an idea of what makes a good prediction: it has an element of surprise and an element of truth. To illustrate these criteria, suppose we conduct an evolution experiment with mice and assert the outcome of this experiment will be mice with four legs. This statement is likely to be true but unsurprising; most would rate it as obvious in the first place. On the other hand, we may predict the experiment to produce five-legged mice. That statement is quite surprising but, as performing the actual experiment would show, is unlikely to hold up to testing. The example demonstrates that any prediction is a probabilistic statement. Specifically, it is a bet about the future that should strike a balance between surprise and truth.
We can use information theory to quantify these criteria: good predictions combine low probability in our prior expectation (that is, they are surprising) and high predicted probability (that is, they come close to the truth) of the actual process as observed later. The ‘information gain’, defined as the log ratio of predicted and prior probability, measures how much the prediction reduces our uncertainty about the future process13,128. For an adaptive process, the information gain is closely related to the amount of adaptation (that is, the cumulative fitness flux129) explained by the prediction model. Figure 4a illustrates how prior and posterior probability of a prediction depend on the evolutionary paths of the system and on the time interval of predictions. The prior probability of any future evolutionary path rapidly decreases with time, because longer processes have a much higher number of a priori plausible paths than shorter ones. Specifying the prior probability requires a statistical null model (for example, a neutral model assigns equal probability to all paths of the same length). For a good prediction, the actual path remains likely for some time, but its probability must eventually decay because of noise in the data and imperfections of the model. Therefore, the information gain shows an initial increase and saturates at a characteristic time. This sets the ‘time horizon’ of the prediction method, beyond which the results cannot be trusted.
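The definition of the information gain can be written down in a few lines. The following sketch uses natural logarithms and a neutral null model in which every path of a given length is equally likely; the function names and all numbers are illustrative assumptions, not values from the cited analyses.

```python
import math


def information_gain(predicted_prob, prior_prob):
    """Log ratio of predicted to prior probability of the observed outcome (in nats)."""
    if not (0.0 < predicted_prob <= 1.0 and 0.0 < prior_prob <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    return math.log(predicted_prob / prior_prob)


def neutral_path_prior(n_alternatives, n_steps):
    """Hypothetical neutral null model: all paths of the same length are equally likely."""
    return float(n_alternatives) ** (-n_steps)


# Illustrative numbers only: 2 alternatives per step over 3 steps.
prior = neutral_path_prior(2, 3)

# A surprising prediction that turns out to be right has a positive gain.
surprising_and_right = information_gain(0.5, prior)

# An obvious prediction (four-legged mice) carries no gain.
obvious = information_gain(0.999, 0.999)

# A surprising prediction that assigns low probability to what actually
# happens has a negative gain.
surprising_but_wrong = information_gain(0.001, prior)
```

Because the neutral prior shrinks geometrically with the number of steps, a model must keep assigning substantial probability to the observed path in order to accumulate gain over longer intervals.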
As an example, Fig. 4b shows the information gain of evolutionary predictions for the human influenza virus A/H3N2. We predict the evolutionary path of clade frequencies by an antigenicity–stability fitness model as described above and evaluate the information gain of these predictions compared to a null model of neutral evolution13 (M. Łuksza and M. Lässig, manuscript in preparation). As shown by the time-dependence of the information gain, the model predictions capture the actual evolutionary process with a time horizon in the order of one year. How much this horizon can be extended by improved modelling remains an open question.
The link of evolutionary predictions to information theory underscores an important general point: the predictability of an evolutionary process is not a yes-or-no issue, but is itself a quantitative trait. We can probe this trait by the information gain of actual predictions, which can be evaluated by comparison with posterior data. In this way, we can compare the predictability of different evolutionary processes by a given method, as well as the prediction quality of different methods for a given process.
From prediction to control
Any therapy or intervention against a fast-evolving pathogen is an attempt to control its future population. Such interventions have different goals and strategies, which range from controlling an infection or cancer within an individual patient to reducing the global spread of pathogen resistance130,131. Similarly, the adaptive immune system can be seen as a host's intrinsic strategy to control pathogens133. There is a fundamental link between predicting an evolutionary process and controlling its future outcome. This is because a predictive computational or experimental model does not just reproduce the actual process; it generates an entire probability distribution of possible outcomes. Controlling the process amounts to changing that distribution by means of an external evolutionary pressure. The intended change is often drastic: an a priori likely outcome, such as the occurrence of resistance mutations or the increase of pathogen load, is to become unlikely. If our intervention or therapy can produce the required evolutionary pressure, predictive models can be leveraged to nudge the process towards the intended outcome. Specifically, we can include the control as an additional component into a fitness model and evaluate the evolutionary response of the population to a given control protocol. For example, a xenograft mouse model of melanoma shows increased survival when the drug is withdrawn at predefined time points132, and fitness modelling of these dynamics predicts how the drug protocol can be optimized based on real-time measurements to further increase survival134. HIV combination therapy, a protocol of multiple suppressive drugs, is a classic case of evolutionary control aimed at minimizing the rate of viral escape mutations135. Similarly, a successful vaccine against HIV needs to trigger an immune response that co-evolves with the virus123.
Evolutionary control can reinforce itself if the external adaptive pressure enhances predictability by constraining evolutionary paths (Fig. 2b). For instance, melanoma cells carrying a given mutation in the BRAF oncogene show strong initial response to the drug vemurafenib136, but most cancers of this kind will eventually relapse. The escape to drug resistance appears to be via few mutational pathways, which can be used for predictive analysis of second-line therapy choices. In the coming years, we will have increasingly detailed and time-resolved data of evolutionary pressure and response, for example on immune response to infections by antigen-specific and broadly neutralizing antibodies137. Combined with co-evolutionary fitness models120,121 and fitness models of metabolic pathways under stress138, such data will open new avenues of designing and optimizing evolutionary control.
Prediction and control based on mechanistic evolutionary models are always imperfect, because our knowledge of population data and dynamical parameters remains incomplete for even the simplest biological systems. It is useful to compare mechanistic models with model-free methods, such as deep reinforcement learning. Recent studies have presented remarkable model-free solutions of complex problems; for example, computers can learn to play video games without a priori knowledge of the game139. Can we control an evolving population in a similar way, without prior knowledge of the evolutionary rules? This is far from obvious, given substantial differences in data structure and learning dynamics. Computer game records are comprehensive, free, and fast to acquire; in contrast, evolutionary data are always incomplete, comparatively costly, and ‘computing’ by evolutionary processes is slow. These differences may favour simplified mechanistic models as an avenue to successful prediction and control of evolution.
The link between prediction and control is crucial for ethically responsible decision-making. For example, judging genome editing manipulations must include the question of how predictable their outcome is. Our discussion of evolutionary correlations between phenotypes shows how complex this task is: we have to assess the primary effect of the manipulation, but also secondary changes in other traits that are generated by pleiotropy and epistasis (Fig. 3a). We also need to gauge the effects and persistence of genetic changes under changing environmental and co-evolutionary conditions. For example, drug-resistance mutations can sometimes remain fixed in a population through subsequent epistatic mutations, even when the drug is no longer present and the resistance mechanism bears a fitness cost140. In all of these systems, predictive modelling will take an important role in designing responsible control strategies.
For a growing number of systems, we are witnessing the transition to a new kind of predictive evolutionary biology. In this Perspective, we focused on a specific mode of evolution: fast, predominantly asexual processes driven by a large supply of mutations and strong selection. That is a promising starting point for predictive analysis, but the spectrum of modes and time scales in evolution is clearly much broader. Work in the years ahead will show how predictability plays out in more complex systems, including populations with various rates of recombination. Some of these concepts may also be extendable to repeated patterns in the macro-evolution of multicellular organisms. We expect that the endeavour of predictive analysis will affect our overall view of the life sciences. It will provide a rational basis for decision-making in a number of areas of medicine and public health. At a more fundamental level, it will promote a unifying view on different organisms based on common dynamical principles. Optimizing predictions is a way to learn what the evolutionarily relevant functions of the system are: biology informs predictions and predictions inform biology.
How to cite this article: Lässig, M., Mustonen, V. & Walczak, A. M. Predicting evolution. Nat. Ecol. Evol. 1, 0077 (2017).
We thank M. Desai, I. Gordo, M. Łuksza, T. Mora and A. Nourmohammad for comments on the manuscript. M. Desai, M. Łuksza and A. Nourmohammad also provided important input to illustrations. This work has been partially supported by Deutsche Forschungsgemeinschaft grant SFB 680 (M.L.), Wellcome Trust grant 098051 (V.M.), and European Research Council ERCStG 306312 (A.M.W.).