Directed evolution circumvents our profound ignorance of how a protein's sequence encodes its function by using iterative rounds of random mutation and artificial selection to discover new and useful proteins. Proteins can be tuned to adapt to new functions or environments by simple adaptive walks involving small numbers of mutations. Directed evolution studies have shown how rapidly some proteins can evolve under strong selection pressures and, because the entire 'fossil record' of evolutionary intermediates is available for detailed study, they have provided new insight into the relationship between sequence and function. Directed evolution has also shown how mutations that are functionally neutral can set the stage for further adaptation.
Directed evolution optimizes protein function through successive generations of random mutation and artificial selection or screening. This simple design algorithm circumvents our ignorance of how sequence encodes function and provides a reliable approach to engineering proteins with new and useful properties.
Directed evolution can be envisioned as an uphill walk on a protein fitness landscape, in which regions of higher elevation represent more optimized proteins. The ruggedness of this fitness landscape affects the ability to find uphill paths to fitter sequences and therefore affects the ease of evolutionary searches.
Simple adaptive walks effectively optimize many protein functions, despite landscape ruggedness that arises from epistatic interactions between mutations. The many simple uphill routes to higher fitness can circumvent more convoluted paths that involve neutral or deleterious mutations. More-stable proteins can accept a wider range of mutations and are more evolvable.
Recombination of homologous protein sequences provides access to functional sequences with many mutations. These recombined (chimeric) proteins can exhibit properties outside the range of the parental sequences, such as higher stability or even novel activities.
Directed evolution studies have generated a wealth of information on the structure of protein fitness landscapes, mechanisms of adaptation, pathways that are accessible under different selection pressures and the nature of trade-offs between properties during evolution.
Millions of years of life's struggle for survival in different environments have resulted in proteins providing diverse, creative and efficient solutions to a wide range of problems, from extracting energy from the environment to repairing and replicating their own code. Good solutions to biological problems can also be good solutions to human problems — proteins are widely used in the food, chemicals, consumer products and medical fields. Not content with nature's protein repertoire, however, protein engineers are working to extend known protein function to new environments or tasks1,2,3,4 and to create new functions altogether5,6,7.
Despite major advances, a molecular-level understanding of why one protein performs a certain task better than another remains elusive. This is perhaps not surprising when we remember that a protein often undergoes conformational changes during function and exists as a dynamic ensemble of conformers that are only slightly more stable than their unfolded and non-functional states and that might themselves be functionally diverse8. Mutations far away from active sites can influence protein function9,10. Engineering enzymatic activity is particularly difficult because very small changes in structure or chemical properties can have big effects on catalysis. Thus, predicting the amino acid sequence, or changes to an amino acid sequence, that would generate a specific behaviour remains a challenge, particularly for applications requiring high performance (such as an industrial enzyme or a therapeutic protein). Unfortunately, where function is concerned, details matter, and we just don't understand the details.
Evolution, however, had no difficulty generating these impressive molecules. Despite their complexity and finely tuned nature, proteins are remarkably evolvable: they can adapt under the pressure of selection by changing their behaviour, function and even fold. Protein engineers have learned to exploit this evolvability using directed evolution — the application of iterative rounds of mutation and artificial selection or screening — to generate new proteins. Hundreds of directed evolution experiments have revealed the ease with which proteins adapt to new challenges11. Notable recent examples include a recombinase evolved to remove proviral HIV from the host genome (providing a new strategy for treating retroviral infections)12, a cytochrome P450 fatty acid hydroxylase that was converted into a highly efficient propane hydroxylase (thereby proving that a cytochrome P450 is fully capable of hydroxylating small alkanes, even though most propane-using organisms use structurally and mechanistically unrelated enzymes)13, a more than 40 °C increase in the thermostability of lipase A (extending its application in biocatalysis to a whole new set of environments)14 and a variant of green fluorescent protein that tolerates having all its leucine residues replaced with a non-natural amino acid, trifluoroleucine15. Roger Tsien shared the 2008 Nobel Prize in Chemistry for his work on the fluorescent proteins that have transformed biological imaging16. Directed evolution had a key role by improving many features of fluorescent proteins, including emission and excitation properties, quantum yield, multimerization state and maturation rate4,17.
Directed evolution has become a common laboratory tool for altering and optimizing protein function (as well as the function of other biological molecules and systems, including RNA, DNA regulatory elements, biosynthetic pathways and genetic regulatory circuits18,19,20) (Box 1). To understand the power, and the limitations, of directed evolution, it is helpful to view it as a biological optimization process. We therefore introduce the concept of evolution on a fitness landscape in protein sequence space and use this framework to explain directed evolution strategies. Data from laboratory evolution experiments have revealed important features of this fitness landscape and the types of trajectories that can traverse it efficiently. This landscape picture can help explain why decomposing a large functional hurdle into a series of smaller ones and exploiting protein modularity and structural information are useful strategies for dealing with the combinatorial explosion of possible paths in an evolutionary search. This also helps us to appreciate the power of recombination to generate functional sequences with numerous (mostly neutral) mutations, novel combinations of which can give rise to new protein behaviours and therefore new starting points for optimization of protein function.
There is little doubt that directed evolution is one of the most effective and reliable approaches to engineering useful new proteins. Perhaps less well appreciated, however, is how much our understanding of protein function and evolution has been enriched by data from these experiments. Directed evolution allows us to disconnect a protein from its natural context and observe how adaptation to different functional challenges can occur. These experiments can explore the boundaries between biological relevance (the ability of a protein to contribute to the reproductive fitness of an organism) and what is physically possible (the ability of a protein to carry out a specific function in vitro or in vivo) in ways that studies on natural proteins alone cannot. Directed evolution can test alternative adaptive scenarios, explore the range of possible solutions to a given functional challenge, examine relationships between different protein properties (for example, trade-offs, in which improvements in one property are accompanied by losses of another) and provide biophysical explanations for evolutionary phenomena. Much has been discovered since these topics were first reviewed in the context of temperature adaptation21,22. In this Review, we revisit some of these early lessons and discuss new ones that have emerged.
Protein fitness landscapes
In his influential 1970 paper, John Maynard Smith eloquently described protein evolution as a walk from one functional protein to another in the space of all possible protein sequences23. He arranged all proteins of length L such that sequences differing by one amino acid mutation were neighbours. Although the distance between any two sequences is small (that is, it equals the number of mutations required to interconvert the sequences and is therefore ≤L), this high-dimensional space contains an incomprehensibly large number of possible proteins. For even a small protein of 100 amino acids there are 20^100 (∼10^130) possible sequences — more than the number of atoms in the universe. Searching in this space for billions of years for solutions to survival, nature has explored only an infinitesimal fraction of the possible proteins24. Furthermore, natural evolution keeps only sequences that are biologically relevant; others are discarded, even if they represent solutions to other interesting problems. There are so many proteins waiting to be discovered and we can only dream about the extent of their capabilities. Directed evolution is one way to extend protein function to new, non-natural tasks and convert dreams into actual proteins.
Each sequence in Maynard Smith's protein space can be assigned a 'fitness', which in natural evolution is a measure of the host organism's ability to reproduce in a given environment: fitter organisms reproduce faster and their genes spread throughout the population25. When artificial selection is imposed, fitness is defined by the experimenter. High-fitness sequences satisfy all of the criteria for a protein to function as desired, or at least to perform well in the assay used for screening, and might include the ability to recognize one substrate but not another, to be expressed at high levels in a particular host organism, to not aggregate and to have a long lifetime. Protein evolution can then be envisioned as a walk on this high-dimensional fitness landscape, in which regions of higher elevation represent desirable proteins, and iterations of mutation and artificial selection continuously discover new sequences further uphill, with higher fitnesses (Fig. 1a).
As with any optimization problem, the structure of the objective function (the fitness landscape) influences the effectiveness of a search strategy26. Possibilities range from smooth, single-peaked 'Fujiyama' landscapes to rugged, multi-peaked 'Badlands' landscapes27 (Fig. 1b). The rougher the landscape, the harder it is for evolution to climb. Local optima create traps that evolution cannot escape from unless a side-step or even a temporary decrease in fitness is permitted, or if multiple simultaneous mutations enable a jump to a new peak. The easiest landscape to climb is one that offers many smooth, uphill paths to the desired fitness (the Fujiyama landscape).
This terrestrial landscape analogy should be interpreted cautiously, however, because it cannot accurately represent the numerous possible paths that evolution can take to higher fitness (or the even larger number of possible downhill paths). Although it is easy to visualize being caught on a local optimum in a three-dimensional landscape, a local optimum in protein sequence space (in which all possible mutations are deleterious) might be rare, unless stability has been compromised and few new mutations can be accepted. For example, the introduction of stabilizing mutations can increase a protein's mutational robustness, opening new routes for further adaptation28,29.
The vast size of sequence space makes it impossible to characterize (or even model) more than a minute fraction of this fitness surface. Despite this, several important features have emerged from accumulated experimental studies. The first is the low overall density of functional sequences: the vast majority do not code for any functional protein, much less the desired protein30,31,32. Another important feature is the uneven distribution of functional sequences. Although representing a very small fraction of all possible sequences, functional sequences are often next to other functional sequences33,34,35. Maynard Smith recognized that this feature was a requirement for evolution by point mutation to be successful. Evolution can step one mutation at a time only if there is a continuous network of functional proteins, otherwise mutation would always lead to lower fitness and evolution would stop23. Proteins are in fact robust to mutation — a significant fraction of possible mutants retain their fold and function36,37.
Whereas natural evolution can discover new protein functions along circuitous paths that involve many neutral or even slightly deleterious mutations, directed evolution does not have that luxury. Because the possible evolutionary paths grow exponentially as mutations accumulate and there are too many ways to take neutral or deleterious steps that do not ultimately lead uphill, directed evolution is largely constrained to moving continuously uphill in an adaptive walk38. This is often not a severe limitation because many interesting proteins are accessible by short and simple adaptive walks. Although the resulting proteins, or even the mutations, might not be the same as those discovered by more convoluted paths to the same fitness level, they nonetheless provide valuable insights into protein function and routes of adaptation.
Strategies for directed evolution
Before we describe some of the key lessons that directed evolution studies have taught us about protein function and evolution, we briefly discuss the experimental strategy. How the experiment is performed obviously influences the outcome and, therefore, the information that is extracted from it. Finding a sequence that performs a desired function in a vast space of possible sequences that is only sparsely populated with functional ones might seem like a daunting task. Inefficient searches of this space could take essentially forever and the task of the protein engineer is to choose a strategy that will reach the objective and do so quickly and easily. Starting with a functional protein, directed evolution uses repeated generations of mutation to create functional variation and selection of the fittest variants to direct the search to higher elevations on the fitness landscape. It involves four key steps (Fig. 2). First, identifying a good starting sequence; second, mutating this 'parent' to create a library of variants; third, identifying variants with improved function and last, repeating the process until the desired function is achieved. There are many options for the implementation of each step, the choice of which can greatly affect both the efficiency and the endpoint of an evolutionary search.
Directed evolution (and, indeed, natural evolution) relies on the ability of proteins to function over a wider range of environments or carry out a wider range of functions than might be biologically relevant at a given time and therefore selected for. This ability to tolerate a non-natural environment or to exhibit 'promiscuous' functions at some minimal level provides the jumping-off point for optimization towards that new goal. A good parent protein for directed evolution, therefore, exhibits enough of the desired function that small improvements (expected from a single mutation) can be reliably discerned in a high-throughput screen38. It is also easy to work with and sufficiently stable to accommodate multiple, potentially destabilizing, mutations if the target function is some other property. Some proteins are much more evolvable than others11,29,39,40. Possible molecular mechanisms that contribute to evolvability have been discussed, including the key role of the chemical mechanism in enzyme functional evolution41,42 and the idea that evolvable proteins exist in multiple closely related but functionally diverse conformations, the distribution of which is easily altered by mutation8. These ideas, however, are still largely speculative, and little other than the ability to accept mutations29,43 has been conclusively shown in laboratory evolution experiments to contribute directly to allowing one protein to adapt to a new challenge more readily than another protein. A good heuristic indicator of a protein family's evolvability is its natural functional diversity40,44. Proteins that have adapted to exhibit a range of functions across their family, for example members of an enzyme family that accepts a wide range of substrates (although individual enzymes in the family might be specific) are likely to be adaptable in the laboratory.
The next step is to create a library of variants. As screening is often the most difficult experimental step, the library is usually created to generate the highest probability of finding improved proteins given the screening capability. Because most mutations are deleterious and multiple mutations frequently inactivate proteins (see below), this usually involves a low mutation rate (one or two amino acid substitutions per gene). If screening is not difficult (for example, there is a good genetic selection), then the library can be constructed to generate the largest potential improvement. This might mean a slightly higher mutation rate45. In either case, mutations can be introduced randomly1 or, if structural or mechanistic information is available, they can be made in a more directed manner46,47,48 in an effort to increase the frequency of improved proteins and reduce the load in the next step.
Screening (with high-throughput functional assays) or selection (for example, a genetic selection in which hosts with improved proteins outcompete the others) is used to identify the library members improved in the target property. A good screen or selection accurately assesses the target properties. The rule 'you get what you screen for' is always useful to remember — screening (or selecting) for something else is risky49. It is also important not to demand too much improvement in a single generation. The hurdle must be tuned to the screening capacity and should usually be no greater than the improvement that can be provided by a single mutation. If the desired function is beyond what a single mutation can accomplish, the problem can be broken down into a series of smaller ones that can be solved by the accumulation of single mutations, for example by gradually increasing the selection pressure or evolving against a series of intermediate challenges13. The process of mutation and selection is repeated until the fitness objective is met; the number of iterations required obviously depends on the starting fitness and the improvement that can be achieved in each round, but is often only five to ten generations.
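The cycle of choosing a parent, mutating it, screening the library and repeating can be sketched as a greedy adaptive walk. The following Python sketch is only an illustration under strong assumptions: the additive `toy_fitness` function, the target sequence, the library size and all function names are hypothetical stand-ins, and a real screen would be noisy and far more expensive than a function call.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_fitness(seq, target):
    """Hypothetical additive fitness: number of positions matching a target."""
    return sum(a == b for a, b in zip(seq, target))

def mutate(seq, rng):
    """Introduce exactly one random amino acid substitution."""
    pos = rng.randrange(len(seq))
    new = rng.choice([aa for aa in AMINO_ACIDS if aa != seq[pos]])
    return seq[:pos] + new + seq[pos + 1:]

def directed_evolution(parent, target, library_size=500, max_rounds=10, seed=0):
    """Greedy adaptive walk: keep the best single mutant of each round."""
    rng = random.Random(seed)
    parent_fit = toy_fitness(parent, target)
    history = [(parent, parent_fit)]
    for _ in range(max_rounds):
        # Step 2: build a library of single mutants of the current parent.
        library = [mutate(parent, rng) for _ in range(library_size)]
        # Step 3: "screen" the library and pick the fittest variant.
        best = max(library, key=lambda s: toy_fitness(s, target))
        best_fit = toy_fitness(best, target)
        if best_fit <= parent_fit:
            break  # no improved variant found in this round
        # Step 4: the improved variant becomes the parent of the next round.
        parent, parent_fit = best, best_fit
        history.append((parent, parent_fit))
    return history

if __name__ == "__main__":
    for generation, (seq, fit) in enumerate(
        directed_evolution("MKTAYIAKQR", "MKVLYIGKHR")
    ):
        print(generation, seq, fit)
```

Because each round accepts only a strictly improved single mutant, the walk can only move uphill, which mirrors the constraint on directed evolution discussed in the text; it stops early if the library contains no improved variant.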
Mutational steps. An evolutionary search relies on the presence of functional diversity in a population, which is the result of underlying genetic variation. At the molecular level, this genetic variation can take many forms; for example, point mutations, insertions, deletions, recombination and circular permutation50,51,52. To search efficiently and minimize the screening load, the underlying genetic variation should be set to generate the highest probability of improvement. Statistically, random mutations tend to be quite harsh, usually decreasing activity and sometimes destroying it altogether. Typically, 30–50% of single amino acid mutations are strongly deleterious, 50–70% are neutral or slightly deleterious and 0.01–1% are beneficial11,29,37,53,54,55,56. If the fitness landscape is Fujiyama-like with many smooth uphill paths, only beneficial mutations need to accumulate (either in multiple rounds of mutagenesis and screening or by recombining beneficial mutations found in each round57,58) until the desired fitness is reached. In a single-peaked landscape, all beneficial mutations make a cumulative contribution to the desired function and all paths uphill eventually converge to the same optimal solution.
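The quoted frequency of beneficial mutations (0.01–1%) translates directly into a screening load. As a rough sketch, assume each screened variant is an independent draw that is beneficial with probability p (an idealization, since real libraries are biased and contain duplicates); the chance of seeing at least one improved variant among N is then 1 − (1 − p)^N. The function names below are hypothetical.

```python
import math

def prob_at_least_one(p_beneficial, n_screened):
    """Probability that n independent variants include >= 1 beneficial one."""
    return 1.0 - (1.0 - p_beneficial) ** n_screened

def variants_needed(p_beneficial, confidence=0.95):
    """Smallest library size giving the requested chance of >= 1 hit."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_beneficial))

if __name__ == "__main__":
    # Illustrative values spanning the 0.01-1% range quoted in the text.
    for p in (0.01, 0.001, 0.0001):
        print(f"p = {p}: screen about {variants_needed(p)} variants")
```

Under these assumptions, a beneficial fraction of 1% calls for about 300 variants for a 95% chance of at least one hit, whereas a fraction of 0.01% calls for roughly 30,000, which is one reason the screening capacity shapes the library design.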
Of course, no real protein landscape consists of a single peak. Most mutations are deleterious and therefore most paths end downhill, with inactive proteins, rather than uphill at fitter sequences. Furthermore, epistatic interactions occur when the presence of one mutation affects the contribution of another mutation. Such epistatic interactions lead to curves in the fitness landscape and constrain evolutionary searches. Extreme forms of epistasis, in which mutations that are negative in one context become beneficial in another (so-called sign epistasis59), create local optima on the landscape that can frustrate evolutionary optimization. Epistatic interactions are a ubiquitous feature of protein fitness landscapes60,61. We argue, however, that they are not important for most optimizations by directed evolution, which instead follow one of many smooth paths that bypass the more rugged, epistatic routes on this high-dimensional surface62,63,64. Among the numerous mutational trajectories between a starting point and a solution, smooth uphill paths can often be found (Fig. 1c).
Dealing with the combinatorial explosion. Knowing of epistatic interactions and local fitness optima, some protein engineers worry about the need to make and find multiple mutations at one time. If multiple mutations are needed to climb the peak, the combinatorial explosion of mutational possibilities makes them especially challenging to find. For even a small protein of 100 amino acids, there are 1,900 single amino acid mutants and more than 1.5 million double mutants. The number of possible sequences increases exponentially with the number of mutations and a complete sampling of even just the double mutants is beyond the capacity of most screens.
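The counts quoted above follow from a simple formula: a protein of length L has C(L, k) × 19^k variants that differ from the parent at exactly k positions. The short Python sketch below reproduces the numbers for a 100-residue protein; the function name is our own.

```python
from math import comb, log10

def count_mutants(length, k, alphabet=20):
    """Number of sequences differing from a parent at exactly k positions."""
    return comb(length, k) * (alphabet - 1) ** k

if __name__ == "__main__":
    L = 100
    print(count_mutants(L, 1))   # 1900 single mutants
    print(count_mutants(L, 2))   # 1786950 double mutants
    print(count_mutants(L, 3))   # triple mutants, about 1.1 billion
    # Summing over all k recovers the whole sequence space, 20^100.
    total = sum(count_mutants(L, k) for k in range(L + 1))
    print(total == 20 ** L, round(log10(total), 1))
```

Each additional simultaneous mutation multiplies the number of candidates by several hundred at this length, so exhaustive sampling becomes impractical after the first step or two.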
Higher-throughput screening approaches have been developed to enable sampling of more mutants and more combinations of mutations3,65,66. These screens can allow multiple paths to be explored simultaneously, increasing the probability of discovering good adaptive routes to higher fitness. However, higher-throughput screens or selections usually come at the cost of decreased accuracy, especially when a surrogate function that is more amenable to high-throughput measurement is substituted for the desired function. Furthermore, increasing the mutation rate to capture rare synergistic mutations can make it more difficult to identify improved single-mutation variants because common deleterious mutations will tend to mask the rare beneficial ones. It is often better, therefore, to focus on sampling single mutants with a higher quality, lower-throughput screen rather than on increasing the throughput to capture multiple simultaneous mutations. Although a search through single adaptive steps cannot find mutations exhibiting negative epistasis, there are usually other, step-wise adaptive routes to the objective.
The high dimensionality of sequence space that makes finding simultaneous beneficial mutations so difficult can be reduced by taking advantage of structural, functional or phylogenetic information to focus mutations to those residues that are most likely to lead to the desired properties. For example, the modularity of protein structures permits the separate optimization of protein domains13,67. Phylogenetic analyses suggest that nature might separately optimize other, structurally non-obvious subunits, or 'sectors'68, which could prove to be appropriate targets for directed evolution. The search space can also be reduced by focusing mutations to specific residues in a domain; for example, in an active site or binding pocket in which functional changes might be more likely to occur11,46,69,70,71. This strategy only works, however, when the experimenter is able to select the right residue combinations for random mutagenesis, leaving out the possibility of finding surprising and informative solutions elsewhere. Numerous studies have shown, for example, that many activating mutations lie outside enzyme catalytic sites and exert their influence through mechanisms that might not be obvious from structural analysis9,10,72.
Alternative search strategies. Evolution by the accumulation of single mutations has proven to be very effective at optimizing a function or property that already exists or can be reached through a series of intermediate steps. Some functions, however, simply cannot be reached through a series of small uphill steps and instead require longer jumps that include mutations that would be neutral or even deleterious when made individually. Examples of functions that might require multiple simultaneous mutations include the appearance of a new catalytic activity or an activity on a substrate for which the parent and its single mutants show no measurable activity.
Because most mutations are deleterious, the probability that a variant retains its fold and function declines exponentially with the number of random substitutions36,37, and random jumps in sequence space uncover mostly inactive proteins. Thus, new functions are extremely difficult to obtain without altering some aspect of the search. One approach is to create a new starting point — a parent protein with at least some minimal function — and improve that by directed evolution7. Where natural examples of a desired function are not practical or might not even exist, emerging protein design tools have identified functional sequences5. Expanding the sequence space by the incorporation of non-natural amino acids can also introduce a whole array of new functions and directed evolution can do the fine-tuning that might be needed to optimize these novel designs15. Another approach is to find more conservative ways to make multiple mutations; for example, using computational protein design tools to identify sets of mutations that are likely to be compatible with structure retention47.
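The exponential decline mentioned at the start of this paragraph can be illustrated with a one-parameter sketch. Assume, purely for illustration, that each random substitution independently preserves fold and function with a fixed probability (called `neutrality` here; the value 0.6 is an arbitrary choice, not a measured quantity). A variant with m substitutions then remains functional with probability neutrality^m. If the number of substitutions per gene is further assumed to be Poisson-distributed with a given mean, the functional fraction of the whole library has a closed form.

```python
import math

def fraction_functional(n_mutations, neutrality=0.6):
    """Probability that a variant with n random substitutions still functions,
    assuming each substitution is tolerated independently."""
    return neutrality ** n_mutations

def library_fraction_functional(mean_mutations, neutrality=0.6):
    """Functional fraction of a library whose per-gene mutation count is
    assumed to be Poisson-distributed with the given mean."""
    return math.exp(-mean_mutations * (1.0 - neutrality))

if __name__ == "__main__":
    for m in (1, 2, 5, 10, 20):
        print(m, fraction_functional(m))
    for mean in (1, 2, 5):
        print(mean, library_fraction_functional(mean))
```

Under this simple model, a handful of random substitutions already leaves only a small minority of variants functional, which is consistent with the preference for low mutation rates and with the difficulty of making long random jumps in sequence space.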
An approach to making multiple mutations that is used extensively in nature is recombination. Naturally occurring homologous proteins can be recombined to create genetic diversity within protein sequence libraries73,74,75 (Fig. 3a). It has been shown that mutations made by recombination are much less disruptive and generate functional proteins with much higher frequency than random mutations56 (Fig. 3b). Recombination methods based on DNA sequence hybridization direct crossovers to regions of high sequence identity and are generally limited to sequences that are very similar (with more than 70% identity)75, whereas various sequence-independent methods can recombine at random76,77 or user-specified sites78,79. Recombining homologous proteins by choosing crossovers based on structural information allows the construction of libraries of chimeric proteins that simultaneously exhibit high levels of functionality and genetic diversity80. In all cases, the chimeric proteins inherit the best (and worst) residues the parents have to offer, in new combinations that are not observed in nature.
Chimeric proteins can differ by tens or even hundreds of mutations from their parent sequences and still function. The conservative nature of recombination can be exploited to make whole families of novel enzymes. For example, in one set of more than 6,000 chimeric cytochrome P450 proteins with an average of 70 mutations from the closest parent, approximately half folded properly, and at least 75% of these folded P450 proteins displayed enzymatic activity80.
The new combinations of residues can give rise to novel properties81. Because many of the mutations made by recombination are neutral or nearly neutral, recombination is an efficient way to generate the neutral drifts (the accumulation of neutral mutations) that have been shown to lead to increases in promiscuous functions82,83 and mutational robustness84,85. For example, members of the chimeric cytochrome P450 library exhibited higher enzymatic activity than any of the three parents across a panel of 11 non-native substrates that included substrates on which the parent enzymes showed no measurable activity86. Several P450 chimaeras were also more thermostable than the most thermostable parent enzyme, and dozens of thermostable chimaeras could be readily identified based on a small sampling of the library87 (Fig. 3c). This approach was subsequently used to generate dozens of highly stable, highly active fungal cellobiohydrolase II enzymes that degrade cellulose into fermentable sugars (for biofuels applications, for example)79.
Lessons from directed evolution
In addition to generating a plethora of novel proteins, directed evolution studies have elucidated available pathways and molecular mechanisms of adaptation, shown a key role for stability in epistasis and evolvability, identified important evolutionary trade-offs in protein properties and revealed the simultaneously conservative and exploratory nature of recombination, all of which have shed light on long-standing questions in protein chemistry and evolutionary biology. First and foremost, directed evolution experiments have shown time and again how rapidly proteins can adapt to exhibit new functions and properties. Protein behaviour can change dramatically on mutating a very small fraction of the protein sequence. Directed evolution also provides a detailed view of the adaptive process.
A directed evolution approach to studying sequence–function relationships circumvents several challenges associated with inferring mechanisms of adaptation using comparisons of evolutionarily related natural amino acid sequences21,22. Such studies are confounded by the numerous, mostly neutral mutations that accumulated during divergence of the sequences and the complex and largely unknown selection pressures under which the natural sequences evolved. By contrast, the sequences generated by directed evolution contain a small number of adaptive mutations that accumulated under well-defined selective pressures. Furthermore, performing the evolution in the laboratory permits access to the full 'fossil record' of evolutionary intermediates, the sequences, structures and functions of which can be analysed in an attempt to explain how new properties were acquired10,44,72,88. Fasan and co-workers analysed selected intermediates that arose during the directed evolution of a cytochrome P450 fatty acid hydroxylase into a highly efficient and highly specific propane monooxygenase13,72 (Fig. 4). The gradual increase in activity on propane (as measured by total turnovers of propane to propanol — the property targeted during directed evolution) was accompanied by other interesting changes in the enzyme's behaviour, the most notable of which was the decrease in thermostability (as measured by T50; the temperature at which 50% of the proteins are inactivated in 10 minutes). Activating mutations came at the cost of thermostability, to the point that it became necessary to incorporate stabilizing mutations (generation nine in Fig. 4) before further increases in activity could appear. This apparent trade-off between functionally beneficial mutations and thermostability reflects the fact that most mutations are destabilizing and therefore most activating mutations are also destabilizing. 
Because evolution favours the most likely solutions over rarer ones, it favours marginal stability in the absence of selection for higher stability. It also favours properties that are compatible with marginal stability32. Such trade-offs have also been shown to constrain the evolution of antibiotic resistance enzymes89 and will be discussed further below.
The mutations that accumulated in the haem domain of the cytochrome P450 are depicted in Fig. 4b and are colour-coded according to the generation in which they appeared. Many of the mutations that conferred the increased activity on propane lie outside the substrate-binding pocket, where they influence substrate recognition through mechanisms that are difficult to discern from crystal structures or modelling. That the effects of the adaptive mutations are difficult to rationalize, much less predict, underscores how little we understand of how sequence determines protein structure and function. Directed evolution deals with the details of molecular interactions, and it is hoped that these details will eventually help protein design efforts7.
Directed evolution can explore alternative evolutionary scenarios; for example, to identify other possible solutions to the same functional challenge or to address whether multiple paths can lead to the same solution, as was done with a laboratory-evolved β-lactamase variant that contains 5 mutations responsible for a 100,000-fold increase in cefotaxime resistance63. In this study, the authors constructed variants with all 32 (2⁵) combinations of the adaptive mutations, representing all intermediate sequences along all 120 (5 factorial) possible mutational pathways. They were able to estimate the probability of each pathway based on the relative change in antibiotic resistance conferred to the bacteria by each mutation along each path. Whereas most of the possible paths were constrained by epistasis and were therefore highly unlikely, there were 18 different, simple uphill walks to the final solution.
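The combinatorial bookkeeping behind this kind of analysis can be sketched in a few lines of Python. The example is illustrative only: the `toy_fitness` function, its additive effects and its single sign-epistatic interaction are assumptions made for demonstration, not the measured β-lactamase resistance values, so the number of uphill paths it reports is not meant to reproduce the 18 found in the study. It enumerates the 32 genotypes and the 120 mutational orders, then counts the orders along which fitness rises at every step.

```python
from itertools import permutations, product

N_MUTATIONS = 5  # number of adaptive mutations, as in the study described above


def toy_fitness(genotype):
    """Hypothetical fitness (NOT measured data).

    Each mutation adds 1 unit, but mutation 0 carries a penalty of 2 units
    unless mutation 1 is already present (a single case of sign epistasis).
    """
    fitness = sum(genotype)
    if genotype[0] == 1 and genotype[1] == 0:
        fitness -= 2
    return fitness


def is_uphill(order, fitness_fn):
    """Return True if fitness strictly increases at every step of the path."""
    genotype = [0] * len(order)
    current = fitness_fn(tuple(genotype))
    for site in order:
        genotype[site] = 1  # add the next mutation in this order
        new = fitness_fn(tuple(genotype))
        if new <= current:
            return False
        current = new
    return True


# All 2**5 = 32 combinations of presence (1) or absence (0) of each mutation.
genotypes = list(product((0, 1), repeat=N_MUTATIONS))

# All 5! = 120 orders in which the five mutations can be acquired one at a time.
pathways = list(permutations(range(N_MUTATIONS)))

uphill_paths = [p for p in pathways if is_uphill(p, toy_fitness)]

print(len(genotypes), len(pathways), len(uphill_paths))
```

With this toy function, a path is uphill only if mutation 1 is acquired before mutation 0, so a single epistatic interaction is enough to close off half of the 120 orders. Replacing `toy_fitness` with a lookup of measured values for each of the 32 variants would give the corresponding count for real data.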
Empirical landscapes. Even the earliest directed evolution experiments noted how rapidly proteins could adapt to new selective pressures1,58, indicating the ready availability of smooth uphill paths in the fitness landscapes. Stability, the ability to tolerate new environments and low-level side reactions or promiscuous functions usually respond well to directed evolution. One study used a well-controlled set of experiments to select for six different promiscuous activities starting from three different enzymes11. After 2 rounds of directed evolution, yielding just 1–4 mutations, the promiscuous enzyme activities (kcat/KM) had increased by up to 150-fold over the activities of the parent enzymes. Interestingly, these newly evolved activities came at little cost to the native enzymatic activities, suggesting a particular robustness of the native functions to mutation and supporting a scenario for evolution of new activities that allows both the native and novel activities to be displayed in the same gene for some period of time8.
As well as demonstrating the availability of smooth uphill paths, directed evolution has provided insight into the molecular epistasis that curves the landscapes. Several studies have revealed a key link between stability and epistasis, where the effect of a mutation can be conditional on the stability of the parent sequence36,43,90 (Fig. 5). This was demonstrated, for example, in a study of cephalosporin antibiotic resistance mutations in β-lactamase, in which the fitness effects of several active site mutations were found to depend on the presence of a stabilizing M182T mutation89 (Fig. 5a). These epistatic interactions are the result of catalytically beneficial but destabilizing mutations in the active site that cannot be tolerated unless the stabilizing M182T mutation is present. Without M182T, the active site mutations destabilize the enzyme to the point that total activity is compromised.
Many examples of stability-mediated epistasis are best explained in terms of a protein stability threshold, whereby stability is under selection only insofar as it allows a protein to fold and function36,43,91 (Fig. 5). The consequences for evolution are profound: a protein with low stability cannot accept more than a small fraction of the possible mutations because most mutations are destabilizing. Thus, it can become trapped on a local optimum, unable to go further. As illustrated in Fig. 5b, proteins enjoying a larger margin above the minimal stability threshold can explore many more mutations and can therefore continue to adapt to other tasks, such as acquiring activity towards a new substrate or partner29. Stability-mediated epistasis is a mechanism whereby neutral mutations can shape the available adaptive pathways during natural evolution as well as in the laboratory. Experience has shown that when an evolutionary search in the laboratory seems to have exhausted all options for further uphill steps, the incorporation of stabilizing mutations is able to open up new adaptive routes13.
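The threshold model can be made concrete with a minimal sketch. All names and numbers here (the `Mutation` class, the stability margins in arbitrary units, the tenfold activity gain) are hypothetical and chosen only to illustrate the logic; the one idea taken from the text is that a protein functions only while its stability remains above a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    name: str
    stability_change: float  # positive = stabilizing (arbitrary units, assumed)
    activity_factor: float   # multiplicative effect on activity (assumed)


def fitness(parent_margin, mutations, threshold=0.0):
    """Toy threshold model: activity counts only if the protein still folds.

    parent_margin is the parent's stability above the threshold; the protein
    is treated as non-functional once the margin drops below the threshold.
    """
    margin = parent_margin + sum(m.stability_change for m in mutations)
    if margin < threshold:
        return 0.0  # unfolded: no activity, whatever the active site looks like
    activity = 1.0
    for m in mutations:
        activity *= m.activity_factor
    return activity


# Hypothetical mutations: one activating but destabilizing, one purely stabilizing.
activating = Mutation("activating", stability_change=-1.5, activity_factor=10.0)
stabilizing = Mutation("stabilizing", stability_change=2.0, activity_factor=1.0)

PARENT_MARGIN = 1.0  # a marginally stable parent (assumed value)

for label, muts in [
    ("parent", []),
    ("activating only", [activating]),
    ("stabilizing only", [stabilizing]),
    ("both", [stabilizing, activating]),
]:
    print(label, fitness(PARENT_MARGIN, muts))
```

In this sketch the activating mutation is lethal on its own, the stabilizing mutation is neutral on its own, and together they give a tenfold gain. The same mutation is therefore deleterious or beneficial depending on the stability of the background, which is the pattern of epistasis described above.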
Although the experiments were performed on different protein folds with selection for different protein functions, the repeated evaluation of thousands of random mutations has revealed the general features of protein fitness landscapes. In addition to the uphill paths that lie alongside numerous less favourable, epistatic routes, there is an even larger number of side-steps in the protein fitness landscape. The high frequency of neutral mutations observed during evaluation of random mutant libraries suggests a myriad of sequences with essentially equivalent fitness. This is consistent with the existence of natural protein homologues that differ at several positions, the majority of which are functionally neutral. Even a highly optimized sequence is probably just one of many potential solutions to a given functional challenge. Indeed, it is probably more accurate to imagine protein evolution occurring on neutral networks, rather than on fitness landscapes in which each neighbour has a different fitness28,62. This pervasive neutrality is exploited when families of functional proteins are constructed by recombination of homologous proteins79,80.
As discussed above in the context of stability-mediated epistasis, mutations that are neutral in one context might not be neutral in all and therefore can provide new opportunities for evolution. Directed evolution has shown an important role for stabilizing mutations (which can be functionally neutral or only slightly deleterious) in adaptation. Laboratory evolution experiments have also shown that purposefully accumulated neutral mutations alter promiscuous activities and create new starting points for subsequent adaptive evolution82,83,92. Genetic drift and pre-existing diversity might have a similarly important role in natural adaptive evolution62.
Directed evolution to understand natural evolution? An overall picture of the protein function landscape is therefore emerging from accumulating directed evolution data. This picture offers a description of the physical features that all proteins (synthetic or natural) must exhibit and the effects of mutations on these features. Extending the lessons learned from directed evolution to natural evolution, however, requires caution because these search processes operate under, for example, different time scales, population sizes, mutation rates and strength of selection. Furthermore, natural evolution works on a different fitness landscape and it is unclear how the protein fitness assayed during directed evolution is related to the organismal fitness that natural evolution optimizes. Differences reflect the consequences of interactions between the protein and the cellular environment and might include constraints related to metabolic burden, regulation, non-specific interactions and other factors.
The ability to disconnect a protein from its in vivo function is a valuable asset of directed evolution because it allows the exploration of physically possible proteins without the often-severe constraint of their being biologically relevant and contributing to organismal fitness. Thus, directed evolution can be used to identify which features of proteins are dictated by their physical properties versus those that are due to biological constraints or evolutionary origins and history. The laboratory evolution of the cytochrome P450 propane monooxygenase (Fig. 4), for example, showed the physical possibility, and indeed the ready availability, of such an enzyme, even though known organisms that live on small alkanes use mechanistically and evolutionarily unrelated enzymes for this transformation72. Another example is the generation of proteins with combinations of properties that are usually not found in natural proteins, such as high catalytic activity at low temperature and high stability at elevated temperature21,22. When properties seem to trade off like this, it might be tempting to infer that such trade-offs are dictated by physical requirements, such as the incompatibility between molecular rigidity that is needed for high stability and the flexibility that is required for catalytic activity93,94. If stability and enzymatic activity placed mutually exclusive demands on protein flexibility, then highly active, highly stable enzymes could not exist (a statement that protein engineers did not want to hear). Directed evolution, however, has little trouble finding enzymes that are both highly active and highly stable when the experiments select for both properties95. Clearly, such proteins are far rarer than highly active, marginally stable proteins and, without a good reason, natural sequences would not exhibit both features21,22,32,96.
Despite the vast size of sequence space and the complex nature of protein function, the Darwinian algorithm of mutation and selection provides a powerful method to generate proteins with altered functions. This simple uphill walk on a fitness landscape in sequence space works because proteins are wonderfully evolvable and can adapt to new conditions or even take on new functions with only a few mutations.
In addition to providing useful proteins, directed evolution experiments have also taught us how proteins adapt and shed light on processes at work during natural evolution21,62,97. These experimental results allow us to look at sequence data in a functional context, providing a bridge between long-separated fields of evolutionary and molecular biology98. Directed evolution experiments have been used to address important evolutionary questions about the average effects of mutations, mechanisms of functional divergence, evolvability and evolutionary constraints11,85,96,99.
With the growing number of applications for engineered proteins, directed evolution will continue to be an important strategy for making proteins that are well adapted to new environments and new functions. More advanced high-throughput screens and higher quality sequence libraries will make the searches easier and will enable evolution to solve increasingly complex problems. Advances in our understanding of proteins can be incorporated into library design, and the rapidly decreasing cost of DNA synthesis will relieve many sequence construction constraints. Directed evolution will help teach us how biological systems adapt to changing demands; it might also help us to address some of today's most challenging problems of providing effective treatments for disease or producing fuels and chemicals from renewable resources.
The authors acknowledge support from the U.S. Army Research Office, Department of Energy, National Science Foundation and the National Institutes of Health.
- Evolvability
A measure of the ability of a protein to adapt in response to mutation and selective pressure; for example, the frequency of beneficial mutations.
- Directed evolution
The application of iterative rounds of mutation and artificial selection or screening to alter the properties of biological molecules and systems.
- Fitness landscape
The mapping from genotype (target sequence) to phenotype (fitness; as measured in the experiment). Directed evolution is an optimization on the fitness landscape.
- Recombination
A procedure whereby chimeric proteins are created by recombining sequence fragments from different (usually evolutionarily, and therefore structurally, related) parent proteins.
- Protein sequence space
The space of all possible protein sequences arranged such that sequences that differ by single mutations are neighbours.
- Adaptive walk
An uphill trajectory on the fitness landscape, in which no deleterious mutations are accepted.
- Neutral drift
The accumulation of mutations that have little or no effect on a particular protein function. These mutations, however, might affect other properties.
- Neutral network
An interconnected network of functionally neutral sequences.