Directed evolution has proved to be an effective strategy for improving or altering the activity of biomolecules for industrial, research and therapeutic applications. The evolution of proteins in the laboratory requires methods for generating genetic diversity and for identifying protein variants with desired properties. This Review describes some of the tools used to diversify genes, as well as informative examples of screening and selection methods that identify or isolate evolved proteins. We highlight recent cases in which directed evolution generated enzymatic activities and substrate specificities not known to exist in nature.
Directed evolution is a cyclic process that alternates between gene diversification and screening for or selection of functional gene variants.
Library size limitations can be overcome by focusing library diversity on residues implicated by molecular structures, computational models or phylogenetic data. In cases in which there is limited information, random mutagenesis can be used to interrogate the uncertain determinants of protein function.
Recombination methodologies access new combinations of functional variation and can shuffle disparate genetic elements to yield new chimeric proteins.
Low-throughput screens can directly measure individual phenotypes and thus accurately isolate desired subpopulations. Screen throughput can be increased using indirect visible reporters that are strongly coupled to the desired phenotypes.
Selections isolate functional variants through selective replication schemes or physical segregation. Selections operate simultaneously on entire populations and thus offer unparalleled throughput.
Over many generations, iterated mutation and natural selection during biological evolution provide solutions for challenges that organisms face in the natural world. However, the traits that result from natural selection only occasionally overlap with features of organisms and biomolecules that are sought by humans. To guide evolution to access useful phenotypes more frequently, humans for centuries have used artificial selection, beginning with the selective breeding of crops1 and domestication of animals2. More recently, directed evolution in the laboratory has proved to be a highly effective and broadly applicable framework for optimizing or altering the activities of individual genes and gene products, which are the fundamental units of biology.
Genetic diversity fuels both natural and laboratory evolution. The occurrence rate of spontaneous mutations is generally insufficient to access desired gene variants on a time scale that is practical for laboratory evolution. A number of genetic diversification techniques are therefore used to generate libraries of gene variants that accelerate the exploration of a gene's sequence space. Methods to identify and isolate library members with desired properties are a second crucial component of laboratory evolution. During organismal evolution, phenotype and genotype are intrinsically coupled within each organism. However, during laboratory evolution (Fig. 1a), it is often inconvenient or impossible to manipulate genes and gene products in a coupled manner. Therefore, single-gene evolution in the laboratory requires carefully designed strategies for screening or selecting functional variants in ways that maintain the genotype–phenotype association.
In this Review, we summarize techniques that generate single-gene libraries, including standard methods as well as novel approaches that can generate superior diversity containing a larger proportion of functional mutants. We also review screening and selection methods that identify or isolate improved variants within these libraries. Although these strategies can be applied to multigene pathways3,4 and gene networks5,6,7, the examples in this Review will focus exclusively on the laboratory evolution of single genes. In addition, although many of these approaches apply to other types of biomolecules, we focus on the directed evolution of proteins because protein evolution has proved to be especially useful for generating novel biocatalysts8, reagents9 and therapeutics10.
Methods for gene diversification
It is impossible to cover the entire mutational space of a typical protein: complete randomization of a mere decapeptide would yield 10^13 unique combinations of amino acids, which exceeds the achievable library size of almost all known protein library creation methods. Because comprehensive coverage of sequence space is impossible, gene diversification strategies are designed to perform an optimal sparse sampling of a vast multidimensional sequence space. The activity level of each library member can be conceptualized as the elevation in a fitness landscape on an x–y coordinate that represents the genotype of that library member. The goal of directed evolution studies is to take mutational steps within this landscape that 'climb' towards peak activity levels (Fig. 1b). Over many generations, these beneficial mutations accumulate, resulting in a successively improved phenotype.
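The scale of this sampling problem can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the example library size of 10^9 clones are assumptions chosen for the example, not values taken from this Review.

```python
def sequence_space_size(n_positions: int, alphabet_size: int = 20) -> int:
    """Number of distinct sequences when n_positions are fully randomized."""
    return alphabet_size ** n_positions


def max_fraction_sampled(library_size: int, n_positions: int) -> float:
    """Upper bound on the fraction of sequence space a library can contain.

    Assumes every library member is unique, which real libraries are not.
    """
    return min(1.0, library_size / sequence_space_size(n_positions))


if __name__ == "__main__":
    # Fully randomized decapeptide: 20**10 sequences, on the order of 10^13.
    print(sequence_space_size(10))
    # Hypothetical library of 10^9 unique clones (an assumed size).
    print(max_fraction_sampled(10**9, 10))
```

Even under the generous assumption that no clone is duplicated, such a library would contain only a small fraction of the possible decapeptides, which is why diversification strategies aim for sparse rather than exhaustive sampling.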
Researchers can use focused mutagenesis to maximize the likelihood that a library contains improved variants, provided that amino acid positions that are likely determinants of the desired function are known. In the absence of plausible structure–function relationships, random mutagenesis can provide a greater chance of accessing functional library members than focusing library diversity on incorrectly chosen residues that, when mutated, do not confer desired activities. Researchers have developed an extensive range of methods to perform both forms of gene diversification, and the most successful strategies often integrate random and focused mutagenesis.
Random mutagenesis. Traditional genetic screens use chemical and physical agents to randomly damage DNA. These agents include alkylating compounds such as ethyl methanesulfonate (EMS)11, deaminating compounds such as nitrous acid12, base analogues such as 2-aminopurine13, and ultraviolet irradiation14. Chemical mutagenesis is sufficient to deactivate genes at random for a genome-wide screen but is less commonly used for directed evolution because of biases in mutational spectrum11,12.
Non-chemical methods to randomly mutate genes frequently enhance the rate of errors during DNA replication. In Escherichia coli, DNA replication by DNA polymerase III introduces mutations at a rate of 10^-10 mutations per replicated base15. This rate is increased in mutator strains containing deactivated proofreading and repair enzymes, mutS, mutT and mutD15,16,17. Transformation of the XL1-red strain with a plasmid bearing the evolving gene yields mutations at a rate of 10^-6 per base per generation16. Unfortunately, these strains not only mutate the library member but also induce deleterious mutations in the host genome. Host intolerance to a high degree of genomic mutation places an upper limit on in vivo mutagenesis rates. To avoid this constraint, C. C. Liu and co-workers18 developed orthogonal in vivo DNA replication machinery that only mutates target DNA. This method co-opts naturally occurring Kluyveromyces lactis linear plasmids pGKL1/2 and their specialized TP-DNA polymerases. Because this plasmid is exclusively cytoplasmic, the TP-DNA polymerase exerts no mutational load on the host genome within the nucleus of Saccharomyces cerevisiae.
The relatively low mutation rates and the lack of control offered by most previously described in vivo random mutagenesis protocols have led to a strong preference towards in vitro random mutagenesis strategies. In error-prone PCR (epPCR), first described by Goeddel and co-workers19, the low fidelity of DNA polymerases under certain conditions generates point mutations during PCR amplification of a gene of interest. Increased magnesium concentrations, supplementation with manganese or the use of mutagenic dNTP analogues20 can reduce the base-pairing fidelity and increase mutation rates to between 10^-4 and 10^-3 per replicated base21. Because mutations during PCR accumulate with each cycle of amplification, it is possible to increase the average number of mutations per clone by increasing the number of cycles.
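As a rough illustration of how cycle number controls mutational load, the sketch below treats the number of mutations per clone as Poisson distributed. This is a simplifying assumption, as are the function names, the 1,000-bp gene length and the idea that every final molecule descends from a lineage copied a fixed number of times (`replications`); real epPCR yields also depend on amplification efficiency and polymerase bias.

```python
import math


def expected_mutations(rate_per_base: float, gene_length: int, replications: int) -> float:
    """Mean mutations per clone = per-base error rate x gene length x copies made."""
    return rate_per_base * gene_length * replications


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k mutations under the Poisson assumption."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


if __name__ == "__main__":
    # Assumed 1,000-bp gene at the low end of the quoted epPCR error rates.
    for replications in (5, 10, 20):
        mean = expected_mutations(1e-4, 1000, replications)
        unmutated = poisson_pmf(0, mean)
        print(f"{replications} replications: mean={mean:.1f}, unmutated fraction={unmutated:.2f}")
```

Under these assumptions, adding replications raises the mean mutation count linearly while the fraction of unmutated clones falls exponentially.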
One application of epPCR is to generate neutral drift libraries. Before directed evolution experiments are carried out, a target gene is mutagenized by epPCR and fused to a GFP reporter, and the variants are then screened for proper protein expression22. After multiple rounds of mutagenesis and screening, the resulting neutral drift library exhibits sequence diversity that does not destabilize protein structure and is therefore largely devoid of the deleterious mutations that would otherwise have accumulated during the multiple rounds of mutagenesis. Such libraries provide a valuable and evolvable starting point for subsequent directed evolution of the target protein towards a phenotype of interest22.
The DNA polymerases used in epPCR exhibit mutational biases, but unbalanced dNTP concentrations and proprietary mixtures of polymerases can help to reduce imbalance in the mutational spectrum23,24. To yield a more ideal nucleotide mutational spectrum, Schwaneberg and co-workers25 developed sequence saturation mutagenesis (SeSaM) in which the universal base deoxyinosine is enzymatically inserted throughout the target gene. Although this approach is effective, epPCR is easier to implement and can provide high mutation rates with fairly broad mutational spectra.
Focused mutagenesis strategies. Many proteins are structurally characterized at sufficient resolution to implicate specific residues in substrate binding or catalysis. Although random mutagenesis can generate stochastic point mutations at codons corresponding to these residues, access to codons that require mutation of more than one nucleotide relative to the initial codon often requires a focused mutagenesis strategy. Perhaps the most straightforward focused mutagenesis approach uses synthetic DNA oligonucleotides containing one or more degenerate codons at positions corresponding to targeted residues. This mutagenic oligonucleotide is incorporated into a gene library as a mutagenic cassette26 using either traditional restriction enzyme cloning or contemporary gene assembly protocols27,28,29. The simultaneous saturation mutagenesis of multiple residues can access combinations of mutations that may exhibit epistatic interactions. For example, synergistic mutations are those that in combination confer an effect that is larger than the sum of the effects of each individual mutation. Two beneficial mutations that exhibit synergism can undergo sequential enrichment and are therefore accessible through iterative single-residue saturation libraries. However, to access combinations of mutations exhibiting sign epistasis — a case in which mutations may be deleterious in isolation but confer gain of function in combination — sequential acquisition is impossible, and simultaneous saturation is therefore necessary.
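The difference between synergy and sign epistasis can be made concrete with a toy two-mutation landscape. In this sketch the fitness values are invented for illustration, genotypes are written as two-character strings ('00' is the parent, '11' the double mutant), and `reachable_by_single_steps` is a hypothetical helper. It asks whether the double mutant can be reached when every single-mutation step must improve fitness, as it must to be enriched in an iterative single-residue campaign.

```python
def reachable_by_single_steps(fitness: dict[str, float]) -> bool:
    """True if '11' can be reached from '00' via a path that improves at every step."""
    for intermediate in ("10", "01"):
        first_step_up = fitness[intermediate] > fitness["00"]
        second_step_up = fitness["11"] > fitness[intermediate]
        if first_step_up and second_step_up:
            return True
    return False


# Synergy: each mutation helps alone, and the pair exceeds the sum of the single effects.
synergistic = {"00": 1.0, "10": 1.5, "01": 1.4, "11": 3.0}

# Sign epistasis of the kind described above: each mutation alone is deleterious,
# but the combination confers a gain of function.
sign_epistatic = {"00": 1.0, "10": 0.6, "01": 0.7, "11": 3.0}

if __name__ == "__main__":
    print(reachable_by_single_steps(synergistic))
    print(reachable_by_single_steps(sign_epistatic))
```

In the second landscape no improving single-step path exists, so only a library that varies both positions at once would contain the double mutant.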
As the number of unique sequences increases exponentially with the number of randomized sites, only a handful of residues can be randomized if complete coverage of the resulting combinations of mutations is desired. Furthermore, the vast majority of individual mutations are likely to be neutral or deleterious to the desired activity30. The mutational load of simultaneous saturation increases with the number of randomized sites, and the resulting library will be populated with a larger fraction of inactive library members.
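To see how quickly simultaneous saturation outgrows a screen, the sketch below counts codon-level variants and estimates how many clones must be sampled for a chosen coverage. It assumes NNK degenerate codons (32 codons per site, a common choice that is not specified in this Review), equal representation of all variants, and Poisson sampling, so that the expected fraction of variants observed after sampling T clones from V variants is 1 - exp(-T/V). The function names are illustrative.

```python
import math


def codon_library_size(n_sites: int, codons_per_site: int = 32) -> int:
    """Number of distinct codon combinations for n_sites randomized positions."""
    return codons_per_site ** n_sites


def clones_for_coverage(n_sites: int, coverage: float = 0.95, codons_per_site: int = 32) -> int:
    """Clones to sample so that the expected fraction of variants seen equals `coverage`."""
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must be strictly between 0 and 1")
    variants = codon_library_size(n_sites, codons_per_site)
    # Solve coverage = 1 - exp(-T / V) for T and round up to a whole clone.
    return math.ceil(-variants * math.log(1.0 - coverage))


if __name__ == "__main__":
    for n_sites in range(1, 6):
        print(n_sites, codon_library_size(n_sites), clones_for_coverage(n_sites))
```

Because the required number of clones scales with 32 raised to the number of sites under these assumptions, each additional randomized position multiplies the screening burden by 32.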
For this reason, a number of focused mutagenesis strategies only introduce specific amino acid substitutions that are likely to be beneficial. Phylogenetic analyses of homologous proteins, which are pre-enriched for functional variation owing to natural selection, are one means for identifying these potentially beneficial mutations. Wyss and co-workers31 demonstrated that the introduction of consensus mutations can improve thermostability and native enzymatic activity. Rather than focusing on common ancestral mutations, reconstructed evolutionary adaptive path (REAP) analysis identifies significant mutational divergence that is more likely to confer novel gain of function32. These mutational signatures can be adopted from a distinct evolutionary pathway with known phenotypic characteristics and further curated based on structural proximity to the active site.
Molecular modelling can also predict specific amino acid substitutions that are likely to be beneficial33. Algorithms such as Rosetta calculate free energies based on steric clashes, hydrophobic packing, hydrogen bonding and electrostatic interactions34. Mutations that are predicted to stabilize protein folding35 or to improve transition state stabilization can be introduced into the library semi-stochastically by incorporating synthetic oligonucleotides via gene reassembly (ISOR)36.
Diversification by recombination. The reassortment of mutations to access beneficial combinations of mutations is a crucial component of biological evolution. This natural process can be mimicked by a variety of methods under the broad umbrella of homologous recombination. The original DNA shuffling method described by Stemmer37 fragments a gene with DNase and then allows fragments to randomly prime one another in a PCR reaction without added primers (Fig. 2a). A related method developed by Monticello and colleagues38, random chimeragenesis on transient templates (RACHITT), also uses DNase-mediated fragmentation but a different method of reassembly. Fragments anneal directly to a temporary uracil-containing scaffold; upon flap resection and fragment ligation, the scaffold is digested. DNase concentration and fragmentation reaction duration offer crude mechanisms to shift fragment sizes and crossover frequencies, but newer protocols provide greater control. Nucleotide exchange and excision technology (NExT)39 incorporates a fixed concentration of deoxyuridine triphosphate (dUTP) during PCR; subsequent treatment with uracil deglycosylases and apurinic/apyrimidinic lyases yields random fragments with size distribution determined by dUTP concentration. Unlike fragmentation-based methods, staggered extension process (StEP) described by Arnold and colleagues40 is a modified PCR protocol in which the elongation step is interrupted prematurely by heat denaturation. Subsequent annealing allows incomplete extension products to switch templates, effecting recombination of multiple DNA templates into one amplicon (Fig. 2b).
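The outcome of these homologous recombination methods, a chimera whose segments derive from different parents, can be mimicked in silico. The sketch below is a toy model rather than a simulation of any specific protocol: it assumes pre-aligned parent sequences of equal length and places a user-chosen number of crossovers at uniformly random positions, switching to a different randomly chosen parent at each one. All names are hypothetical.

```python
import random


def shuffle_parents(parents: list[str], n_crossovers: int, rng: random.Random) -> str:
    """Build one chimera from aligned, equal-length parents (toy model)."""
    if len(parents) < 2:
        raise ValueError("at least two parents are required")
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("parents must be aligned and of equal length")
    if not 0 <= n_crossovers <= length - 1:
        raise ValueError("n_crossovers must be between 0 and length - 1")

    # Distinct crossover positions strictly inside the sequence.
    breakpoints = sorted(rng.sample(range(1, length), n_crossovers))
    boundaries = [0, *breakpoints, length]

    current = rng.randrange(len(parents))
    pieces = []
    for start, end in zip(boundaries, boundaries[1:]):
        pieces.append(parents[current][start:end])
        # Template switch: continue from a different parent after each crossover.
        current = rng.choice([i for i in range(len(parents)) if i != current])
    return "".join(pieces)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example is repeatable
    print(shuffle_parents(["AAAAAAAAAAAA", "CCCCCCCCCCCC", "GGGGGGGGGGGG"], 3, rng))
```

The `n_crossovers` argument plays the role that fragment size or extension time plays at the bench: more crossovers yield finer-grained mosaics of the parental sequences.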
With the decreasing cost of synthetic oligonucleotides, assembly PCR41 (also known as assembly of designed oligonucleotides (ADO) or synthetic shuffling42,43) has become a preferred recombination strategy. In these reactions, overlapping primers extend one another; after multiple cycles the process yields full-length gene products in which each combination of mutation-bearing oligonucleotides has been recombined (Fig. 2c).
Recombination is only effective on a diverse population of functional genes. Typically, one of the recombination methods described above is used between rounds of evolution to recombine mutations from distinct clones44. Alternatively, homologous recombination with copies of the wild-type DNA sequence can eliminate non-beneficial passenger mutations, analogous to the traditional breeding technique of back-crossing37. In another effective use of homologous recombination during gene diversification, a family of closely related naturally occurring homologues can be shuffled into a starting library to take advantage of nature's pre-evolved repertoire of functional gene variants45.
Methods for in vitro recombination require substantial manual manipulation and are usually followed by transformation or transduction to introduce the recombined gene population back into cells. Cornish and colleagues46 harnessed the power of native systems in S. cerevisiae to perform homologous recombination between a library of donor cassettes and the evolving gene. Through yeast mating, functional gene variants undergo reassortment with different donor cassettes, allowing homologous recombination within the evolving population. Seamless alternation between sexual reproduction and selection can support continuous evolution46.
The recombination methods described above rely on sequence homology to preserve gene structure among recombinants. By contrast, sequence homology-independent protein recombination (SHIPREC) permits shuffling of disparate gene elements. Such a capability is particularly useful for recombining families of proteins with similar functions but disparate sequences47. Homology-independent recombination can also create combinatorial protein libraries that do not preserve the ordering or lengths of domains. Ostermeier et al.48 devised incremental truncation for the creation of hybrid enzymes (ITCHY), in which homology-independent recombination is used to create hybrid enzymes through the incremental truncation and fusion of two distinct genes (Fig. 2d). In addition, our laboratory49 used non-homologous random recombination (NRR) to generate functional proteins with substantial rearrangements of domain topology (Fig. 2e). Although these two techniques are different, they both involve random fragmentation (for example, using a DNase or an exonuclease) followed by sequence-independent ligation of fragments. Tuning fragmentation conditions can shift the average number of crossovers, and electrophoresis can be used to isolate ligated products of the desired length to minimize inactive library members that are too short or too long, or that have excessive numbers of crossovers.
Nonetheless, the vast majority of non-homologous recombinants will display domain disruption and folding instability. The SCHEMA algorithm computationally identifies breakpoints in proteins that minimize the number of inter-domain interactions50. Type IIb restriction enzyme sites can be inserted at these optimal breakpoints within the DNA sequences, and enzymatic digestion yields 'sticky ends' that enable sequence-independent site-directed chimeragenesis (SISDC)51. Alternatively, chimeric oligonucleotides with complementarity to two distinct domains defined either by eukaryotic exons52 or by SCHEMA can be used in overlap extension PCR53. A library of these chimeric primers can be used to shuffle domains even in the complete absence of homology.
Diversification strategy considerations. Directed evolution practitioners increasingly use sophisticated focused mutagenesis methods to construct smaller libraries of higher quality that sample a functionally rich portion of the fitness landscape. These strategies require phylogenetic information or molecular structures to focus library diversity on residues or even specific substitutions that are thought to be necessary for the desired activity. In the absence of this information, random mutagenesis is an absolute necessity. Even when the requisite data are available, deducing the determinants of protein function at the amino acid level can be challenging. Random mutagenesis may be used to probe mutations that are distant from obvious substrate contact sites or that are not present in naturally evolved orthologues. Fortunately, random and focused mutagenesis strategies can be combined into a single diversification step or applied separately during successive rounds of evolution to maximize the likelihood of success54 (Table 1).
Genetic screens for single-gene evolution
Genetic screens were originally developed to discover genes associated with specific phenotypes. Geneticists randomly mutagenize the genome of a model organism and then assay individual organisms for a phenotype of interest. Organisms with altered phenotypes are characterized by crossing and linkage analyses, or more recently by high-throughput DNA sequencing, to identify specific mutations underlying phenotypic changes. Directed evolution applies similar screening strategies to single-gene libraries prepared with the aforementioned diversification methods.
Screens of spatially separated variants. Spatial separation (that is, encoding by location) of individual mutants preserves the linkage between phenotype and genotype. For these screens, gene variants are expressed in a unicellular model organism such as E. coli that can be screened as colonies on solid media or transferred into multiwell liquid culture plates (Fig. 3A). Although spatial separation of clones imposes a practical throughput limit of fewer than ~10^4 library members per screening round, a key advantage of this approach is its broad compatibility with many different assay techniques. When a fluorescent readout is not available, techniques such as nuclear magnetic resonance (NMR), high-performance liquid chromatography (HPLC), gas chromatography or mass spectrometry can directly monitor substrate consumption or product formation. In principle, almost any enzymatic activity can be screened in a spatially separated library format, although the time-consuming and infrastructure-intensive nature of some spatially separated screening techniques further limits throughput.
When performing low-throughput screens, an understanding of structure–activity relationships within the target protein may be necessary to maximize the probability of accessing a desired variant. These considerations are best exemplified by the evolution of cytochromes P450, a class of enzymes with high evolutionary potential evidenced by the diverse oxidative reactions they catalyse in nature. Arnold and colleagues8 screened a panel of ~100 previously designed P450 variants in E. coli lysates for carbene transfer to form cyclopropanes; product formation was monitored by gas chromatography. The resulting enzymes exhibit high-activity cyclopropanation with enantioselectivity and diastereoselectivity, capabilities that are not known to exist in any natural biocatalysts. In this case, prior knowledge of mutants with altered P450 activities enabled success with only a small library and a low-throughput screen.
When molecular insight or prior knowledge is lacking, it may be necessary to screen more variants to reach the desired phenotype. High-throughput screens rely on the rapid assessment of optical features such as colour, fluorescence, luminescence or turbidity. In special cases, the protein of interest has an inherently visible phenotype, as demonstrated by the pioneering evolution of the alkaline serine protease subtilisin. You and Arnold55 screened colonies on casein plates for zones of clearing due to proteolysis of substrate milk proteins. A secondary screen on casein plates containing dimethylformamide (DMF) identified variants exhibiting solvent-tolerant proteolysis.
Fluorescent proteins provide a readily screenable phenotype, and thus multiple research groups have used cellular fluorescence as a screen to identify GFP variants with brighter fluorescence and altered absorption or emission spectra44,56. More recently, this approach was applied to Arch rhodopsin, a form of channel rhodopsin engineered by Cohen and colleagues57 to exhibit voltage-dependent fluorescence and used to directly image neuronal activity. Arnold and colleagues9 expressed a library of Arch variants in E. coli using multiwell liquid culture plates and washed cells with ionic buffer to generate the transmembrane potential required for fluorescence measurements. After multiple rounds of screening random and site-directed libraries, the most active variant displayed red-shifted emission and increased brightness. The capabilities of evolved Arch should enable parallel monitoring of multiple neurons using wide-field microscopy.
Most biomolecules are not associated with directly observable phenotypes and therefore require a fluorescent, colorimetric or other readily detectable reporter. Surrogate substrates can be added directly to liquid culture or lysates to generate a fluorescent, luminescent or colorimetric signal that is proportional to the enzymatic activity of interest. As a result, these reporters allow precise screening of diverse catalysts such as P450 monooxygenases58, cellulases59, organophosphate hydrolases60 and retroaldolases61. However, the development of surrogate substrates for some reactions can represent a substantial undertaking62. In addition, evolved variants will have only been screened for activity on a surrogate substrate, and they must be separately assayed to ensure that enzyme optimization on the surrogate also improves activity on the desired substrate.
Widely used genetic reporters such as GFP, luciferase and beta-galactosidase enable facile detection of gene expression. Expression-mediated screens have been developed for the study of protein–protein interactions63 and the activity of enzymes including cellulases and glycosynthases64,65,66. As a general strategy, small-molecule- or cell-state-inducible genetic circuitry from nature can be used to detect desired enzymatic activity. For example, Ackerley and co-workers67 used the DNA-damage inducible SOS promoter to express beta-galactosidase in proportion to nitroreductase activation of genotoxic prodrugs. Through iterative site-directed mutagenesis, this screen identified nitroreductase variants that activated chemotherapeutic prodrugs and killed tumour cells with greater efficiency than wild-type nitroreductase. Gene expression reporters are imperfect measures of enzymatic activity but, when used properly, can correlate strongly with enzymatic activity68. Automated fluorescence measurement and robotic colony picking lighten the tedious workload of these screens, but the physical and material constraints associated with spatial separation inherently limit throughput.
High-throughput screening by flow cytometry. Rather than spatially separating clones, a bulk population can be interrogated at the level of individual cells using the cell wall or membrane to maintain genotype–phenotype association. Fluorescence-activated cell sorting (FACS)69 relies on a non-diffusing fluorescent reporter to automate the identification and isolation of cells containing desired gene variants (Fig. 3B). Integrating major advances in microfluidics, optics and cell manipulation, state-of-the-art flow cytometry offers one of the highest capacities of any screening method, achieving up to 10^8 library members screened in <24 hours70,71.
Cytosolic fluorescent or luminescent proteins within cells can form the basis for FACS screens of enzymes such as recombinases, chaperones and inteins72,73,74. Cell surface-displayed epitopes are also non-diffusive and can be detected by FACS using fluorescent-labelled antibodies. This approach became more widely used with the development of a yeast display screen for protein–protein interactions71. Boder and Wittrup71 expressed a library of epitope-tagged antibody fragments fused to the yeast mating adhesion receptor Aga2. The resulting library members were displayed on the surface of cells, where they had the opportunity to bind to a target protein fused to a second epitope tag. FACS enabled the isolation of cells decorated with two fluorescent-labelled antibodies, one for each of the epitopes, indicating proper antibody display and target binding (Fig. 3Ca). Researchers can modulate the stringency of FACS screens by varying washing conditions and the fluorescence threshold that triggers cell isolation. For many years, yeast surface display has facilitated affinity maturation of antibody–antigen pairs75 and the discovery of new protein–protein interactions76.
Recently, the yeast display framework has been applied to the evolution of more diverse enzymatic activities. Bond-forming enzymes can be evolved using yeast display, as our laboratory77 demonstrated by evolving sortase A (SrtA), a sequence-specific transpeptidase (that is, protein ligase) from Staphylococcus aureus. Aga2–SrtA library members were displayed on the cell surface alongside a triglycine (GGG) acceptor peptide fused to Aga1. Upon incubation with the biotinylated substrate peptide LPETG, active SrtA catalysed bond formation between the substrate and the acceptor. FACS was used to isolate cells displaying the biotinylated LPETGGG product (Fig. 3Cb). Owing to the unfavourable kinetics of wild-type SrtA, efficient bioconjugation typically requires equimolar concentration of substrate and enzyme. Iterated rounds of FACS screening with increasing stringency produced evolved variants of SrtA (eSrtA) with 140-fold higher kcat/Km values, enabling new applications78,79,80,81,82,83.
The development of a negative screen (also known as counterscreen) using unlabelled competitor substrates enabled our laboratory84 to evolve reprogrammed orthogonal sortases that selectively conjugate LAETG or LPESG substrates. Because substrates are applied ex vivo, this approach is not limited to genetically encoded peptide substrates, and it should be possible to design similar screens for enzymes that catalyse many different classes of bond-forming reactions.
Yeast display can also be modified for the evolution of bond-cleaving enzymes. Iverson, Georgiou and colleagues85 developed yeast endoplasmic reticulum sequestration screening (YESS) in which Aga2 is expressed as a fusion protein to a negative screening substrate, epitope tag 1, a positive screening substrate and epitope tag 2. The Aga2 substrate is retained in the endoplasmic reticulum for processing by a member of a protease library. The presence of both epitope tags on the cell surface indicates protease inactivity, whereas proteolysis of the negative screening substrate would eliminate both tags. FACS isolated the subpopulation of proteases that exclusively cleaved the positive screening substrate and thereby left only epitope tag 1 on the cell surface (Fig. 3Cc). Using YESS, Iverson, Georgiou and colleagues85 evolved tobacco etch virus (TEV) protease variants that selectively cleave ENLYFE/S or ENLYFH/S sequences but not the wild-type substrate ENLYFQ/S. These recent advances demonstrate how cell surface display can be adapted to screen for complex enzymatic activities.
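The logic of the YESS readout can be summarized as a small truth table. The sketch below encodes the substrate layout described above (Aga2, negative screening substrate, epitope tag 1, positive screening substrate, epitope tag 2) and a two-colour gate. The function names, signal units and threshold value are hypothetical and stand in for instrument-specific FACS gates.

```python
def surface_tags(cleaves_positive: bool, cleaves_negative: bool) -> set[str]:
    """Epitope tags expected on the cell surface after processing in the ER."""
    if cleaves_negative:
        # Cutting between Aga2 and tag 1 releases everything downstream of Aga2.
        return set()
    if cleaves_positive:
        # Cutting between tag 1 and tag 2 removes only tag 2.
        return {"tag1"}
    # Inactive protease: the full-length substrate is displayed.
    return {"tag1", "tag2"}


def passes_gate(tag1_signal: float, tag2_signal: float, threshold: float = 1000.0) -> bool:
    """Keep cells that are tag 1 positive and tag 2 negative (assumed threshold)."""
    return tag1_signal >= threshold and tag2_signal < threshold


if __name__ == "__main__":
    for pos in (False, True):
        for neg in (False, True):
            print(f"cleaves positive={pos}, negative={neg}: {sorted(surface_tags(pos, neg))}")
```

Only proteases that cut the positive substrate while sparing the negative one produce the tag 1-only phenotype that the gate retains, which is how a single sort applies positive and negative pressure together.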
Screening artificial cell-like compartments. When cell-constrained fluorescent reporters are difficult or impossible to implement for a given gene and phenotype, in vitro compartmentalization (IVC) provides an alternative format to enable high-throughput screening. IVC, pioneered by Tawfik and Griffiths86, uses the aqueous droplets in water–oil emulsions to compartmentalize individual genes and gene products along with a surrogate fluorogenic substrate. IVC can enable protein evolution in two formats: either emulsion of single cells expressing the library member or emulsion of individual DNA molecules together with in vitro transcription–translation machinery. Because flow cytometers can only sort particles in an aqueous mixture, a secondary emulsion is necessary to create water–oil–water droplets87 (Fig. 3D) for FACS-based screening. The flexibility to use fluorogenic substrates expands the phenotypes and enzymes that can be screened by flow cytometry.
Recently, IVC coupled with flow cytometry was used to evolve mammalian paraoxonase 1 (PON1). Wild-type PON1 can degrade a variety of organophosphate compounds and has a weak activity on some nerve agents. Tawfik and colleagues60 used fluorogenic coumarin substrate analogues to sort IVC droplets based on phosphotriesterase activities of PON1 variants. The resulting evolved enzyme rePON1 exhibits a 10^5-fold increase in catalytic activity on cyclosarin and is the first enzyme to degrade G-type (sarin-like) nerve agents with sufficient efficiency to provide prophylactic protection.
Chip-based microfluidic systems ('FACS on a chip') offer several advantages over conventional flow cytometry apparatus. The process of microfluidic droplet formation is more likely to encapsulate single cells or DNA library members, and the consistent volume and quantity of fluorescent reporters in each droplet can support highly quantitative measurements88. Furthermore, the path length of the flow cell precisely dictates the reaction time. These advantages have been demonstrated in proof-of-concept screens for cellulase and peroxidase activities59,88.
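Single-occupancy encapsulation is often reasoned about with a Poisson loading model. The sketch below assumes that cells or DNA molecules are distributed randomly among droplets with a mean of `mean_per_droplet` per droplet; this model, the function names and the example loading values are assumptions for illustration and are not taken from the studies cited above.

```python
import math


def droplet_occupancy(mean_per_droplet: float, k: int) -> float:
    """Poisson probability that a droplet contains exactly k library members."""
    return math.exp(-mean_per_droplet) * mean_per_droplet ** k / math.factorial(k)


def single_fraction_of_occupied(mean_per_droplet: float) -> float:
    """Among non-empty droplets, the fraction holding exactly one library member."""
    if mean_per_droplet <= 0:
        raise ValueError("mean_per_droplet must be positive")
    occupied = 1.0 - droplet_occupancy(mean_per_droplet, 0)
    return droplet_occupancy(mean_per_droplet, 1) / occupied


if __name__ == "__main__":
    for mean in (0.1, 0.3, 1.0):
        print(f"mean={mean}: empty={droplet_occupancy(mean, 0):.2f}, "
              f"single among occupied={single_fraction_of_occupied(mean):.2f}")
```

Under this model, dilute loading keeps genotype and phenotype linked in nearly every occupied droplet, at the cost of many empty droplets that must still pass through the sorter.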
Alternative cell-like compartments beyond water–oil emulsions can also entrap genes, proteins and substrates in a suitable format for screening. Shell-like compartments made of layered polycationic and polyanionic polymers (polyelectrolytes) can encapsulate E. coli cells. Because these compartments are stable to detergent, DNA and protein remain linked even after detergent-induced cell lysis. Scott and Plückthun89 used this platform to screen for properly solubilized G protein-coupled receptors (GPCRs) that retain their structure and affinity for a fluorescent probe. In a similar approach, Hollfelder and co-workers60 built polyelectrolyte gel-shell beads (GSBs) that are compatible with flow cytometry (Fig. 3E). Using the fluorogenic organophosphate analogues described above, the researchers sorted GSBs based on phosphotriesterase activity to identify parathion hydrolase variants that more rapidly degrade organophosphate pesticides90.
Selections for functional proteins
Screening, by definition, requires the inspection of individual phenotypes. The resulting data, which can be very rich depending on the choice of observables, not only identify desirable subpopulations but also inform the choice of appropriate screen stringency in subsequent rounds of evolution. By contrast, selection bypasses the need to individually inspect each library member and instead links an activity of interest to physical separation of the encoding DNA or to survival of the organism producing active library members. The development of effective schemes by which molecular activities of interest lead to segregation or replication of desired variants can be a major undertaking that requires creativity and strong molecular intuition. A well-designed selection offers unparalleled throughput, albeit at the expense of potentially rich screening data. This drawback often necessitates a secondary phenotypic assay of selection hits in order to optimize diversification and selection protocols for the next cycle of evolution.
Selections for binding affinity. Because all library members in the same mixture undergo selection simultaneously, a molecular linkage between genes and the corresponding gene products, rather than spatial encoding, must be maintained. In a typical target-binding selection, protein library members with desired binding activity and their encoding DNA sequences are captured using an immobilized target, whereas non-binding library members are washed away. In cell surface display or phage display methods, a cell or bacteriophage serves as a compartment to link genes and gene products. Protein library members are expressed on the surface of the cell or the coat of the bacteriophage through fusion with endogenous cell surface proteins91 or phage coat proteins71,92. Phage display has proved to be highly effective in the development of therapeutic antibodies10,93 and in the elucidation of peptide binding motifs94.
Unlike screening methods that are typically limited by measurement throughput, a transformation bottleneck95,96 restricts library sizes that can be processed by selection methods such as cell surface display or phage display, both of which require intracellular translation. As bacterial transformation provides, at best, ~10⁹–10¹⁰ transformants per experiment, cell- or phage-based selection methods are generally limited to library sizes in this range96. Ribosome display, developed by Hanes and Plückthun97, can bypass this bottleneck through the use of in vitro translation reactions. In the absence of a stop codon and under carefully controlled conditions, ribosomes remain stably bound to both the mRNA and the growing polypeptide, thereby coupling proteins with their encoding genes. Similarly, mRNA display, developed by Wilson, Keefe and Szostak98, covalently links a translated protein to its encoding mRNA through a puromycin analogue. Binding selections are conceptually simple but limited in scope (Fig. 4a). They are well suited for evolving binding affinity but have only been used in a limited number of cases to evolve enzymes, including β-lactamases99 and RNA ligases100. Although binding affinity is an important component of enzymatic activity, catalytic efficiency and the rate of product release (two properties that are not necessarily maintained or improved during a binding selection) can strongly determine overall enzyme desirability.
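The transformation bottleneck described above can be made concrete with a short calculation. The sketch below assumes a library built by fully randomizing several positions with NNK codons (32 codons per position) and assumes that each transformant carries a variant drawn uniformly at random. Both are illustrative assumptions rather than details taken from the studies cited here, and diversity is counted at the DNA level.

```python
import math

# N (4 bases) x N (4 bases) x K (G or T) = 32 codons per randomized position.
CODONS_PER_NNK_POSITION = 32


def library_diversity(n_positions: int) -> int:
    """Number of distinct DNA sequences when n positions are fully randomized."""
    return CODONS_PER_NNK_POSITION ** n_positions


def expected_coverage(n_transformants: float, diversity: int) -> float:
    """Expected fraction of distinct variants sampled at least once.

    Assumes every transformant is an independent, uniform draw from the
    library, so the chance of missing a given variant is exp(-T / V).
    """
    return -math.expm1(-n_transformants / diversity)


if __name__ == "__main__":
    transformants = 1e10  # upper end of the range quoted in the text
    for n_positions in range(4, 10):
        diversity = library_diversity(n_positions)
        coverage = expected_coverage(transformants, diversity)
        print(f"{n_positions} NNK positions: {diversity:.2e} variants, "
              f"expected coverage {coverage:.4f}")
```

Under these assumptions, coverage falls off sharply once theoretical diversity exceeds the number of transformants, which is why in vitro display formats that avoid transformation can access larger libraries.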
Organismal survival as a basis for selection. In a second important class of selections, active library members enable organisms containing their corresponding genes to survive and replicate. Antibiotic resistance is perhaps the most straightforward activity to evolve using the selective replication of E. coli. Numerous studies have evolved enzymes that neutralize or export antibiotics, yielding variant enzymes that are predictive of natural evolutionary trajectories in microorganisms with tolerance to higher doses of antibiotics or resistance to a broader scope of antibiotic substrates45,101,102. In addition to evolving the genes that confer antibiotic resistance, it is also possible to use antibiotic selections to evolve other proteins by linking the desired activity to the expression of an antibiotic resistance gene. For example, Schultz and co-workers103,104 evolved aminoacyl tRNA synthetases that aminoacylate suppressor tRNAs with non-canonical amino acids, resulting in the suppression of a stop codon within a chloramphenicol efflux pump gene. In a similar strategy linking enzymatic activity to antibiotic resistance, Barbas and colleagues105 evolved recombinases with altered DNA sequence specificities by using their activity to reassemble a beta-lactamase gene.
Auxotroph complementation can also form the basis of selections for the evolution of metabolic enzymes. Xylose metabolism is an important target for protein evolution because xylose is a limiting factor in the conversion of lignocellulose biomass into ethanol for use in biofuels. Growth in media containing xylose as the sole carbon source enriches for genes encoding enzymes that better utilize this energy source. Using this strategy, monosaccharide transporters106 and a xylose isomerase107 were evolved for more efficient xylose consumption and ethanol production in S. cerevisiae.
The design of selections for protein activities that do not fulfil metabolic functions is more challenging and requires ingenuity. For example, Hilvert and colleagues108 evolved nanocontainers to more effectively trap HIV protease, a protein that is toxic to E. coli hosts and for which sequestration confers faster growth rates. This approach yielded lumazine synthase capsids that had tenfold higher loading capacity.
Selections within in vitro compartments. In vitro selections can bypass limitations of in vivo selections such as transformation efficiency bottlenecks and host genome mutations that unexpectedly influence selection survival. A popular approach to couple genes and gene products without using cells is the translation of library members in artificial compartments such as the aqueous droplets of water–oil emulsions. Selections within in vitro compartments are particularly well suited for enzymes that directly act on DNA substrates. For example, in a selection for meganucleases with altered sequence specificity, Stoddard and colleagues109 placed a mutated substrate sequence directly upstream of the meganuclease gene; DNA cleavage generated sticky ends that were competent for ligation of a PCR adapter. As a result, PCR within the emulsion droplets selectively amplified genes encoding nucleases that were active on the new substrate sequences.
In vitro selections for DNA and RNA polymerases in emulsions are also referred to as compartmentalized self-replication (CSR) because the polymerases that most efficiently replicate their encoding gene in an emulsion PCR are enriched post-selection110 (Fig. 4b). Using CSR, Holliger and colleagues110 evolved DNA polymerases with higher thermostability and expanded substrate preferences, including Taq polymerase variants that accept Cy3 and Cy5 fluorophore-linked dNTPs111. These evolved polymerases directly incorporate bright fluorescent dyes into DNA molecules, generating nucleic acid polymers with highly altered physical and chemical properties111. In a separate study, Holliger and colleagues112 used CSR to select DNA polymerases that more efficiently amplify damaged DNA isolated from extinct organisms.
The development of compartmentalized partnered replication (CPR) extends IVC selections beyond enzymes that act on DNA113. In CPR selection schemes, the evolving enzymatic activity controls expression of Taq polymerase. Higher concentrations of Taq lead to better PCR amplification of active genes within emulsion droplets containing single E. coli cells (Fig. 4c). The first demonstrations of CPR evolved T7 RNA polymerase variants with orthogonal promoter preferences113,114, an achievement that could in principle be accomplished using CSR. However, the power of CPR to evolve enzymes that do not act on DNA substrates was demonstrated through the evolution of tryptophanyl-tRNA synthetases that selectively charged the non-canonical amino acid 5-hydroxy-L-tryptophan onto suppressor tRNAs that suppress stop codons placed in the Taq polymerase gene113.
Emerging evolution paradigms
Continuous evolution. Traditional protein evolution methods require discrete time- and labour-intensive steps in which researchers generate gene libraries, introduce them into translation systems such as cells or in vitro compartments, perform screens or selections, and then isolate genes encoding library members with desired activities. Recently, researchers have developed methods by which all steps of the protein evolution cycle are performed continuously without manual intervention. These continuous evolution systems can markedly increase the efficiency of protein evolution and, therefore, the number of steps in the sequence space that can be explored in the search for optimal protein variants115.
The majority of continuous evolution experiments have selected for replicative fitness of microorganisms under continuous dilution. This continuous culture format has been applied to the evolution of bacterial genomes for shortened replication time116 and resistance to antibiotics117. Single-gene evolutions are also feasible in continuous culture, as demonstrated with chorismate mutase118 and β-lactamases119. However, specially designed continuous mutagenesis methods that only target the evolving gene of interest are crucial for long evolutionary trajectories to avoid host genome mutations that circumvent selections by inducing cell survival for reasons unrelated to the protein of interest. For this reason, error-prone polymerases that exclusively replicate the library member are particularly amenable to continuous evolution in both E. coli119 and S. cerevisiae18. In addition, the aforementioned system for in vivo recombination in S. cerevisiae exclusively triggers recombination in an evolving gene during alternating stages of sporulation and selection46.
Continuous evolution of viruses, including bacteriophage, is conducted in a fixed-volume vessel (a cellstat or 'lagoon') that is diluted with fresh bacterial host cells. The average residence time in the vessel is shorter than the time required for bacterial replication but longer than the time required for phage replication; thus, mutations only accumulate in the phage genome. This process has been used to study evolutionary dynamics within viral genomes120,121, but our laboratory122,123,124 (also B.P. Hubbard and D.R.L., unpublished observations) has more recently extended its application to single-gene evolution. In our phage-assisted continuous evolution (PACE) system, an evolving gene is inserted into the M13 bacteriophage genome in place of an essential phage gene such as gene III (gIII). Instead, the evolving gene controls expression of gIII from an accessory plasmid. If the phage encodes a functional library member, then pIII, the protein encoded by gIII, is produced. Only phage assembled in the presence of pIII are infectious and can go on to infect and replicate in fresh host cells that dilute the vessel (Fig. 4d).
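The timing constraint on the lagoon can be sketched numerically. The example below is a deliberately simplified model and not a description of the published PACE apparatus: it assumes a well-mixed vessel, treats phage propagation as simple exponential growth at a constant rate, and uses hypothetical volumes, flow rates and replication times chosen only for illustration.

```python
import math


def mean_residence_time_h(vessel_volume_ml: float, flow_rate_ml_per_h: float) -> float:
    """Average time a particle spends in a well-mixed, fixed-volume vessel."""
    return vessel_volume_ml / flow_rate_ml_per_h


def in_operating_window(residence_h: float,
                        phage_replication_h: float,
                        host_replication_h: float) -> bool:
    """True when residence time is longer than phage replication
    but shorter than host replication, as required in the text."""
    return phage_replication_h < residence_h < host_replication_h


def phage_fold_change(growth_rate_per_h: float,
                      residence_h: float,
                      elapsed_h: float) -> float:
    """Fold change in phage titre under continuous dilution.

    Assumes dN/dt = (r - D) * N, with dilution rate D = 1 / residence time.
    """
    dilution_rate = 1.0 / residence_h
    return math.exp((growth_rate_per_h - dilution_rate) * elapsed_h)


if __name__ == "__main__":
    # All values below are hypothetical.
    residence = mean_residence_time_h(vessel_volume_ml=40.0, flow_rate_ml_per_h=80.0)
    print("residence time (h):", residence)
    print("within window:", in_operating_window(residence, 0.2, 1.0))
    # Phage carrying an inactive library member (no infectious progeny) wash out;
    # phage that propagate faster than the dilution rate persist.
    print("inactive variant after 5 h:", phage_fold_change(0.0, residence, 5.0))
    print("active variant after 5 h:", phage_fold_change(4.0, residence, 5.0))
```

In this simplified picture, a phage population persists only if its propagation rate exceeds the dilution rate, which is how continuous dilution imposes selection without manual intervention.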
The continuous nature of PACE coupled with enhanced in vivo mutagenesis enables several hundred 'rounds' of selection, mutation and replication to take place per week without manual intervention. The first demonstration of PACE not only reprogrammed the promoter preferences of the T7 RNA polymerase but also suggested schemes for protein–protein interactions and recombinases122. A subsequent study developed a dominant-negative phage protein pIII-neg that can poison progeny phage and form the basis of negative selection123. The recent use of PACE to continuously evolve proteases124 and DNA-binding proteins (B.P. Hubbard and D.R.L., unpublished observations) demonstrates how PACE can be generalized through the development of gene circuitry that links desired enzymatic activities to the expression of gIII.
Computational design and directed evolution. Continuous evolution can extensively explore a fitness landscape over many rounds of evolution but, similar to other methods described above, accesses mutants that successively emerge from a starting gene. Computational protein design can initiate sequence space exploration from starting points that are inaccessible to evolutionary processes originating from naturally existing genes; as a result, it has the potential to expedite the evolution of completely novel protein functions125,126. Although growing computational power and more sophisticated design methodologies have recently produced complex designs such as macromolecular assemblies, receptors and even catalysts127,128,129, initial designs frequently remain suboptimal and require directed evolution to achieve high activity. For example, we and our collaborators130 used phage and yeast display to increase affinity between the designed binding partners Pdar and Prb. Designed enzymes such as peroxidases131 and retroaldolases61 can also be optimized through evolution, yielding efficiencies that rival unrelated natural catalysts of the same reactions. Perhaps the most impressive testimony to the power of computational design coupled with directed evolution is the creation of novel protein catalysts. Tawfik, Baker and co-workers132,133 achieved this aim by designing and evolving proteins that catalyse the Kemp elimination, a reaction not known to be carried out by natural enzymes.
Conclusions and perspectives
Current protein evolution methods each offer unique features that make them more appropriate for solving certain classes of molecular problems (Table 2). When choosing a methodology, researchers should assess the features of the protein that is being evolved to find an optimal screening and selection technology, as well as an appropriate accompanying genetic diversification strategy (Fig. 5). Pioneering studies in the field of directed evolution sought to improve the wild-type activity of enzymes through the enhancement of solubility, thermostability, affinity for substrate or catalytic turnover. These properties remain important in contemporary directed evolution because increased activity and stability often facilitate the engineering or evolution of other desirable properties. The pursuit of ambitious goals such as reprogrammed substrate selectivity33,85 and synthetically useful biocatalysts134 benefits from innovative screens and selections that balance the need for throughput and accurate assessments of library members. New screens and selections that achieve higher throughput or carry out more continuous rounds of evolution can broaden the exploration of the fitness landscape, whereas novel mutagenesis strategies increase the search efficiency. Through computational techniques and creative molecular biology protocols, diversity is focused on residues and specific mutations that influence desired activities135. New directed evolution methods will continue to generate proteins with useful new activities and specificities, as well as expand the scope of protein evolution to include even larger sets of chemical and biological functions.
This work was supported by the US Defense Advanced Research Projects Agency grants DARPA HR0011-11-2-0003 and DARPA N66001-12-C-4207, the US National Institutes of Health (NIH)/National Institute of General Medical Sciences (NIGMS) (grant R01 GM095501) and the Howard Hughes Medical Institute (HHMI).
- Natural selection
A process by which individuals with the highest reproductive fitness pass on their genetic material to their offspring, thus maintaining and enriching heritable traits that are adaptive to the natural environment.
- Artificial selection
(Also known as selective breeding). A process by which human intervention in the reproductive cycle imposes a selection pressure for phenotypic traits desired by the breeder.
- Libraries
Diverse populations of DNA fragments that are subject to downstream screening and selection.
- Library size
The number of variants that are subjected to screening and selection. Library sizes are limited by molecular cloning protocols and/or by host transformation efficiency.
- Focused mutagenesis
A strategy of diversification that introduces mutations at DNA regions expected to influence protein activity.
- Random mutagenesis
A strategy of diversification that introduces mutations in an unbiased manner throughout the entire gene.
- Mutational spectrum
The frequency of each specific type of transition and transversion. The evenness of this spectrum allows more thorough sampling of sequence space.
- Transformation
The process by which a cell directly acquires a foreign DNA molecule. A number of protocols allow high-efficiency transformation of microorganisms through treatments with ionic buffers, heat shock or electroporation.
- Neutral drift
A process that occurs in the presence of a purifying selection pressure to eliminate deleterious mutations. This is in contrast to genetic drift, a process by which mutations fluctuate in frequency in the absence of selection pressure.
- Degenerate codons
Codons constructed with a mixed population of nucleotides at a given position, thus sampling all possible amino acids within the constructed libraries. The most popular examples are NNK and NNS (where N can be any of the four nucleotides, K can be G or T, and S can be G or C).
- Epistatic interactions
Non-additive effects between mutations (for example, mutational synergy or synthetic lethality). As a result, the sequential acquisition of mutations is not always equivalent to mutational co-occurrence.
- Homologous recombination
A process by which separate pieces of DNA swap genetic material, guided by the annealing of complementary DNA fragments.
- Passenger mutations
(Also known as hitchhiker mutations). Unnecessary mutations that are enriched in a population owing to co-occurrence with a highly beneficial linked mutation.
- Transduction
The process by which a viral vector delivers a foreign DNA molecule to a cellular host.
- Evolutionary potential
The capacity of a protein to take on new functions through evolution. High thermostability allows for necessary but destabilizing mutations, and functional diversity of homologues is a demonstration of previous evolution in nature.
- Surrogate substrates
Substrate analogues that are permissive of enzymatic conversion but that, upon catalysis, exhibit chemical rearrangements that lead to altered optical properties, including visible colour, relief of fluorophore quenching, shifted fluorophore excitation or emission, and downstream chemiluminescence.
- Fluorescence-activated cell sorting
(FACS). A flow cytometry method in which an aqueous suspension of cells or cell-like compartments is measured for fluorescence (often at multiple wavelengths) one cell at a time and subsequently separated based on a fluorescence threshold.
- Negative screen
A screening method that involves depletion of an undesired phenotype.
- Positive screening
Enrichment for a desired activity such as improved kinetics, tolerance to unnatural conditions and acceptance of new substrates.
- Transformation bottleneck
The efficiency at which DNA library members are transferred into the host organism, thus restricting the number of variants that can be assessed by in vivo selection and screening.
- Auxotroph complementation
The ability of functional library members to resolve a metabolic defect in the host, leading to replication of DNA that encodes active library members.
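The 'Degenerate codons' entry in this glossary can be checked by direct enumeration. The sketch below expands the NNK and NNS schemes using the standard genetic code and counts the codons, amino acids and stop codons that each one samples. The helper names are illustrative, and only the degenerate symbols defined in the glossary entry (N, K and S) are handled.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Degenerate symbols as defined in the glossary entry.
DEGENERATE = {"N": "ACGT", "K": "GT", "S": "CG"}


def expand(degenerate_codon: str) -> list[str]:
    """List every concrete codon encoded by a degenerate codon such as 'NNK'."""
    choices = (DEGENERATE[symbol] for symbol in degenerate_codon)
    return ["".join(codon) for codon in product(*choices)]


def summarize(degenerate_codon: str) -> tuple[int, int, int]:
    """Return (number of codons, distinct amino acids, stop codons)."""
    translated = [CODON_TABLE[codon] for codon in expand(degenerate_codon)]
    amino_acids = set(translated) - {"*"}
    return len(translated), len(amino_acids), translated.count("*")


if __name__ == "__main__":
    for scheme in ("NNK", "NNS"):
        n_codons, n_amino_acids, n_stops = summarize(scheme)
        print(f"{scheme}: {n_codons} codons, {n_amino_acids} amino acids, "
              f"{n_stops} stop codon(s)")
```

Both schemes yield 32 codons that cover all 20 amino acids while retaining a single stop codon (TAG), compared with three stop codons among the 64 codons of a fully random NNN position.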