Abstract
Directed evolution can generate proteins with tailor-made activities. However, full-length genotypes, their frequencies and fitnesses are difficult to measure for evolving gene-length biomolecules using most high-throughput DNA sequencing methods, as short read lengths can lose mutation linkages in haplotypes. Here we present Evoracle, a machine learning method that accurately reconstructs full-length genotypes (R² = 0.94) and fitness using short-read data from directed evolution experiments, with substantial improvements over related methods. We validate Evoracle on phage-assisted continuous evolution (PACE) and phage-assisted non-continuous evolution (PANCE) of adenine base editors and OrthoRep evolution of drug-resistant enzymes. Evoracle retains strong performance (R² = 0.86) on data with complete linkage loss between neighboring nucleotides and large measurement noise, such as pooled Sanger sequencing data (~US$10 per timepoint), and broadens the accessibility of training machine learning models on gene variant fitnesses. Evoracle can also identify high-fitness variants, including low-frequency ‘rising stars’, well before they are identifiable from consensus mutations.
Data availability
The sequencing data generated during this study are available at the NCBI Sequence Read Archive database under accession code PRJNA625117. Processed data have been deposited at https://doi.org/10.6084/m9.figshare.12121359.
Code availability
The code used for data processing and analysis is available at https://github.com/maxwshen/evoracle-dataprocessinganalysis. The Evoracle model is available at https://github.com/maxwshen/evoracle.
Acknowledgements
This work was supported by NIH R01 EB031172, R01 EB027793 and R35 GM118062, and the HHMI. We acknowledge an NSF Graduate Research Fellowship to M.W.S. We thank A. Vieira for assistance editing the manuscript.
Author information
Contributions
Conceptualization, investigation, and computational and statistical analyses were performed by M.W.S. Data curation and formal analysis were conducted by M.W.S. Software and methodology were provided by M.W.S. and resources by K.T.Z. Validation was performed by M.W.S. Project administration was carried out by M.W.S. and D.R.L. The manuscript was written by M.W.S. and D.R.L. Visualization was provided by M.W.S. Supervision and funding acquisition were performed by D.R.L. TadA next-generation sequencing was performed by K.T.Z.
Ethics declarations
Competing interests
D.R.L. is a co-founder of Beam Therapeutics, Prime Medicine, Editas Medicine and Pairwise Plants, companies that use genome editing technologies.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Evoracle model properties.
a, Regularization strategies. Comparison of loss incurred by L2 norm, variance, normalized statistical skew, and unnormalized statistical skew (our skew) regularizers for distributions of three variables. b, Synthetic data to demonstrate the utility of the skew regularizer. The top left graph shows a ground-truth simulated population containing only a wild-type genotype and a double mutant. Observed single-mutation frequencies from the ground-truth simulation were used by Evoracle to infer full-length genotype trajectories of the wild-type genotype, both single mutants, and the double mutant. Evoracle was run with varying values of beta (top right, bottom left, and bottom right). With higher beta, Evoracle infers the ground-truth trajectories more accurately. Inferred genotype frequencies are plotted with a small jitter to show overlapping lines clearly. c, Robustness to hyperparameters. Performance while varying hyperparameters alpha and beta for Cry1Ac data. Reported statistics summarize performance across ten replicates with random parameter initializations.
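The ambiguity that panel b addresses can be illustrated with a minimal Python sketch. This is not the Evoracle model: the genotype encoding, the helper `single_mutation_frequencies`, and the example frequencies are illustrative assumptions. The sketch shows that a population containing only the wild type and the double mutant yields the same single-mutation frequencies as a population containing only the two single mutants, so single-mutation frequencies at one timepoint do not by themselves determine full-length genotype frequencies.

```python
import numpy as np

# Hypothetical encoding: each row is a full-length genotype over two sites,
# with 1 marking the presence of a mutation.
# Row order: wild type, single mutant A, single mutant B, double mutant.
GENOTYPES = np.array([
    [0, 0],
    [1, 0],
    [0, 1],
    [1, 1],
])

def single_mutation_frequencies(genotype_freqs):
    """Marginal frequency of each mutation given full-length genotype frequencies."""
    genotype_freqs = np.asarray(genotype_freqs, dtype=float)
    return genotype_freqs @ GENOTYPES

# Population 1: only wild type and double mutant (as in the ground truth of panel b).
wt_and_double = [0.5, 0.0, 0.0, 0.5]
# Population 2: only the two single mutants.
singles_only = [0.0, 0.5, 0.5, 0.0]

marginals_a = single_mutation_frequencies(wt_and_double)
marginals_b = single_mutation_frequencies(singles_only)

# Both populations produce identical per-mutation frequencies.
print(marginals_a, marginals_b)
```

Because both populations map to the same observed frequencies, an inference method needs additional information, such as trajectories over time or a regularizer, to prefer one explanation over the other.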
Extended Data Fig. 2 Evaluating Evoracle’s genotype proposal strategy.
a-b, Sequence proposal strategies. Performance with varying full-length genotype proposal strategies for (a) Cry1Ac data, and (b) TadA data. N = 40 replicates. Box plot depicts median and interquartile range. Default strategy is described in the Methods; x2 to x100 represent adding full-length genotypes comprising combinations of mutations to increase the total number of reconstructed full-length genotypes by the stated multiplicative factor of the default number. See Methods for more details.
Extended Data Fig. 3 Evolutionary fitness reconstruction from pooled Sanger sequencing of OrthoRep campaigns.
a, Comparison of ground-truth and inferred fitness, indicating a negative epistatic interaction between A-76V and D384Y, S404C in Cry1Ac. b-d, Comparison of MIC values and inferred fitness for evolved PfDHFR variants.
Extended Data Fig. 4 Evoracle performance on ABE8e evolution replicates.
Model evaluation on replicate PACE experiments 1 and 3 of the ABE8e directed evolution campaign. Samples 1-20 are from low-stringency PANCE, samples 21-29 are from high-stringency PANCE, and samples 30-36 are from PACE. a, Observed frequencies of 34 mutations. Colors represent amino acid mutations, using the same coloring scheme as in Fig. 2a-b. b, Observed full-length genotype trajectories. Colors represent full-length genotypes, using the same coloring scheme as in Fig. 2c-d. c, Inferred full-length genotype trajectories. Colors represent full-length genotypes, using the same coloring scheme as in Fig. 2c-d. d, Consistency between observed and predicted full-length genotype frequencies; scatter plot and swarm plot with kernel density estimate.
Extended Data Fig. 5 Evaluation of fitness inference.
a-b, Comparison of inferred fitness to fitness calculated from full-length reads for (a) Cry1Ac and (b) TadA.
Extended Data Fig. 6 Evoracle performance with varying sequencing read depth.
a-b, Full-length genotype reconstruction performance across timepoints with varying simulated read depths using binomial samples for (a) Cry1Ac and (b) TadA. Box plot depicts median and interquartile range. N = 50 independent replicates with random seeds.
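Simulating read depth with binomial samples, as this caption describes, can be sketched in Python as follows. The function name `simulate_observed_frequencies`, the depth values, and the example frequency table are illustrative assumptions and not the authors' code.

```python
import numpy as np

def simulate_observed_frequencies(true_freqs, read_depth, rng):
    """Resample mutation frequencies at a given read depth.

    Each entry of true_freqs is treated as the probability that a single read
    carries the mutation; the number of mutant reads is drawn from a binomial
    distribution with read_depth trials.
    """
    true_freqs = np.asarray(true_freqs, dtype=float)
    mutant_reads = rng.binomial(n=read_depth, p=true_freqs)
    return mutant_reads / read_depth

rng = np.random.default_rng(seed=0)

# Hypothetical table of mutation frequencies: rows are timepoints, columns are mutations.
true_freqs = np.array([
    [0.01, 0.20],
    [0.10, 0.60],
    [0.50, 0.95],
])

for depth in (10, 100, 10_000):
    observed = simulate_observed_frequencies(true_freqs, depth, rng)
    max_error = np.abs(observed - true_freqs).max()
    print(f"depth={depth}: largest absolute error {max_error:.4f}")
```

Repeating the resampling with different seeds gives independent replicates at each depth, which could then be passed to a reconstruction method to measure how performance changes with depth.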
Extended Data Fig. 7 Comparison to related methods.
a, Observed Cry1Ac (2,138 nt) genotypes from 34 timepoints (spanning 528 h) of PACE from PacBio long-read sequencing data. Colors represent distinct genotypes. Figure is the same as Fig. 1c and reproduced for convenience. b, Cry1Ac genotype frequencies reconstructed by SGML. Gray lines indicate genotypes that are not present in PacBio data. c, Comparison of performance by clonal Sanger sequencing depth compared to pooled Sanger sequencing. Box plots indicate median and interquartile range, and whiskers indicate extrema. N = 50 random seed replicates. d, Comparison of rising star performance by clonal Sanger sequencing depth vs pooled Sanger sequencing on 12 h interpolated Cry1Ac data.
Supplementary information
Supplementary Information
Supplementary Table 1 and Notes 1–4.
About this article
Cite this article
Shen, M.W., Zhao, K.T. & Liu, D.R. Reconstruction of evolving gene variants and fitness from short sequencing reads. Nat Chem Biol 17, 1188–1198 (2021). https://doi.org/10.1038/s41589-021-00876-6