Reconstruction of evolving gene variants and fitness from short sequencing reads

Shen, Max W.; Zhao, Kevin T.; Liu, David R.

doi:10.1038/s41589-021-00876-6

Article
Published: 11 October 2021

Nature Chemical Biology volume 17, pages 1188–1198 (2021)

Abstract

Directed evolution can generate proteins with tailor-made activities. However, full-length genotypes, their frequencies and fitnesses are difficult to measure for evolving gene-length biomolecules using most high-throughput DNA sequencing methods, as short read lengths can lose mutation linkages in haplotypes. Here we present Evoracle, a machine learning method that accurately reconstructs full-length genotypes (R² = 0.94) and fitness using short-read data from directed evolution experiments, with substantial improvements over related methods. We validate Evoracle on phage-assisted continuous evolution (PACE) and phage-assisted non-continuous evolution (PANCE) of adenine base editors and OrthoRep evolution of drug-resistant enzymes. Evoracle retains strong performance (R² = 0.86) on data with complete linkage loss between neighboring nucleotides and large measurement noise, such as pooled Sanger sequencing data (~US$10 per timepoint), and broadens the accessibility of training machine learning models on gene variant fitnesses. Evoracle can also identify high-fitness variants, including low-frequency ‘rising stars’, well before they are identifiable from consensus mutations.
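The linkage problem described in the abstract can be illustrated with a toy example. The sketch below is not Evoracle itself; it only shows, for two hypothetical two-site populations, that very different full-length genotype mixtures can produce identical per-site mutation frequencies, which is all that fully unlinked short reads report. The function name and both populations are assumptions made for illustration.

```python
def marginal_frequencies(genotypes):
    """Collapse full-length genotype frequencies to per-site mutation frequencies.

    `genotypes` maps a tuple of 0/1 flags (one flag per site, 1 = mutated)
    to the frequency of that genotype in the population.
    """
    n_sites = len(next(iter(genotypes)))
    return [
        sum(freq for genotype, freq in genotypes.items() if genotype[site] == 1)
        for site in range(n_sites)
    ]


# Hypothetical population A: only the wild type and the double mutant exist.
pop_a = {(0, 0): 0.5, (1, 1): 0.5}

# Hypothetical population B: all four genotypes at equal frequency.
pop_b = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

marg_a = marginal_frequencies(pop_a)
marg_b = marginal_frequencies(pop_b)

# Both populations look the same once linkage between the sites is lost.
print("A:", marg_a)
print("B:", marg_b)
```

A single timepoint therefore cannot separate the two populations from per-site frequencies alone, so any reconstruction has to draw on further information such as how the frequencies change across timepoints.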

**Fig. 1: Genotype reconstruction during evolution from short-read sequencing data with incomplete physical linkage.**

**Fig. 2: Evolutionary fitness reconstruction from pooled Sanger sequencing of OrthoRep campaigns.**

**Fig. 3: Evolutionary time-series frequency reconstruction from non-continuous directed evolution data.**

**Fig. 4: Robustness to shorter read lengths.**

**Fig. 5: Robustness to measurement noise.**

**Fig. 6: Model-guided fitness optimization.**

Data availability

The sequencing data generated during this study are available at the NCBI Sequence Read Archive database under accession code PRJNA625117. Processed data have been deposited at https://doi.org/10.6084/m9.figshare.12121359.

Code availability

The code used for data processing and analysis is available at https://github.com/maxwshen/evoracle-dataprocessinganalysis. The Evoracle model is available at https://github.com/maxwshen/evoracle.

Acknowledgements

This work was supported by NIH R01 EB031172, R01 EB027793 and R35 GM118062, and the HHMI. We acknowledge an NSF Graduate Research Fellowship to M.W.S. We thank A. Vieira for assistance editing the manuscript.

Author information

Authors and Affiliations

Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA
Max W. Shen, Kevin T. Zhao & David R. Liu
Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
Max W. Shen, Kevin T. Zhao & David R. Liu
Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA
Max W. Shen, Kevin T. Zhao & David R. Liu

Authors

Max W. Shen
Kevin T. Zhao
David R. Liu

Contributions

Conceptualization, investigation, and computational and statistical analyses were performed by M.W.S. Data curation and formal analysis were conducted by M.W.S. Software and methodology were provided by M.W.S. and resources by K.T.Z. Validation was performed by M.W.S. Project administration was carried out by M.W.S. and D.R.L. The manuscript was written by M.W.S. and D.R.L. Visualization was provided by M.W.S. Supervision and funding acquisition were performed by D.R.L. TadA next-generation sequencing was performed by K.T.Z.

Corresponding author

Correspondence to David R. Liu.

Ethics declarations

Competing interests

D.R.L. is a co-founder of Beam Therapeutics, Prime Medicine, Editas Medicine and Pairwise Plants, companies that use genome editing technologies.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Evoracle model properties.

a, Regularization strategies. Comparison of loss incurred by L2 norm, variance, normalized statistical skew, and unnormalized statistical skew (our skew) regularizers for distributions of three variables. b, Synthetic data to demonstrate the utility of the skew regularizer. The top left graph shows a ground-truth simulated population containing only a wild-type genotype and a double mutant. Observed single-mutation frequencies from the ground-truth simulation were used by Evoracle to infer full-length genotype trajectories of the wild-type genotype, both single mutants, and the double mutant. Evoracle was run with varying values of beta (top right, bottom left, and bottom right). When beta is higher, Evoracle more correctly infers the ground-truth trajectories. Inferred genotype frequencies are plotted with a small jitter to show overlapping lines clearly. c, Robustness to hyperparameters. Performance while varying hyperparameters alpha and beta for Cry1Ac data. Reported statistics summarize performance across ten replicates with random parameter initializations.
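To make the comparison in panel a concrete, the following sketch evaluates the four named statistics on three hypothetical distributions over three variables. It assumes that "unnormalized statistical skew" means the third central moment and that "normalized statistical skew" divides it by the variance raised to the power 3/2; the caption does not give the exact definitions, nor how Evoracle signs or weights the term in its loss, so this only illustrates how the statistics differ.

```python
def mean(values):
    return sum(values) / len(values)


def l2_norm_squared(values):
    """Sum of squares of the entries."""
    return sum(v * v for v in values)


def variance(values):
    """Population variance (second central moment)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def unnormalized_skew(values):
    """Third central moment (assumed meaning of 'unnormalized skew')."""
    m = mean(values)
    return sum((v - m) ** 3 for v in values) / len(values)


def normalized_skew(values, eps=1e-15):
    """Third central moment divided by variance**1.5; 0.0 if variance is ~0."""
    var = variance(values)
    if var < eps:
        return 0.0
    return unnormalized_skew(values) / var ** 1.5


# Hypothetical frequency distributions over three genotypes.
distributions = {
    "uniform": (1 / 3, 1 / 3, 1 / 3),
    "two dominant": (0.5, 0.5, 0.0),
    "one dominant": (1.0, 0.0, 0.0),
}

for name, dist in distributions.items():
    print(
        name,
        round(l2_norm_squared(dist), 4),
        round(variance(dist), 4),
        round(normalized_skew(dist), 4),
        round(unnormalized_skew(dist), 4),
    )
```

Under these assumed definitions, the L2 norm and variance both grow as the distribution becomes more concentrated, whereas the third central moment changes sign: it is negative when two variables share the mass and positive when a single variable dominates.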

Extended Data Fig. 2 Evaluating Evoracle’s genotype proposal strategy.

a-b, Sequence proposal strategies. Performance with varying full-length genotype proposal strategies for (a) Cry1Ac data, and (b) TadA data. N = 40 replicates. Box plot depicts median and interquartile range. Default strategy is described in the Methods; x2 to x100 represent adding full-length genotypes comprising combinations of mutations to increase the total number of reconstructed full-length genotypes by the stated multiplicative factor of the default number. See Methods for more details.

Extended Data Fig. 3 Evolutionary fitness reconstruction from pooled Sanger sequencing of OrthoRep campaigns.

a, Comparison of ground-truth and inferred fitness, indicating a negative epistatic interaction between A-76V and D384Y, S404C in Cry1Ac. b-d, Comparison of MIC values and inferred fitness for evolved PfDHFR variants.

Extended Data Fig. 4 Evoracle performance on ABE8e evolution replicates.

Model evaluation on replicate PACE experiments 1 and 3 of the ABE8e directed evolution campaign. Samples 1-20 are from low-stringency PANCE, samples 21-29 are from high-stringency PANCE, and samples 30-36 are from PACE. a, Observed frequencies of 34 mutations. Colors represent amino acid mutations, using the same coloring scheme as in Fig. 2a-b. b, Observed full-length genotype trajectories. Colors represent full-length genotypes, using the same coloring scheme as in Fig. 2c-d. c, Inferred full-length genotype trajectories. Colors represent full-length genotypes, using the same coloring scheme as in Fig. 2c-d. d, Consistency between observed and predicted full-length genotype frequencies; scatter plot and swarm plot with kernel density estimate.
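Consistency between observed and predicted genotype frequencies, as in panel d, is commonly summarized with a squared correlation. The sketch below computes the squared Pearson correlation between two lists of frequencies. Whether the reported R² values use this definition or the coefficient of determination is not stated in this preview, so the choice here is an assumption, and the function name is hypothetical.

```python
def pearson_r_squared(observed, predicted):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if var_obs == 0 or var_pred == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov * cov / (var_obs * var_pred)


# Made-up genotype frequencies, pooled across genotypes and timepoints.
observed_freqs = [0.05, 0.10, 0.40, 0.45]
predicted_freqs = [0.08, 0.12, 0.35, 0.45]
print(round(pearson_r_squared(observed_freqs, predicted_freqs), 3))
```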

Extended Data Fig. 5 Evaluation of fitness inference.

a-b, Comparison of inferred fitness to fitness calculated from full-length reads for (a) Cry1Ac and (b) TadA.

Extended Data Fig. 6 Evoracle performance with varying sequencing read depth.

a-b, Full-length genotype reconstruction performance across timepoints with varying simulated read depths using binomial samples for (a) Cry1Ac and (b) TadA. Box plot depicts median and interquartile range. N = 50 independent replicates with random seeds.
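Simulating read depth with binomial samples can be sketched as follows: at a given depth, each read independently carries a mutation with probability equal to its true frequency, and the observed frequency is the fraction of reads carrying it. This is a minimal illustration with a made-up true frequency and arbitrary depths, not the authors' simulation code, and it uses only the standard library for portability.

```python
import random


def subsample_frequency(true_freq, depth, rng):
    """Observed mutation frequency from `depth` simulated reads.

    Each read carries the mutation with probability `true_freq`, so the
    number of mutated reads is a Binomial(depth, true_freq) draw.
    """
    hits = sum(1 for _ in range(depth) if rng.random() < true_freq)
    return hits / depth


rng = random.Random(0)  # fixed seed so a run can be repeated
true_freq = 0.2         # hypothetical true mutation frequency

for depth in (10, 100, 10_000):
    estimates = [subsample_frequency(true_freq, depth, rng) for _ in range(50)]
    spread = max(estimates) - min(estimates)
    print(f"depth={depth:>6}  spread across 50 replicates={spread:.3f}")
```

The sampling noise in the observed frequency shrinks roughly with the square root of the depth, which is the kind of noise a reconstruction method has to tolerate at low depth.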

Extended Data Fig. 7 Comparison to related methods.

a, Observed Cry1Ac (2,138 nt) genotypes from 34 timepoints (spanning 528 h) of PACE from PacBio long-read sequencing data. Colors represent distinct genotypes. Figure is the same as Fig. 1c and reproduced for convenience. b, Cry1Ac genotype frequencies reconstructed by SGML. Gray lines indicate genotypes that are not present in PacBio data. c, Comparison of performance by clonal Sanger sequencing depth compared to pooled Sanger sequencing. Box plots indicate median and interquartile range, and whiskers indicate extrema. N = 50 random seed replicates. d, Comparison of rising star performance by clonal Sanger sequencing depth vs pooled Sanger sequencing on 12 h interpolated Cry1Ac data.

Supplementary information

Supplementary Information

Supplementary Table 1 and Notes 1–4.

Reporting Summary

About this article

Cite this article

Shen, M.W., Zhao, K.T. & Liu, D.R. Reconstruction of evolving gene variants and fitness from short sequencing reads. Nat Chem Biol 17, 1188–1198 (2021). https://doi.org/10.1038/s41589-021-00876-6

Received: 25 November 2020
Accepted: 09 August 2021
Published: 11 October 2021
Issue Date: November 2021
DOI: https://doi.org/10.1038/s41589-021-00876-6

This article is cited by

Quantification of evolved DNA-editing enzymes at scale with DEQSeq
- Lukas Theo Schmitt
- Aksana Schneider
- Duran Sürün
Genome Biology (2023)
Prediction of designer-recombinases for DNA editing with generative deep learning
- Lukas Theo Schmitt
- Maciej Paszkowski-Rogacz
- Frank Buchholz
Nature Communications (2022)
In vivo hypermutation and continuous evolution
- Rosana S. Molina
- Gordon Rix
- Chang C. Liu
Nature Reviews Methods Primers (2022)