Reconstruction of evolving gene variants and fitness from short sequencing reads

Abstract

Directed evolution can generate proteins with tailor-made activities. However, full-length genotypes, their frequencies and fitnesses are difficult to measure for evolving gene-length biomolecules using most high-throughput DNA sequencing methods, as short read lengths can lose mutation linkages in haplotypes. Here we present Evoracle, a machine learning method that accurately reconstructs full-length genotypes (R² = 0.94) and fitness using short-read data from directed evolution experiments, with substantial improvements over related methods. We validate Evoracle on phage-assisted continuous evolution (PACE) and phage-assisted non-continuous evolution (PANCE) of adenine base editors and OrthoRep evolution of drug-resistant enzymes. Evoracle retains strong performance (R² = 0.86) on data with complete linkage loss between neighboring nucleotides and large measurement noise, such as pooled Sanger sequencing data (~US$10 per timepoint), and broadens the accessibility of training machine learning models on gene variant fitnesses. Evoracle can also identify high-fitness variants, including low-frequency ‘rising stars’, well before they are identifiable from consensus mutations.
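To illustrate the linkage problem described above, the following minimal Python sketch (a toy example, not the Evoracle model) builds two hypothetical two-site populations that differ in their full-length genotypes but give identical per-site mutation frequencies. The genotype encoding, the example frequencies and the function name site_frequencies are assumptions made for illustration only.

```python
def site_frequencies(population):
    """Collapse genotype frequencies into per-site mutation frequencies.

    `population` maps a genotype (tuple of 0/1 flags, one per site,
    1 = mutated) to its frequency. Summing per site mimics reads that
    cover only one site at a time, so linkage between sites is lost.
    """
    n_sites = len(next(iter(population)))
    freqs = [0.0] * n_sites
    for genotype, freq in population.items():
        for site, mutated in enumerate(genotype):
            if mutated:
                freqs[site] += freq
    return freqs


# Hypothetical population A: wild type and a double mutant only.
linked = {(0, 0): 0.5, (1, 1): 0.5}
# Hypothetical population B: two single mutants only.
unlinked = {(1, 0): 0.5, (0, 1): 0.5}

print(site_frequencies(linked))
print(site_frequencies(unlinked))
```

Both populations yield the same per-site frequencies, which is why a single snapshot of fully unlinked short-read data cannot tell them apart.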

Fig. 1: Genotype reconstruction during evolution from short-read sequencing data with incomplete physical linkage.
Fig. 2: Evolutionary fitness reconstruction from pooled Sanger sequencing of OrthoRep campaigns.
Fig. 3: Evolutionary time-series frequency reconstruction from non-continuous directed evolution data.
Fig. 4: Robustness to shorter read lengths.
Fig. 5: Robustness to measurement noise.
Fig. 6: Model-guided fitness optimization.

Data availability

The sequencing data generated during this study are available at the NCBI Sequence Read Archive database under accession code PRJNA625117. Processed data have been deposited at https://doi.org/10.6084/m9.figshare.12121359.

Code availability

The code used for data processing and analysis is available at https://github.com/maxwshen/evoracle-dataprocessinganalysis. The Evoracle model is available at https://github.com/maxwshen/evoracle.

Acknowledgements

This work was supported by NIH R01 EB031172, R01 EB027793 and R35 GM118062, and the HHMI. We acknowledge an NSF Graduate Research Fellowship to M.W.S. We thank A. Vieira for assistance editing the manuscript.

Author information

Contributions

Conceptualization, investigation, and computational and statistical analyses were performed by M.W.S. Data curation and formal analysis were conducted by M.W.S. Software and methodology were provided by M.W.S. and resources by K.T.Z. Validation was performed by M.W.S. Project administration was carried out by M.W.S. and D.R.L. The manuscript was written by M.W.S. and D.R.L. Visualization was provided by M.W.S. Supervision and funding acquisition were performed by D.R.L. TadA next-generation sequencing was performed by K.T.Z.

Corresponding author

Correspondence to David R. Liu.

Ethics declarations

Competing interests

D.R.L. is a co-founder of Beam Therapeutics, Prime Medicine, Editas Medicine and Pairwise Plants, companies that use genome editing technologies.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Evoracle model properties.

a, Regularization strategies. Comparison of loss incurred by L2 norm, variance, normalized statistical skew, and unnormalized statistical skew (our skew) regularizers for distributions of three variables. b, Synthetic data to demonstrate the utility of the skew regularizer. The top left graph shows a ground-truth simulated population containing only a wild-type genotype and a double mutant. Observed single-mutation frequencies from the ground-truth simulation were used by Evoracle to infer full-length genotype trajectories of the wild-type genotype, both single mutants, and the double mutant. Evoracle was run with varying values of beta (top right, bottom left, and bottom right). When beta is higher, Evoracle infers the ground-truth trajectories more accurately. Inferred genotype frequencies are plotted with a small jitter to show overlapping lines clearly. c, Robustness to hyperparameters. Performance while varying hyperparameters alpha and beta for Cry1Ac data. Reported statistics summarize performance across ten replicates with random parameter initializations.
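As a rough illustration of the four quantities compared in panel a, the sketch below evaluates them on a few hypothetical three-variable distributions. The caption does not give Evoracle's exact formulas, so the textbook definitions used here are assumptions. In particular, the unnormalized skew is taken to be the third central moment, and the normalized skew is that moment divided by the variance raised to the power 3/2. How each quantity enters the training loss (sign and weighting) is not stated in the caption and is not modeled.

```python
import math


def l2_norm(p):
    """Euclidean norm of the frequency vector."""
    return math.sqrt(sum(x * x for x in p))


def variance(p):
    """Population variance of the frequencies."""
    mean = sum(p) / len(p)
    return sum((x - mean) ** 2 for x in p) / len(p)


def unnormalized_skew(p):
    """Third central moment (assumed meaning of 'unnormalized skew')."""
    mean = sum(p) / len(p)
    return sum((x - mean) ** 3 for x in p) / len(p)


def normalized_skew(p):
    """Third central moment divided by variance**1.5; 0.0 if variance ~ 0."""
    var = variance(p)
    if var < 1e-12:
        return 0.0
    return unnormalized_skew(p) / var ** 1.5


# Hypothetical distributions over three genotypes (each sums to 1).
examples = {
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "mixed": [0.6, 0.3, 0.1],
    "one-hot": [1.0, 0.0, 0.0],
}
for name, p in examples.items():
    print(
        f"{name:>8}  l2={l2_norm(p):.3f}  var={variance(p):.3f}  "
        f"skew={normalized_skew(p):.3f}  unnorm_skew={unnormalized_skew(p):.4f}"
    )
```

Under these assumed definitions, the uniform distribution scores zero on variance and both skew measures, while distributions dominated by a single genotype score higher.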

Extended Data Fig. 2 Evaluating Evoracle’s genotype proposal strategy.

a-b, Sequence proposal strategies. Performance with varying full-length genotype proposal strategies for (a) Cry1Ac data, and (b) TadA data. N = 40 replicates. Box plot depicts median and interquartile range. Default strategy is described in the Methods; x2 to x100 represent adding full-length genotypes comprising combinations of mutations to increase the total number of reconstructed full-length genotypes by the stated multiplicative factor of the default number. See Methods for more details.

Extended Data Fig. 3 Evolutionary fitness reconstruction from pooled Sanger sequencing of OrthoRep campaigns.

a, Comparison of ground-truth and inferred fitness, indicating a negative epistatic interaction between A-76V and D384Y, S404C in Cry1Ac. b-d, Comparison of MIC values and inferred fitness for evolved PfDHFR variants.

Extended Data Fig. 4 Evoracle performance on ABE8e evolution replicates.

Model evaluation on replicate PACE experiments 1 and 3 of the ABE8e directed evolution campaign. Samples 1-20 are from low-stringency PANCE, samples 21-29 are from high-stringency PANCE, and samples 30-36 are from PACE. a, Observed frequencies of 34 mutations. Colors represent amino acid mutations, using the same coloring scheme as in Fig. 2a-b. b, Observed full-length genotype trajectories. Colors represent full-length genotypes, using the same coloring scheme as in Fig. 2c-d. c, Inferred full-length genotype trajectories. Colors represent full-length genotypes, using the same coloring scheme as in Fig. 2c-d. d, Consistency between observed and predicted full-length genotype frequencies; scatter plot and swarm plot with kernel density estimate.

Extended Data Fig. 5 Evaluation of fitness inference.

a-b, Comparison of inferred fitness to fitness calculated from full-length reads for (a) Cry1Ac and (b) TadA.

Extended Data Fig. 6 Evoracle performance with varying sequencing read depth.

a-b, Full-length genotype reconstruction performance across timepoints with varying simulated read depths using binomial samples for (a) Cry1Ac and (b) TadA. Box plot depicts median and interquartile range. N = 50 independent replicates with random seeds.
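The read-depth simulation mentioned in this caption can be sketched as follows: for a given true mutation frequency, draw a binomial number of mutant reads at a chosen depth and report the observed frequency. This is a minimal standard-library illustration, not the authors' analysis code. The function name simulate_observed_frequency, the seed, the example frequency of 0.3 and the depths are all hypothetical.

```python
import random


def simulate_observed_frequency(true_freq, read_depth, rng):
    """Draw a binomial sample of reads and return the observed frequency.

    Each of `read_depth` reads independently carries the mutation with
    probability `true_freq`.
    """
    if read_depth <= 0:
        raise ValueError("read_depth must be positive")
    mutant_reads = sum(1 for _ in range(read_depth) if rng.random() < true_freq)
    return mutant_reads / read_depth


rng = random.Random(0)  # fixed seed (arbitrary) for reproducibility
true_freq = 0.3  # hypothetical true mutation frequency
for depth in (10, 100, 1000, 10000):
    observed = simulate_observed_frequency(true_freq, depth, rng)
    print(f"depth={depth:>5}  observed={observed:.3f}")
```

Repeating the draw with different seeds gives independent replicates, and the spread of the observed frequency around the true value narrows as the depth increases.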

Extended Data Fig. 7 Comparison to related methods.

a, Observed Cry1Ac (2,138 nt) genotypes from 34 timepoints (spanning 528 h) of PACE from PacBio long-read sequencing data. Colors represent distinct genotypes. Figure is the same as Fig. 1c and reproduced for convenience. b, Cry1Ac genotype frequencies reconstructed by SGML. Gray lines indicate genotypes that are not present in PacBio data. c, Comparison of performance by clonal Sanger sequencing depth compared to pooled Sanger sequencing. Box plots indicate median and interquartile range, and whiskers indicate extrema. N = 50 random seed replicates. d, Comparison of rising star performance by clonal Sanger sequencing depth vs pooled Sanger sequencing on 12 h interpolated Cry1Ac data.

Supplementary information

Supplementary Information

Supplementary Table 1 and Notes 1–4.

Reporting Summary

Cite this article

Shen, M.W., Zhao, K.T. & Liu, D.R. Reconstruction of evolving gene variants and fitness from short sequencing reads. Nat Chem Biol 17, 1188–1198 (2021). https://doi.org/10.1038/s41589-021-00876-6

  • DOI: https://doi.org/10.1038/s41589-021-00876-6
