  • Letter

Deep diversification of an AAV capsid protein by machine learning

Abstract

Modern experimental technologies can assay large numbers of biological sequences, but engineered protein libraries rarely exceed the sequence diversity of natural protein families. Machine learning (ML) models trained directly on experimental data without biophysical modeling provide one route to accessing the full potential diversity of engineered proteins. Here we apply deep learning to design highly diverse adeno-associated virus 2 (AAV2) capsid protein variants that remain viable for packaging of a DNA payload. Focusing on a 28-amino acid segment, we generated 201,426 variants of the AAV2 wild-type (WT) sequence yielding 110,689 viable engineered capsids, 57,348 of which surpass the average diversity of natural AAV serotype sequences, with 12–29 mutations across this region. Even when trained on limited data, deep neural network models accurately predict capsid viability across diverse variants. This approach unlocks vast areas of functional but previously unreachable sequence space, with many potential applications for the generation of improved viral vectors and protein therapeutics.
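The counts in the abstract can be turned into proportions with a few lines of arithmetic. The sketch below does that and also shows one possible way to count mutations relative to a reference segment. Because the reported range (12–29 mutations) exceeds the 28-residue segment length, the count presumably includes insertions, so the sketch uses Levenshtein edit distance; that metric choice is an assumption, not something stated in the text above. The function name and the toy sequences are hypothetical placeholders, not real AAV2 sequences.

```python
# Counts taken from the abstract.
TOTAL_VARIANTS = 201_426
VIABLE = 110_689
VIABLE_BEYOND_NATURAL_DIVERSITY = 57_348

# Proportions implied by those counts.
viable_fraction = VIABLE / TOTAL_VARIANTS
diverse_of_viable = VIABLE_BEYOND_NATURAL_DIVERSITY / VIABLE
diverse_of_total = VIABLE_BEYOND_NATURAL_DIVERSITY / TOTAL_VARIANTS

print(f"viable / generated:            {viable_fraction:.1%}")
print(f"beyond natural div. / viable:  {diverse_of_viable:.1%}")
print(f"beyond natural div. / total:   {diverse_of_total:.1%}")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-residue
    substitutions, insertions and deletions that turn `a` into `b`.

    Assumed here as a stand-in for "number of mutations"; the source
    text does not specify the metric.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Toy sequences only (hypothetical, not the AAV2 segment).
reference = "ACDEFG"
print(edit_distance(reference, "ACWEFG"))   # one substitution
print(edit_distance(reference, "ACDKEFG"))  # one insertion
```

Edit distance is used instead of a position-by-position comparison because a plain Hamming count is undefined once an insertion changes the sequence length.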

Fig. 1: Generation of diverse sequence variants guided by ML models trained on deep mutational libraries.
Fig. 2: Experimental validation of synthetic sequences demonstrates high performance and robustness of NN models to training data composition.
Fig. 3: Neural network models generate greater diversity across positions.
Fig. 4: Neural networks generate greater functional diversity at equivalent levels of performance relative to additive and LR models.

Data availability

Experimental data for all three experiments are available at NCBI SRA under accession code PRJNA673640.

Code availability

All models were implemented and trained with the TensorFlow 1.3 API, using the architectures described in Methods. The training and validation datasets used to create each model are part of the experimental dataset released as described in Data availability. The code required to construct the A39 training data, and to synthesize, process and analyze the experimental data, is available for download (https://github.com/churchlab/Deep_diversification_AAV), as are the IPython notebooks that reproduce the analysis figures from the main text (https://github.com/google-research/google-research/tree/master/aav).

Acknowledgements

The authors thank K. Kohlhoff, S. Kearnes, D. Belanger, E. Bixby and J. Gerold for helpful discussions. The authors thank the Wyss Institute for funding. L.J.C. gratefully acknowledges support from the Simons Foundation.

Author information

Contributions

E.D.K., L.J.C., A.B., G.M.C. and P.F.R. conceived the study. E.D.K., N.K.J. and P.J.O. performed in vitro experiments. D.H.B., A.B. and L.J.C. designed, implemented and used ML models to generate variants, with input from E.D.K. and S.S. D.H.B., S.S., A.B., L.J.C. and E.D.K. analyzed the data. D.H.B., S.S., A.B., L.J.C. and E.D.K. wrote the paper, with input from all authors. A.B., P.F.R., G.M.C., L.J.C. and E.D.K. supervised the project and secured funding.

Corresponding authors

Correspondence to George M. Church, Lucy J. Colwell or Eric D. Kelsic.

Ethics declarations

Competing interests

E.D.K., P.J.O., N.K.J., S.S. and G.M.C. performed research while at Harvard University, and E.D.K. and S.S. also performed research while at Dyno Therapeutics. E.D.K., S.S. and G.M.C. hold equity at Dyno Therapeutics. A full list of G.M.C.’s tech transfer, advisory roles and funding sources can be found on the website: http://arep.med.harvard.edu/gmc/tech.html. Harvard University has filed a provisional patent application for inventions related to this work. D.H.B., A.B., L.J.C. and P.F.R. performed research as part of their employment at Google LLC. Google is a technology company that sells ML services as part of its business.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–5 and Tables 1–7.

Reporting Summary

Supplementary Data

Supplementary Data 1.

About this article

Cite this article

Bryant, D.H., Bashir, A., Sinai, S. et al. Deep diversification of an AAV capsid protein by machine learning. Nat Biotechnol 39, 691–696 (2021). https://doi.org/10.1038/s41587-020-00793-4

  • DOI: https://doi.org/10.1038/s41587-020-00793-4
