Deep diversification of an AAV capsid protein by machine learning

Date: 2021-02-11

Nature Biotechnology (2021)

Abstract

Modern experimental technologies can assay large numbers of biological sequences, but engineered protein libraries rarely exceed the sequence diversity of natural protein families. Machine learning (ML) models trained directly on experimental data without biophysical modeling provide one route to accessing the full potential diversity of engineered proteins. Here we apply deep learning to design highly diverse adeno-associated virus 2 (AAV2) capsid protein variants that remain viable for packaging of a DNA payload. Focusing on a 28-amino acid segment, we generated 201,426 variants of the AAV2 wild-type (WT) sequence yielding 110,689 viable engineered capsids, 57,348 of which surpass the average diversity of natural AAV serotype sequences, with 12–29 mutations across this region. Even when trained on limited data, deep neural network models accurately predict capsid viability across diverse variants. This approach unlocks vast areas of functional but previously unreachable sequence space, with many potential applications for the generation of improved viral vectors and protein therapeutics.
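The counts quoted in the abstract can be turned into rates with a few lines of arithmetic. The sketch below is illustrative only: the three counts come from the abstract, while the constant names and the `fraction` helper are assumptions introduced here and are not part of the authors' released code.

```python
# Counts quoted in the abstract (names are hypothetical, chosen for this sketch).
N_GENERATED = 201_426       # variants of the 28-amino-acid segment that were generated
N_VIABLE = 110_689          # variants that remained viable for packaging
N_BEYOND_NATURAL = 57_348   # viable variants exceeding average natural serotype diversity


def fraction(part: int, whole: int) -> float:
    """Return part / whole, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Share of generated variants that were viable.
viable_rate = fraction(N_VIABLE, N_GENERATED)

# Share of viable variants that surpass average natural serotype diversity.
beyond_natural_among_viable = fraction(N_BEYOND_NATURAL, N_VIABLE)

# Share of all generated variants that are both viable and beyond natural diversity.
beyond_natural_overall = fraction(N_BEYOND_NATURAL, N_GENERATED)

print(f"viable / generated:            {viable_rate:.1%}")
print(f"beyond natural / viable:       {beyond_natural_among_viable:.1%}")
print(f"beyond natural / generated:    {beyond_natural_overall:.1%}")
```

Under these numbers, roughly 55% of the generated variants were viable, and about half of the viable capsids exceed the average diversity of natural AAV serotype sequences.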

Fig. 1: Generation of diverse sequence variants guided by ML models trained on deep mutational libraries.
Fig. 2: Experimental validation of synthetic sequences demonstrates high performance and robustness of NN models to training data composition.
Fig. 3: Neural network models generate greater diversity across positions.
Fig. 4: Neural networks generate greater functional diversity at equivalent levels of performance relative to additive and LR models.

Data availability

Experimental data for all three experiments are available at NCBI SRA under accession code PRJNA673640.

Code availability

The TensorFlow 1.3 API was used to implement and train all models using the architectures described in Methods. The training and validation datasets used to create each model are available as part of the experimental dataset released as described in Data availability. The code required to construct the A39 training data, and to synthesize, process and analyze the experimental data, is available for download (https://github.com/churchlab/Deep_diversification_AAV), as are the IPython notebooks that reproduce the analysis figures from the main text (https://github.com/google-research/google-research/tree/master/aav).

Acknowledgements

The authors thank K. Kohlhoff, S. Kearnes, D. Belanger, E. Bixby and J. Gerold for helpful discussions. The authors thank the Wyss Institute for funding. L.J.C. gratefully acknowledges support from the Simons Foundation.

Author information

Author notes

  1. Pierce J. Ogden

    Present address: Manifold Biotechnologies, Allston, MA, USA

  2. These authors contributed equally: Drew H. Bryant, Ali Bashir, Sam Sinai.

Affiliations

  1. Google Research, Mountain View, CA, USA

    Drew H. Bryant, Ali Bashir, Patrick F. Riley & Lucy J. Colwell

  2. Wyss Institute for Biologically Inspired Engineering, Boston, MA, USA

    Sam Sinai, Nina K. Jain, Pierce J. Ogden, George M. Church & Eric D. Kelsic

  3. Department of Genetics, Harvard Medical School, Boston, MA, USA

    Sam Sinai, Nina K. Jain, Pierce J. Ogden, George M. Church & Eric D. Kelsic

  4. Dyno Therapeutics, Cambridge, MA, USA

    Sam Sinai & Eric D. Kelsic

  5. Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA

    Sam Sinai

  6. Department of Chemistry, University of Cambridge, Cambridge, UK

    Lucy J. Colwell

Authors

  Drew H. Bryant
  Ali Bashir
  Sam Sinai
  Nina K. Jain
  Pierce J. Ogden
  Patrick F. Riley
  George M. Church
  Lucy J. Colwell
  Eric D. Kelsic

Contributions

E.D.K., L.J.C., A.B., G.M.C. and P.F.R. conceived the study. E.D.K., N.K.J. and P.J.O. performed in vitro experiments. D.H.B., A.B. and L.J.C. designed, implemented and used ML models to generate variants, with input from E.D.K. and S.S. D.H.B., S.S., A.B., L.J.C. and E.D.K. analyzed the data. D.H.B., S.S., A.B., L.J.C. and E.D.K. wrote the paper, with input from all authors. A.B., P.F.R., G.M.C., L.J.C. and E.D.K. supervised the project and secured funding.

Corresponding authors

Correspondence to George M. Church, Lucy J. Colwell or Eric D. Kelsic.

Ethics declarations

Competing interests

E.D.K., P.J.O., N.K.J., S.S. and G.M.C. performed research while at Harvard University, and E.D.K. and S.S. also performed research while at Dyno Therapeutics. E.D.K., S.S. and G.M.C. hold equity at Dyno Therapeutics. A full list of G.M.C.’s tech transfer, advisory roles and funding sources can be found on the website: http://arep.med.harvard.edu/gmc/tech.html. Harvard University has filed a provisional patent application for inventions related to this work. D.H.B., A.B., L.J.C. and P.F.R. performed research as part of their employment at Google LLC. Google is a technology company that sells ML services as part of its business.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–5 and Tables 1–7.

Reporting Summary

Supplementary Data

Supplementary Data 1.

