Unified rational protein engineering with sequence-based deep representation learning

Alley, Ethan C.; Khimulya, Grigory; Biswas, Surojit; AlQuraishi, Mohammed; Church, George M.

doi:10.1038/s41592-019-0598-1

Article
Published: 21 October 2019

Nature Methods volume 16, pages 1315–1322 (2019)

Abstract

Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabeled amino-acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach predicts the stability of natural and de novo designed proteins, and the quantitative function of molecularly diverse mutants, competitively with the state-of-the-art methods. UniRep further enables two orders of magnitude efficiency improvement in a protein engineering task. UniRep is a versatile summary of fundamental protein features that can be applied across protein engineering informatics.
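The pattern the abstract describes, a simple model fitted on top of a fixed-length sequence representation, can be illustrated with a minimal sketch. This is not the authors' code and does not call the UniRep repository: the `embed` function below is a hypothetical stand-in (plain amino-acid composition) for any learned representation, and `solve`, `fit_ridge` and `predict` are illustrative names for a closed-form ridge regression written with the Python standard library only.

```python
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def embed(sequence: str) -> List[float]:
    """Placeholder embedding: mean of one-hot residue vectors (composition).

    NOT UniRep. It only stands in for any fixed-length representation.
    Raises ValueError on letters outside the 20 standard residues.
    """
    counts = [0.0] * len(AMINO_ACIDS)
    for residue in sequence:
        counts[AMINO_ACIDS.index(residue)] += 1.0
    return [c / len(sequence) for c in counts]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(features: Sequence[Sequence[float]],
              targets: Sequence[float],
              alpha: float = 1e-3) -> List[float]:
    """Ridge regression: w = (X^T X + alpha I)^-1 X^T y.

    A constant 1.0 column is appended as a bias term; for simplicity the
    bias is regularized together with the other weights.
    """
    x = [list(f) + [1.0] for f in features]
    d = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) + (alpha if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(row[i] * t for row, t in zip(x, targets)) for i in range(d)]
    return solve(xtx, xty)


def predict(weights: Sequence[float], feature: Sequence[float]) -> float:
    """Linear prediction using the weights returned by fit_ridge."""
    return sum(w * v for w, v in zip(weights, list(feature) + [1.0]))


if __name__ == "__main__":
    # Synthetic toy data, made up for illustration: the "property" is
    # simply ten times the alanine fraction of each sequence.
    train = ["AAAA", "AAGG", "GGGG", "AGGG"]
    labels = [10.0 * embed(s)[0] for s in train]
    weights = fit_ridge([embed(s) for s in train], labels, alpha=1e-6)
    print(predict(weights, embed("AAAG")))
```

In practice `embed` would be replaced by the representation of interest, and the sequences and labels by measured data; the top model itself stays this simple.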

**Fig. 1: Workflow to learn and apply deep protein representations.**

**Fig. 2: UniRep encodes amino-acid physicochemistry, organism level information, secondary structure, evolutionary and functional information, and higher-order structural features.**

**Fig. 3: UniRep predicts structural and functional properties of proteins.**

**Fig. 4: UniRep, fine-tuned to a local evolutionary context, facilitates protein engineering by enabling generalization to distant peaks in the sequence landscape.**

Data availability

All data are available in the main text or the supplementary materials.

Code availability

Code for UniRep model training and inference, together with trained weights and links to all necessary data, is available in a public repository at https://github.com/churchlab/UniRep. Code to reproduce all analysis and regenerate figures, with links to preprocessed benchmark datasets, is available online at https://github.com/churchlab/UniRep-analysis.

Acknowledgements

We thank J. Aach, A. Taylor-Weiner, D. Goodman, P. Ogden, G. Kuznetsov, S. Sinai, A. Tucker, M. Turpin, J. Swett, N. Thomas, R. Sha, C. Bakerlee and K. Fish for valuable feedback and discussion. S.B. was supported by an NIH Training Grant (no. T32HG002295) to the Harvard Bioinformatics and Integrative Genomics program as well as an NSF GRFP Fellowship. M.A. was supported through NIGMS grant no. P50GM107618 and NIH grant no. U54-CA225088. E.C.A. and G.K. were supported by the Center for Effective Altruism. E.C.A. was partially supported by the Wyss Institute for Biologically Inspired Engineering. Computational resources were, in part, generously provided by the AWS Cloud Credits for Research program.

Author information

These authors contributed equally: Ethan C. Alley, Grigory Khimulya, Surojit Biswas.
Unaffiliated: Grigory Khimulya.

Authors and Affiliations

Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, USA
Ethan C. Alley, Surojit Biswas & George M. Church
MIT Media Laboratory, Cambridge, MA, USA
Ethan C. Alley
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Surojit Biswas
Department of Systems Biology, Harvard Medical School, Boston, MA, USA
Mohammed AlQuraishi
Department of Genetics, Harvard Medical School, Boston, MA, USA
George M. Church

Contributions

E.C.A. and G.K. conceived the study. E.C.A., G.K. and S.B. conceived the experiments, managed data and performed the analysis. M.A. performed large-scale RGN inference and managed data and software for parts of the analysis. G.M.C. supervised the project. E.C.A., G.K. and S.B. wrote the manuscript with help from all authors.

Corresponding author

Correspondence to George M. Church.

Ethics declarations

Competing interests

E.C.A., G.K. and S.B. are in the process of pursuing a patent on this technology. S.B. is a former consultant for Flagship Pioneering company VL57 (now VL56). A full list of G.M.C.’s tech transfer, advisory roles and funding sources can be found on the laboratory’s website: http://arep.med.harvard.edu/gmc/tech.html.

Additional information

Peer review information Nicole Rusk was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–15 and Tables 1–9.

Reporting Summary

Supplementary Data Set 1

All test set results in .xlsx format

Supplementary Data Set 2

All validation set results in .xlsx format

Supplementary Data Set 3

All test set results graphically presented in .pdf format

About this article

Cite this article

Alley, E.C., Khimulya, G., Biswas, S. et al. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 16, 1315–1322 (2019). https://doi.org/10.1038/s41592-019-0598-1

Received: 08 April 2019
Accepted: 11 September 2019
Published: 21 October 2019
Issue Date: December 2019
DOI: https://doi.org/10.1038/s41592-019-0598-1
