Unified rational protein engineering with sequence-based deep representation learning

Abstract

Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabeled amino-acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach predicts the stability of natural and de novo designed proteins, and the quantitative function of molecularly diverse mutants, competitively with the state-of-the-art methods. UniRep further enables two orders of magnitude efficiency improvement in a protein engineering task. UniRep is a versatile summary of fundamental protein features that can be applied across protein engineering informatics.
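The workflow summarized above, a fixed-length sequence representation with a simple model fitted on top, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `represent` is a hypothetical stand-in that uses amino-acid composition rather than the learned UniRep representation, `fit_ridge` and `predict` are hypothetical helpers, and the toy sequences and labels are invented. It is not the authors' implementation, which is available from the repositories listed under Code availability.

```python
from typing import List, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def represent(sequence: str) -> List[float]:
    """Map a sequence of any length to a fixed-length vector.

    Stand-in featurizer (amino-acid composition), NOT the UniRep model.
    """
    if not sequence:
        raise ValueError("sequence must be non-empty")
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Linear prediction from a representation vector."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def fit_ridge(
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    alpha: float = 1e-3,
    lr: float = 0.5,
    steps: int = 2000,
) -> Tuple[List[float], float]:
    """Fit an L2-regularized linear model by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The penalty is applied to the weights only, not to the bias.
        w = [wj - lr * (gw / n + alpha * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


if __name__ == "__main__":
    # Invented toy data: the label is simply the fraction of alanine.
    train_seqs = ["AAAA", "AAGG", "GGGG", "AGGG"]
    train_y = [1.0, 0.5, 0.0, 0.25]
    w, b = fit_ridge([represent(s) for s in train_seqs], train_y)
    print(predict(w, b, represent("AAAG")))
```

The point of the design is that the top model never sees raw sequences, only fixed-length vectors, so the featurizer could in principle be swapped for a learned representation without changing the regression code.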

Fig. 1: Workflow to learn and apply deep protein representations.
Fig. 2: UniRep encodes amino-acid physicochemistry, organism level information, secondary structure, evolutionary and functional information, and higher-order structural features.
Fig. 3: UniRep predicts structural and functional properties of proteins.
Fig. 4: UniRep, fine-tuned to a local evolutionary context, facilitates protein engineering by enabling generalization to distant peaks in the sequence landscape.

Data availability

All data are available in the main text or the supplementary materials.

Code availability

Code for UniRep model training and inference, with trained weights and links to all necessary data, is available in a public repository at https://github.com/churchlab/UniRep. Code to reproduce all analyses and regenerate figures, with links to preprocessed benchmark datasets, is available online at https://github.com/churchlab/UniRep-analysis.

Acknowledgements

We thank J. Aach, A. Taylor-Weiner, D. Goodman, P. Ogden, G. Kuznetsov, S. Sinai, A. Tucker, M. Turpin, J. Swett, N. Thomas, R. Sha, C. Bakerlee and K. Fish for valuable feedback and discussion. S.B. was supported by an NIH Training Grant (no. T32HG002295) to the Harvard Bioinformatics and Integrative Genomics program as well as an NSF GRFP Fellowship. M.A. was supported through NIGMS Grant no. P50GM107618 and NIH grant no. U54-CA225088. E.C.A. and G.K. were supported by the Center for Effective Altruism. E.C.A. was partially supported by the Wyss Institute for Biologically Inspired Engineering. Computational resources were, in part, generously provided by the AWS Cloud Credits for the Research program.

Author information

Contributions

E.C.A. and G.K. conceived the study. E.C.A., G.K. and S.B. conceived the experiments, managed data and performed the analysis. M.A. performed large-scale RGN inference and managed data and software for parts of the analysis. G.M.C. supervised the project. E.C.A., G.K. and S.B. wrote the manuscript with help from all authors.

Corresponding author

Correspondence to George M. Church.

Ethics declarations

Competing interests

E.C.A., G.K. and S.B. are in the process of pursuing a patent on this technology. S.B. is a former consultant for Flagship Pioneering company VL57 (now VL56). A full list of G.M.C.’s tech transfer, advisory roles and funding sources can be found on the laboratory’s website: http://arep.med.harvard.edu/gmc/tech.html.

Additional information

Peer review information: Nicole Rusk was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–15 and Tables 1–9.

Reporting Summary

Supplementary Data Set 1

All test set results in .xlsx format

Supplementary Data Set 2

All validation set results in .xlsx format

Supplementary Data Set 3

All test set results graphically presented in .pdf format

Cite this article

Alley, E.C., Khimulya, G., Biswas, S. et al. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 16, 1315–1322 (2019). https://doi.org/10.1038/s41592-019-0598-1
