  • Review Article

Machine learning for functional protein design

Abstract

Recent breakthroughs in AI coupled with the rapid accumulation of protein sequence and structure data have radically transformed computational protein design. New methods promise to escape the constraints of natural and laboratory evolution, accelerating the generation of proteins for applications in biotechnology and medicine. To make sense of the exploding diversity of machine learning approaches, we introduce a unifying framework that classifies models on the basis of their use of three core data modalities: sequences, structures and functional labels. We discuss the new capabilities and outstanding challenges for the practical design of enzymes, antibodies, vaccines, nanomachines and more. We then highlight trends shaping the future of this field, from large-scale assays to more robust benchmarks, multimodal foundation models, enhanced sampling strategies and laboratory automation.

Fig. 1: Protein design objectives.
Fig. 2: Protein design applications.
Fig. 3: Typology of protein design models.

References

  1. Lu, H. et al. Machine learning-aided engineering of hydrolases for PET depolymerization. Nature 604, 662–667 (2022).

    CAS  PubMed  ADS  Google Scholar 

  2. Giessel, A. et al. Therapeutic enzyme engineering using a generative neural network. Sci. Rep. 12, 1536 (2022).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  3. Fram, B. et al. Simultaneous enhancement of multiple functional properties using evolution-informed protein design. Preprint at bioRxiv https://doi.org/10.1101/2023.05.09.539914 (2023).

  4. Sumida, K. H. et al. Improving protein expression, stability, and function with ProteinMPNN. J. Am. Chem. Soc. 146, 2054–2061 (2024).

    CAS  PubMed  PubMed Central  Google Scholar 

  5. Schubert, B. et al. Population-specific design of de-immunized protein biotherapeutics. PLoS Comput. Biol. 14, e1005983 (2018).

    PubMed  PubMed Central  Google Scholar 

  6. Salvat, R. S. et al. Computationally optimized deimmunization libraries yield highly mutated enzymes with low immunogenicity and enhanced activity. Proc. Natl Acad. Sci. USA 114, E5085–E5093 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  7. Jankowski, W. et al. Mitigation of T-cell dependent immunogenicity by reengineering factor VIIa analogue. Blood Adv. 3, 2668–2678 (2019).

    CAS  Google Scholar 

  8. Mufarrege, E. F. et al. De-immunized and functional therapeutic (DeFT) versions of a long lasting recombinant α interferon for antiviral therapy. Clin. Immunol. 176, 31–41 (2017).

    CAS  PubMed  Google Scholar 

  9. Winterling, K. et al. Development of a novel fully functional coagulation factor VIII with reduced immunogenicity utilizing an in silico prediction and deimmunization approach. J. Thromb. Haemost. 19, 2161–2170 (2021).

    CAS  PubMed  PubMed Central  Google Scholar 

  10. Zhao, H. et al. Globally deimmunized lysostaphin evades human immune surveillance and enables highly efficacious repeat dosing. Sci. Adv. 6, eabb9011 (2020).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  11. Zhao, H. et al. Depletion of T cell epitopes in lysostaphin mitigates anti-drug antibody response and enhances antibacterial efficacy in vivo. Chem. Biol. 22, 629–639 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  12. Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  13. Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  14. Shin, J.-E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 2403 (2021).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  15. Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv. Neural Inf. Process. Syst. 34, 29287–29303 (2021).

    Google Scholar 

  16. Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021).

    CAS  PubMed  ADS  Google Scholar 

  17. Brandes, N., Goldman, G., Wang, C. H., Ye, C. J. & Ntranos, V. Genome-wide prediction of disease variant effects with a deep protein language model. Nat. Genet. 55, 1512–1522 (2023).

    CAS  PubMed  PubMed Central  Google Scholar 

  18. Notin, P. et al. ProteinGym: large-scale benchmarks for protein fitness prediction and design. In Advances in Neural Information Processing Systems (NeurIPS) Vol. 36 (2023).

  19. Russ, W. P. et al. An evolution-based model for designing chorismate mutase enzymes. Science 369, 440–445 (2020).

    MathSciNet  CAS  PubMed  ADS  Google Scholar 

  20. Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).

  21. Lian, X. et al. Deep learning-enabled design of synthetic orthologs of a signaling protein. Preprint at bioRxiv https://doi.org/10.1101/2022.12.21.521443 (2022).

  22. Dauparas, J. et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  23. Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  24. Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).

    CAS  PubMed  Google Scholar 

  25. Eid, F.-E. et al. Systematic multi-trait AAV capsid engineering for efficient gene delivery. Preprint at bioRxiv https://doi.org/10.1101/2022.12.22.521680 (2022).

  26. Li, Y. et al. A diverse family of thermostable cytochrome P450s created by recombination of stabilizing fragments. Nat. Biotechnol. 25, 1051–1056 (2007).

    CAS  PubMed  Google Scholar 

  27. Pak, M. A., Dovidchenko, N. V., Sharma, S. M. & Ivankov, D. N. New mega dataset combined with deep neural network makes a progress in predicting impact of mutation on protein stability. Preprint at bioRxiv https://doi.org/10.1101/2022.12.31.522396 (2023).

  28. Umerenkov, D. et al. PROSTATA: protein stability assessment using transformers. Preprint at bioRxiv https://doi.org/10.1101/2022.12.25.521875 (2022).

  29. Schmitt, L. T., Paszkowski-Rogacz, M., Jug, F. & Buchholz, F. Prediction of designer-recombinases for DNA editing with generative deep learning. Nat. Commun. 13, 7966 (2022).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  30. Wu, Z., Kan, S. B. J., Lewis, R. D., Wittmann, B. J. & Arnold, F. H. Machine learning-assisted directed protein evolution with combinatorial libraries. Proc. Natl Acad. Sci. USA 116, 8852–8858 (2019).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  31. Malbranke, C. et al. Computational design of novel Cas9 PAM-interacting domains using evolution-based modelling and structural quality assessment. PLoS Comput. Biol. 19, e1011621 (2023).

    CAS  PubMed  PubMed Central  Google Scholar 

  32. Harvey, E. P. et al. An in silico method to assess antibody fragment polyreactivity. Nat. Commun. 13, 7554 (2022).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  33. Fox, R. J. et al. Improving catalytic function by ProSAR-driven enzyme evolution. Nat. Biotechnol. 25, 338–344 (2007).

    CAS  PubMed  Google Scholar 

  34. Romero, P. A., Krause, A. & Arnold, F. H. Navigating the protein fitness landscape with Gaussian processes. Proc. Natl Acad. Sci. USA 110, E193–E201 (2013).

    MathSciNet  CAS  PubMed  ADS  Google Scholar 

  35. Saito, Y. et al. Machine-learning-guided library design cycle for directed evolution of enzymes: the effects of training data composition on sequence space exploration. ACS Catal. 11, 14615–14624 (2021).

    CAS  Google Scholar 

  36. Repecka, D. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat. Mach. Intell. 3, 324–333 (2021).

    Google Scholar 

  37. Sinai, S., Jain, N., Church, G. M. & Kelsic, E. D. Generative AAV capsid diversification by latent interpolation. Preprint at bioRxiv https://doi.org/10.1101/2021.04.16.440236 (2021).

  38. Hawkins-Hooker, A. et al. Generating functional protein variants with variational autoencoders. PLoS Comput. Biol. 17, e1008736 (2021).

    CAS  PubMed  PubMed Central  Google Scholar 

  39. Hie, B. L. et al. Efficient evolution of human antibodies from general protein language models. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01763-2 (2023).

    Article  PubMed  Google Scholar 

  40. Liu, G. et al. Antibody complementarity determining region design using high-capacity machine learning. Bioinformatics 36, 2126–2133 (2020).

    CAS  PubMed  Google Scholar 

  41. Holst, L. H. et al. De novo design of a polycarbonate hydrolase. Protein Eng. Des. Sel. 36, gzad022 (2023).

    PubMed  Google Scholar 

  42. Siegel, J. B. et al. Computational design of an enzyme catalyst for a stereoselective bimolecular Diels–Alder reaction. Science 329, 309–313 (2010).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  43. Jiang, L. et al. De novo computational design of retro-aldol enzymes. Science 319, 1387–1391 (2008).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  44. Yeh, A. H.-W. et al. De novo design of luciferases using deep learning. Nature 614, 774–780 (2023).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  45. Anishchenko, I. et al. De novo protein design by deep network hallucination. Nature 600, 547–552 (2021).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  46. Verkuil, R. et al. Language models generalize beyond natural proteins. Preprint at bioRxiv https://doi.org/10.1101/2022.12.21.521521 (2022).

  47. Lutz, I. D. et al. Top–down design of protein architectures with reinforcement learning. Science 380, 266–273 (2023).

    CAS  PubMed  ADS  Google Scholar 

  48. Wang, J. et al. Scaffolding protein functional sites using deep learning. Science 377, 387–394 (2022).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  49. Dou, J. et al. De novo design of a fluorescence-activating β-barrel. Nature 561, 485–491 (2018).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  50. Basanta, B. et al. An enumerative algorithm for de novo design of proteins with diverse pocket structures. Proc. Natl Acad. Sci. USA 117, 22135–22145 (2020).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  51. Nijkamp, E., Ruffolo, J., Weinstein, E. N., Naik, N. & Madani, A. ProGen2: exploring the boundaries of protein language models. Cell Syst. 14, 968–978 (2023).

    CAS  PubMed  Google Scholar 

  52. Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348 (2022).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  53. Bloom, J. D., Wilke, C. O., Arnold, F. H. & Adami, C. Stability and the evolvability of function in a model protein. Biophys. J. 86, 2758–2764 (2004).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  54. Bloom, J. D., Labthavikul, S. T., Otey, C. R. & Arnold, F. H. Protein stability promotes evolvability. Proc. Natl Acad. Sci. USA 103, 5869–5874 (2006).

  55. Tokuriki, N., Stricher, F., Serrano, L. & Tawfik, D. S. How protein stability and new functions trade off. PLoS Comput. Biol. 4, e1000002 (2008).

    PubMed  PubMed Central  ADS  Google Scholar 

  56. Nakatani, K. et al. Increase in the thermostability of Bacillus sp. strain TAR-1 xylanase using a site saturation mutagenesis library. Biosci. Biotechnol. Biochem. 82, 1715–1723 (2018).

    CAS  PubMed  Google Scholar 

  57. Katano, Y. et al. Generation of thermostable Moloney murine leukemia virus reverse transcriptase variants using site saturation mutagenesis library and cell-free protein expression system. Biosci. Biotechnol. Biochem. 81, 2339–2345 (2017).

    CAS  PubMed  Google Scholar 

  58. Richardson, T. H. et al. A novel, high performance enzyme for starch liquefaction. J. Biol. Chem. 277, 26501–26507 (2002).

    CAS  PubMed  Google Scholar 

  59. Giver, L., Gershenson, A., Freskgard, P.-O. & Arnold, F. H. Directed evolution of a thermostable esterase. Proc. Natl Acad. Sci. USA 95, 12809–12813 (1998).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  60. Bell, E. L. et al. Directed evolution of an efficient and thermostable PET depolymerase. Nat. Catal. 5, 673–681 (2022).

    CAS  Google Scholar 

  61. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).

    CAS  PubMed  PubMed Central  Google Scholar 

  62. Hsu, C. et al. Learning inverse folding from millions of predicted structures. In Proceedings of the 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 8946–8970 (PMLR, 2022).

  63. Tsuboyama, K. et al. Mega-scale experimental analysis of protein folding stability in biology and protein design. Nature 620, 434–444 (2023).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  64. Dieckhaus, H., Brocidiacono, M., Randolph, N. & Kuhlman, B. Transfer learning to leverage larger datasets for improved prediction of protein stability changes. Proc. Natl Acad. Sci USA 121, e2314853121 (2024).

    PubMed  Google Scholar 

  65. Nagano, N., Orengo, C. A. & Thornton, J. M. One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions. J. Mol. Biol. 321, 741–765 (2002).

    CAS  PubMed  Google Scholar 

  66. Isin, E. M. & Guengerich, F. P. Complex reactions catalyzed by cytochrome P450 enzymes. Biochim. Biophys. Acta 1770, 314–329 (2007).

    CAS  PubMed  Google Scholar 

  67. Guengerich, F. P. & Munro, A. W. Unusual cytochrome P450 enzymes and reactions. J. Biol. Chem. 288, 17065–17073 (2013).

    CAS  PubMed  PubMed Central  Google Scholar 

  68. Khersonsky, O. & Tawfik, D. S. Enzyme promiscuity: a mechanistic and evolutionary perspective. Annu. Rev. Biochem. 79, 471–505 (2010).

    CAS  PubMed  Google Scholar 

  69. Arnold, F. H. Directed evolution: bringing new chemistry to life. Angew. Chem. Int. Ed. Engl. 57, 4143–4148 (2018).

    CAS  PubMed  Google Scholar 

  70. Yang, Y. & Arnold, F. H. Navigating the unnatural reaction space: directed evolution of heme proteins for selective carbene and nitrene transfer. Acc. Chem. Res. 54, 1209–1225 (2021).

    CAS  PubMed  PubMed Central  Google Scholar 

  71. Bedbrook, C. N. et al. Machine learning-guided channelrhodopsin engineering enables minimally invasive optogenetics. Nat. Methods 16, 1176–1184 (2019).

    CAS  PubMed  PubMed Central  Google Scholar 

  72. Wittmann, B. J., Johnston, K. E., Wu, Z. & Arnold, F. H. Advances in machine learning for directed evolution. Curr. Opin. Struct. Biol. 69, 11–18 (2021).

    CAS  PubMed  Google Scholar 

  73. Röthlisberger, D. et al. Kemp elimination catalysts by computational enzyme design. Nature 453, 190–195 (2008).

    PubMed  ADS  Google Scholar 

  74. Sesterhenn, F. et al. De novo protein design enables the precise induction of RSV-neutralizing antibodies. Science 368, eaay5051 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  75. Yang, C. et al. Bottom–up de novo design of functional proteins with complex structural features. Nat. Chem. Biol. 17, 492–500 (2021).

    CAS  PubMed  Google Scholar 

  76. Cao, L. et al. Design of protein-binding proteins from the target structure alone. Nature 605, 551–560 (2022).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  77. Ingraham, J. et al. Illuminating protein space with a programmable generative model. Nature 623, 1070–1078 (2023).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  78. Trippe, B. L. et al. Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. In International Conference on Learning Representations Vol. 11 (ICLR, 2023).

  79. Lee, J. S., Kim, J. & Kim, P. M. Score-based generative modeling for de novo protein design. Nat. Comput. Sci. 3, 382–392 (2023).

    CAS  Google Scholar 

  80. Rajewsky, K. Clonal selection and learning in the antibody system. Nature 381, 751–758 (1996).

    CAS  PubMed  ADS  Google Scholar 

  81. Teng, G. & Papavasiliou, F. N. Immunoglobulin somatic hypermutation. Annu. Rev. Genet. 41, 107–120 (2007).

    CAS  PubMed  Google Scholar 

  82. Boder, E. T., Raeeszadeh-Sarmazdeh, M. & Price, J. V. Engineering antibodies by yeast display. Arch. Biochem. Biophys. 526, 99–106 (2012).

    CAS  PubMed  Google Scholar 

  83. Wellner, A. et al. Rapid generation of potent antibodies by autonomous hypermutation in yeast. Nat. Chem. Biol. 17, 1057–1064 (2021).

    CAS  PubMed  PubMed Central  Google Scholar 

  84. McMahon, C. et al. Yeast surface display platform for rapid discovery of conformationally selective nanobodies. Nat. Struct. Mol. Biol. 25, 289–296 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  85. Almagro, J. C., Pedraza-Escalona, M., Arrieta, H. I. & Pérez-Tapia, S. M. Phage display libraries for antibody therapeutic discovery and development. Antibodies 8, 44 (2019).

    CAS  PubMed  PubMed Central  Google Scholar 

  86. Ledsgaard, L. et al. Advances in antibody phage display technology. Drug Discov. Today 27, 2151–2169 (2022).

    CAS  PubMed  Google Scholar 

  87. Parkinson, J., Hard, R. & Wang, W. The RESP AI model accelerates the identification of tight-binding antibodies. Nat. Commun. 14, 454 (2023).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  88. Saka, K. et al. Antibody design using LSTM based deep generative model from phage display library for affinity maturation. Sci. Rep. 11, 5852 (2021).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  89. Mason, D. M. et al. Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning. Nat. Biomed. Eng. 5, 600–612 (2021).

    CAS  PubMed  Google Scholar 

  90. Makowski, E. K. et al. Co-optimization of therapeutic antibody affinity and specificity using machine learning models that generalize to novel mutational space. Nat. Commun. 13, 3788 (2022).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  91. Shanker, V. R., Bruun, T. U. J., Hie, B. L. & Kim, P. S. Inverse folding of protein complexes with a structure-informed language model enables unsupervised antibody evolution. Preprint at bioRxiv https://doi.org/10.1101/2023.12.19.572475 (2023).

  92. Shanehsazzadeh, A. et al. In vitro validated antibody design against multiple therapeutic antigens using generative inverse folding. In Generative AI and Biology (GenBio) Workshop, NeurIPS (2023).

  93. Olsen, T. H., Boyles, F. & Deane, C. M. Observed Antibody Space: a diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Sci. 31, 141–146 (2022).

    CAS  PubMed  Google Scholar 

  94. Weinstein, E. N. et al. Optimal design of stochastic DNA synthesis protocols based on generative sequence models. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (eds Camps-Valls, G., Ruiz, F. J. R. & Valera, I.) 7450–7482 (PMLR, 2022).

  95. Eguchi, R. R. et al. Deep generative design of epitope-specific binding proteins by latent conformation optimization. Preprint at bioRxiv https://doi.org/10.1101/2022.12.22.521698 (2022).

  96. Eguchi, R. R., Choe, C. A. & Huang, P.-S. Ig-VAE: generative modeling of protein structure by direct 3D coordinate generation. PLoS Comput. Biol. 18, e1010271 (2022).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  97. Shanehsazzadeh, A. et al. Unlocking de novo antibody design with generative artificial intelligence. Preprint at bioRxiv https://doi.org/10.1101/2023.01.08.523187 (2023).

  98. Gainza, P. et al. De novo design of protein interactions with learned surface fingerprints. Nature 617, 176–184 (2023).

  99. Mahajan, S. P., Ruffolo, J. A., Frick, R. & Gray, J. J. Hallucinating structure-conditioned antibody libraries for target-specific binders. Front. Immunol. 13, 999034 (2022).

    CAS  PubMed  PubMed Central  Google Scholar 

  100. Lisanza, S. L. et al. Joint generation of protein sequence and structure with RoseTTAFold sequence space diffusion. Preprint at bioRxiv https://doi.org/10.1101/2023.05.08.539766 (2023).

  101. Chu, A. E., Cheng, L., El Nesr, G., Xu, M. & Huang, P.-S. An all-atom protein generative model. Preprint at bioRxiv https://doi.org/10.1101/2023.05.24.542194 (2023).

  102. Krishna, R. et al. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Preprint at bioRxiv https://doi.org/10.1101/2023.10.09.561603 (2023).

  103. Krishna, M. & Nadler, S. G. Immunogenicity to biotherapeutics — the role of anti-drug immune complexes. Front. Immunol. 7, 21 (2016).

    PubMed  PubMed Central  Google Scholar 

  104. Chapman, A. M. & McNaughton, B. R. Scratching the surface: resurfacing proteins to endow new properties and function. Cell Chem. Biol. 23, 543–553 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  105. Remmel, J. L. et al. Combinatorial resurfacing of Dengue envelope protein domain III antigens selectively ablates epitopes associated with serotype-specific or infection-enhancing antibody responses. ACS Comb. Sci. 22, 446–456 (2020).

    CAS  PubMed  Google Scholar 

  106. Bootwala, A. et al. Protein re-surfacing of E. coli l-asparaginase to evade pre-existing anti-drug antibodies and hypersensitivity responses. Front. Immunol. 13, 1016179 (2022).

    PubMed  PubMed Central  Google Scholar 

  107. Ingraham, J., Garg, V., Barzilay, R. & Jaakkola, T. Generative models for graph-based protein design. In Advances in Neural Information Processing Systems Vol. 32 (2019).

  108. Thadani, N. N. et al. Learning from prepandemic data to forecast viral escape. Nature 622, 818–825 (2023).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  109. Singh, H. & Raghava, G. P. ProPred: prediction of HLA-DR binding sites. Bioinformatics 17, 1236–1237 (2001).

    CAS  PubMed  Google Scholar 

  110. Zhang, L. et al. TEPITOPEpan: extending TEPITOPE for peptide binding prediction covering over 700 HLA-DR molecules. PLoS ONE 7, e30483 (2012).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  111. Racle, J. et al. Robust prediction of HLA class II epitopes by deep motif deconvolution of immunopeptidomes. Nat. Biotechnol. 37, 1283–1286 (2019).

    CAS  PubMed  Google Scholar 

  112. Reynisson, B. et al. Improved prediction of MHC II antigen presentation through integration and motif deconvolution of mass spectrometry MHC eluted ligand data. J. Proteome Res. 19, 2304–2315 (2020).

    CAS  PubMed  Google Scholar 

  113. Racle, J. et al. Machine learning predictions of MHC-II specificities reveal alternative binding mode of class II epitopes. Immunity 56, 1359–1375 (2023).

    CAS  PubMed  Google Scholar 

  114. Peters, B., Nielsen, M. & Sette, A. T cell epitope predictions. Annu. Rev. Immunol. 38, 123–145 (2020).

    CAS  PubMed  Google Scholar 

  115. Bennett, N. et al. Improving de novo protein binder design with deep learning. Nat. Commun. 14, 2625 (2023).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  116. Glasscock, C. J. et al. Computational design of sequence-specific DNA-binding proteins. Preprint at bioRxiv https://doi.org/10.1101/2023.09.20.558720 (2023).

  117. Youssef, N. et al. Deep generative models predict SARS-CoV-2 spike infectivity and foreshadow neutralizing antibody escape. Preprint at bioRxiv https://doi.org/10.1101/2023.10.08.561389 (2023).

  118. Walls, A. C. et al. Elicitation of potent neutralizing antibody responses by designed protein nanoparticle vaccines for SARS-CoV-2. Cell 183, 1367–1382 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  119. Brouwer, P. J. M. et al. Two-component spike nanoparticle vaccine protects macaques from SARS-CoV-2 infection. Cell 184, 1188–1200 (2021).

    CAS  PubMed  PubMed Central  Google Scholar 

  120. Cohen, A. A. et al. Mosaic nanoparticles elicit cross-reactive immune responses to zoonotic coronaviruses in mice. Science 371, 735–741 (2021).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  121. Kang, Y.-F. et al. Rapid development of SARS-CoV-2 spike protein receptor-binding domain self-assembled nanoparticle vaccine candidates. ACS Nano 15, 2738–2752 (2021).

    CAS  Google Scholar 

  122. Nguyen, B. & Tolia, N. H. Protein-based antigen presentation platforms for nanoparticle vaccines. NPJ Vaccines 6, 70 (2021).

    CAS  PubMed  PubMed Central  Google Scholar 

  123. Karoyan, P. et al. Human ACE2 peptide-mimics block SARS-CoV-2 pulmonary cells infection. Commun. Biol. 4, 197 (2021).

    CAS  PubMed  PubMed Central  Google Scholar 

  124. Glasgow, A. et al. Engineered ACE2 receptor traps potently neutralize SARS-CoV-2. Proc. Natl Acad. Sci. USA 117, 28046–28055 (2020).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  125. Torchia, J. A. et al. Optimized ACE2 decoys neutralize antibody-resistant SARS-CoV-2 variants through functional receptor mimicry and treat infection in vivo. Sci. Adv. 8, eabq6527 (2022).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  126. Cao, L. et al. De novo design of picomolar SARS-CoV-2 miniprotein inhibitors. Science 370, 426–431 (2020).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  127. Hunt, A. C. et al. Multivalent designed proteins neutralize SARS-CoV-2 variants of concern and confer protection against infection in mice. Sci. Transl. Med. 14, eabn1252 (2022).

    CAS  PubMed  PubMed Central  Google Scholar 

  128. Zhang, J. Z. et al. Thermodynamically coupled biosensors for detecting neutralizing antibodies against SARS-CoV-2 variants. Nat. Biotechnol. 40, 1336–1340 (2022).

    CAS  PubMed  PubMed Central  Google Scholar 

  129. Leonard, A. C. & Whitehead, T. A. Design and engineering of genetically encoded protein biosensors for small molecules. Curr. Opin. Biotechnol. 78, 102787 (2022).

    CAS  PubMed  Google Scholar 

  130. Quijano-Rubio, A. et al. De novo design of modular and tunable protein biosensors. Nature 591, 482–487 (2021).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  131. Langan, R. A. et al. De novo design of bioactive protein switches. Nature 572, 205–210 (2019).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  132. Ng, A. H. et al. Modular and tunable biological feedback control using a de novo protein switch. Nature 572, 265–269 (2019).

    CAS  PubMed  ADS  Google Scholar 

  133. Lee, G. R. et al. Small-molecule binding and sensing with a designed protein family. Preprint at bioRxiv https://doi.org/10.1101/2023.11.01.565201 (2023).

  134. Courbet, A. et al. Computational design of mechanically coupled axle-rotor protein assemblies. Science 376, 383–390 (2022).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  135. Huang, G., Willems, K., Soskine, M., Wloka, C. & Maglia, G. Electro-osmotic capture and ionic discrimination of peptide and protein biomarkers with FraC nanopores. Nat. Commun. 8, 935 (2017).

    PubMed  PubMed Central  ADS  Google Scholar 

  136. Zhang, S. et al. Bottom–up fabrication of a proteasome–nanopore that unravels and processes single proteins. Nat. Chem. 13, 1192–1199 (2021).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  137. Shimizu, K. et al. De novo design of a nanopore for single-molecule detection that incorporates a β-hairpin peptide. Nat. Nanotechnol. 17, 67–75 (2022).

    CAS  PubMed  ADS  Google Scholar 

  138. Alfaro, J. A. et al. The emerging landscape of single-molecule protein sequencing technologies. Nat. Methods 18, 604–617 (2021).

    CAS  PubMed  PubMed Central  Google Scholar 

  139. Berhanu, S. et al. Sculpting conducting nanopore size and shape through de novo protein design. Preprint at bioRxiv https://doi.org/10.1101/2023.12.20.572500 (2023).

  140. Xu, C. et al. Computational design of transmembrane pores. Nature 585, 129–134 (2020).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  141. Hesslow, D., Zanichelli, N., Notin, P., Poli, I. & Marks, D. RITA: a study on scaling up generative protein sequence models. Workshop on Computational Biology, ICML (2022).

  142. Hoffmann, J. et al. Training compute-optimal large language models. Adv. Neural Inf. Process. Syst. 35, 30016–30030 (2022).

    Google Scholar 

  143. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

    CAS  PubMed  PubMed Central  ADS  Google Scholar 

  144. Notin, P. et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In Proceedings of the 39th International Conference on Machine Learning 16990–17017 (PMLR, 2022).

  145. Notin, P. et al. TranceptEVE: Combining family-specific and family-agnostic models of protein sequences for improved fitness prediction. Learning Meaningful Representations of Life Workshop, NeurIPS (2022).

  146. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).

    MathSciNet  CAS  PubMed  ADS  Google Scholar 

  147. Kanehisa, M. Enzyme annotation and metabolic reconstruction using KEGG. Methods Mol. Biol. 1611, 135–145 (2017).

    CAS  PubMed  Google Scholar 

  148. Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).

    CAS  PubMed  Google Scholar 

  149. Bairoch, A. The ENZYME database in 2000. Nucleic Acids Res. 28, 304–305 (2000).

    CAS  PubMed  PubMed Central  Google Scholar 

  150. Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).

    CAS  PubMed  PubMed Central  Google Scholar 

  151. Nikam, R., Kulandaisamy, A., Harini, K., Sharma, D. & Gromiha, M. M. ProThermDB: thermodynamic database for proteins and mutants revisited after 15 years. Nucleic Acids Res. 49, D420–D424 (2021).

    CAS  PubMed  Google Scholar 

  152. Rubin, A. F. et al. MaveDB v2: a curated community database with over three million variant effects from multiplexed functional assays. Preprint at bioRxiv https://doi.org/10.1101/2021.11.29.470445 (2021).

  153. Munsamy, G., Lindner, S., Lorenz, P. & Ferruz, N. ZymCTRL: a conditional language model for the controllable generation of artificial enzymes. In Machine Learning for Structural Biology Workshop, NeurIPS (2022).

  154. Born, J. & Manica, M. Regression Transformer: concurrent sequence regression and generation for molecular language modeling. Nat. Mach. Intell. 5, 432–444 (2023).

    Google Scholar 

  155. Notin, P., Weitzman, R., Marks, D. S. & Gal, Y. ProteinNPT: improving protein property prediction and design with non-parametric transformers. In Advances in Neural Information Processing Systems Vol. 36 (2023).

  156. Bran, A. M., Cox, S., White, A. D. & Schwaller, P. ChemCrow: augmenting large-language models with chemistry tools. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.05376 (2023).

  157. Liu, S. et al. A text-guided protein design framework. Preprint at arXiv https://doi.org/10.48550/arXiv.2302.04611 (2023).

  158. Hie, B. et al. A high-level programming language for generative protein design. Preprint at bioRxiv https://doi.org/10.1101/2022.12.21.521526 (2022).

  159. Dallago, C. et al. FLIP: benchmark tasks in fitness landscape inference for proteins. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (2021).

  160. Pak, M. A. et al. Using AlphaFold to predict the impact of single mutations on protein stability and function. PLoS ONE 18, e0282689 (2023).

  161. AlphaFold Protein Structure Database. Frequently asked questions. AlphaFold Protein Structure Database https://alphafold.ebi.ac.uk/faq (2022).

  162. Johnson, S. R. et al. Computational scoring and experimental evaluation of enzymes generated by neural networks. Preprint at bioRxiv https://doi.org/10.1101/2023.03.04.531015 (2023).

  163. Tagasovska, N. et al. A Pareto-optimal compositional energy-based model for sampling and optimization of protein sequences. Preprint at arXiv https://doi.org/10.48550/arXiv.2210.10838 (2022).

  164. Zheng, Z. et al. Structure-informed language models are protein designers. In International Conference on Machine Learning Vol. 40 (PMLR, 2023).

  165. Su, J. et al. SaProt: protein language modeling with structure-aware vocabulary. Preprint at bioRxiv https://doi.org/10.1101/2023.10.01.560349 (2023).

  166. Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning 8748–8763 (PMLR, 2021).

  167. Xu, M., Yuan, X., Miret, S. & Tang, J. ProtST: multi-modality learning of protein sequences and biomedical texts. In International Conference on Machine Learning Vol. 40 (PMLR, 2023).

  168. Malbranke, C., Bikard, D., Cocco, S., Monasson, R. & Tubiana, J. Machine learning for evolutionary-based and physics-inspired protein design: current and future synergies. Curr. Opin. Struct. Biol. 80, 102571 (2023).

  169. Frey, N. C. et al. Protein discovery with discrete walk–jump sampling. Preprint at arXiv https://doi.org/10.48550/arXiv.2306.12360 (2023).

  170. Darmawan, J. T., Gal, Y. & Notin, P. Sampling protein language models for functional protein design. In Generative AI and Biology (GenBio) Workshop, NeurIPS (2023).

  171. Kirjner, A. et al. Optimizing protein fitness using Gibbs sampling with graph-based smoothing. Preprint at arXiv https://doi.org/10.48550/arXiv.2307.00494 (2023).

  172. Rapp, J. T., Bremer, B. J. & Romero, P. A. Self-driving laboratories to autonomously navigate the protein fitness landscape. Nat. Chem. Eng. 1, 97–107 (2024).

  173. Yu, T., Boob, A. G., Singh, N., Su, Y. & Zhao, H. In vitro continuous protein evolution empowered by machine learning and automation. Cell Syst. 14, 633–644 (2023).

  174. Morcos, F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl Acad. Sci. USA 108, E1293–E1301 (2011).

  175. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).

  176. Yang, K. K., Fusi, N. & Lu, A. X. Convolutions are competitive with transformers for protein sequence pretraining. Preprint at bioRxiv https://doi.org/10.1101/2022.05.19.492714 (2023).

  177. Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).

  178. Elnaggar, A. et al. Ankh: optimized protein language model unlocks general-purpose modelling. Preprint at arXiv https://doi.org/10.48550/arXiv.2301.06568 (2023).

  179. Rao, R. M. et al. MSA Transformer. In Proceedings of the 38th International Conference on Machine Learning 8844–8856 (PMLR, 2021).

  180. Truong, T. F. Jr. & Bepler, T. PoET: a generative model of protein families as sequences-of-sequences. In Advances in Neural Information Processing Systems Vol. 36 (2023).

  181. Alamdari, S. et al. Protein generation with evolutionary diffusion: sequence is all you need. Preprint at bioRxiv https://doi.org/10.1101/2023.09.11.556673 (2023).

  182. Brandes, N., Ofer, D., Peleg, Y., Rappoport, N. & Linial, M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 38, 2102–2110 (2022).

  183. Bryant, D. H. et al. Deep diversification of an AAV capsid protein by machine learning. Nat. Biotechnol. 39, 691–696 (2021).

  184. Zhu, D. et al. Optimal trade-off control in machine learning-based library design, with application to adeno-associated virus (AAV) for gene therapy. Sci. Adv. 10, eadj3786 (2024).

  185. Heinzinger, M. et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 20, 723 (2019).

  186. Stärk, H., Dallago, C., Heinzinger, M. & Rost, B. Light attention predicts protein location from the language of life. Bioinform. Adv. 1, vbab035 (2021).

  187. Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).

  188. Gruver, N. et al. Protein design with guided discrete diffusion. In Advances in Neural Information Processing Systems Vol. 36 (2023).

  189. Blaabjerg, L. M. et al. Rapid protein stability prediction using deep learning representations. eLife 12, e82593 (2023).

  190. Baek, M. Efficient and accurate prediction of protein structures and interactions using RoseTTAFold. Acta Crystallogr. A Found. Adv. 78, a235 (2022).

  191. Wu, R. et al. High-resolution de novo structure prediction from primary sequence. Preprint at bioRxiv https://doi.org/10.1101/2022.07.21.500999 (2022).

  192. Anand, N., Eguchi, R. & Huang, P.-S. Fully differentiable full-atom protein backbone generation. In Deep Generative Models for Highly Structured Data Workshop, ICLR (2019).

  193. Wu, K. E. et al. Protein structure generation via folding diffusion. Nat. Commun. 15, 1059 (2024).

  194. Jing, B., Eismann, S., Suriana, P., Townshend, R. J. L. & Dror, R. Learning from protein structure with geometric vector perceptrons. In International Conference on Learning Representations Vol. 9 (2021).

  195. Gao, Z., Tan, C., Chacón, P. & Li, S. Z. PiFold: toward effective and efficient protein inverse folding. In International Conference on Learning Representations Vol. 11 (2023).

  196. Defferrard, M., Bresson, X. & Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems Vol. 29 (2016).

  197. Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations Vol. 5 (2017).

  198. Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A. & Vandergheynst, P. Geometric deep learning: going beyond Euclidean data. IEEE Signal Process. Mag. 34, 18–42 (2017).

  199. Veličković, P. et al. Graph attention networks. In International Conference on Learning Representations Vol. 6 (2018).

  200. Wicky, B. I. M. et al. Hallucinating symmetric protein assemblies. Science 378, 56–61 (2022).

  201. Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).

  202. Castro, E. et al. Transformer-based protein generation with regularized latent space optimization. Nat. Mach. Intell. 4, 840–851 (2022).

  203. Notin, P., Hernández-Lobato, J. M. & Gal, Y. Improving black-box optimization in VAE latent space using decoder uncertainty. Adv. Neural Inf. Process. Syst. 34, 802–814 (2021).

Acknowledgements

We thank members of the Marks lab for valuable discussions. P.N. was supported by GSK, the UK Engineering and Physical Sciences Research Council (EPSRC ICASE award no. 18000077) and a Chan Zuckerberg Initiative Award (Neurodegeneration Challenge Network, CZI2018-191853). Y.G. holds a Turing AI Fellowship (Phase 1) at the Alan Turing Institute, which is supported by EPSRC grant reference V030302/1. C.S. is supported by the National Resource for Network Biology (NRNB, P41GM103504). D.S.M. holds a Ben Barres Early Career Award from the Chan Zuckerberg Initiative as part of the Neurodegeneration Challenge Network (CZI2018-191853) and is supported by an NIH Transformational Research Award (TR01 1R01CA260415).

Author information

Corresponding authors

Correspondence to Pascal Notin, Nathan Rollins or Debora Marks.

Ethics declarations

Competing interests

D.M. is an advisor for Dyno Therapeutics, Octant, Jura Bio, Tectonic Therapeutic and Genentech and a cofounder of Seismic. N.R. is employed by Seismic. C.S. is on the scientific advisory board of CytoReason. The other authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Notin, P., Rollins, N., Gal, Y. et al. Machine learning for functional protein design. Nat Biotechnol 42, 216–228 (2024). https://doi.org/10.1038/s41587-024-02127-0

  • DOI: https://doi.org/10.1038/s41587-024-02127-0
