
Single-sequence protein structure prediction using a language model and deep learning

A Publisher Correction to this article was published on 17 October 2022


Abstract

AlphaFold2 and related computational systems predict protein structure using deep learning and co-evolutionary relationships encoded in multiple sequence alignments (MSAs). Despite high prediction accuracy achieved by these systems, challenges remain in (1) prediction of orphan and rapidly evolving proteins for which an MSA cannot be generated; (2) rapid exploration of designed structures; and (3) understanding the rules governing spontaneous polypeptide folding in solution. Here we report development of an end-to-end differentiable recurrent geometric network (RGN) that uses a protein language model (AminoBERT) to learn latent structural information from unaligned proteins. A linked geometric module compactly represents Cα backbone geometry in a translationally and rotationally invariant way. On average, RGN2 outperforms AlphaFold2 and RoseTTAFold on orphan proteins and classes of designed proteins while achieving up to a 10⁶-fold reduction in compute time. These findings demonstrate the practical and theoretical strengths of protein language models relative to MSAs in structure prediction.
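The translationally and rotationally invariant Cα representation mentioned in the abstract can be illustrated with a discrete Frenet-frame construction: describe the backbone by inter-residue bond angles and torsions (which do not change under global rotation or translation) and rebuild Cartesian coordinates by extending a local frame one residue at a time. The sketch below is illustrative only, not RGN2's actual code; the 3.8 Å virtual Cα–Cα bond length and the function name are assumptions made for the example.

```python
import numpy as np

CA_CA = 3.8  # approximate consecutive C-alpha distance in angstroms (assumed constant here)

def extend_ca_trace(angles, torsions):
    """Rebuild C-alpha coordinates from invariant internal coordinates.

    `angles` holds the virtual bond angle at each interior residue and
    `torsions` the torsion about each preceding virtual bond; both are
    invariant to global rotation and translation of the chain.
    """
    # Seed the chain with three points that define an initial local frame.
    coords = [np.array([0.0, 0.0, 0.0]),
              np.array([CA_CA, 0.0, 0.0]),
              np.array([CA_CA + CA_CA * np.cos(np.pi - angles[0]),
                        CA_CA * np.sin(np.pi - angles[0]), 0.0])]
    for theta, phi in zip(angles[1:], torsions):
        a, b, c = coords[-3], coords[-2], coords[-1]
        # Local orthonormal (discrete Frenet) frame at the chain tip.
        bc = (c - b) / np.linalg.norm(c - b)
        n = np.cross(b - a, bc)
        n /= np.linalg.norm(n)
        m = np.stack([bc, np.cross(n, bc), n], axis=1)
        # Place the next C-alpha via spherical coordinates in that frame.
        d = np.array([-CA_CA * np.cos(theta),
                      CA_CA * np.sin(theta) * np.cos(phi),
                      CA_CA * np.sin(theta) * np.sin(phi)])
        coords.append(c + m @ d)
    return np.stack(coords)
```

Because only angles and torsions enter the construction, rotating or translating the seed frame changes the output coordinates but not the internal geometry, which is the invariance property the geometric module exploits.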


Fig. 1: Organization and application of RGN2.
Fig. 2: Prediction performance on orphan proteins.
Fig. 3: Comparing RGN2 and AF2 structure predictions for orphan proteins.
Fig. 4: Prediction performance on designed proteins.
Fig. 5: Comparing RGN2 and AF2 structure predictions for designed proteins.


Data availability

The AminoBERT module was trained using the UniParc sequence database (https://www.uniprot.org/help/uniparc). Homologous sequence searches to determine orphan sequences were performed across UniRef90 (https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/), PDB70 (http://prodata.swmed.edu/procain/info/database.html) and MGnify (https://www.ebi.ac.uk/metagenomics/) metagenomic sequence alignment datasets. The six PDB structures discussed in detail in the article (5FKP, 2KWZ, 6E5N, 2L96, 5UP5 and 7KBQ) were all sourced from the Protein Data Bank.
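The six PDB entries listed above can be retrieved programmatically from the Protein Data Bank. The helper below is an illustrative sketch (not part of the article's code) built on RCSB's public file-download endpoint; the function names are assumptions made for the example.

```python
import urllib.request

RCSB_DOWNLOAD = "https://files.rcsb.org/download"  # RCSB's public file endpoint

def rcsb_url(pdb_id: str) -> str:
    """Build the download URL for a PDB entry (IDs are case-insensitive)."""
    return f"{RCSB_DOWNLOAD}/{pdb_id.upper()}.pdb"

def fetch_pdb(pdb_id: str, dest: str) -> str:
    """Download one structure file and return the local path."""
    urllib.request.urlretrieve(rcsb_url(pdb_id), dest)
    return dest

# The six entries discussed in detail in the article:
PDB_IDS = ["5FKP", "2KWZ", "6E5N", "2L96", "5UP5", "7KBQ"]
```

Usage would be, for example, `fetch_pdb("5FKP", "5FKP.pdb")` for each entry in `PDB_IDS`.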

Code availability

RGN2 is freely available as a standalone tool from https://github.com/aqlaboratory/rgn2. Users can also make structure predictions through a Python3-based web interface by uploading a protein sequence in FASTA format (https://colab.research.google.com/github/aqlaboratory/rgn2/blob/master/rgn2_prediction.ipynb).
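The prediction notebook expects input in FASTA format. A minimal stand-alone parser for preparing such input (illustrative only, not part of the RGN2 codebase) could look like this:

```python
from pathlib import Path

def read_fasta(path):
    """Parse a single- or multi-record FASTA file into {header: sequence}."""
    records, header, chunks = {}, None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())  # sequence may wrap over several lines
    if header is not None:
        records[header] = "".join(chunks)
    return records
```

For example, a file containing `>seq1` followed by wrapped sequence lines parses to a single entry keyed by `seq1` with the lines concatenated.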



Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation for the donation of GPUs used for this research. This work is supported by DARPA PANACEA program grant HR0011-19-2-0022 and National Cancer Institute grant U54-CA225088 to P.K.S. We also acknowledge support from the TensorFlow Research Cloud for graciously providing the TPU resources used for training AminoBERT.

Author information


Contributions

R.C., N.B., S.B. and M.A. conceived of and designed the study. R.C. and C.F. developed the refinement module. R.C., C.F., A.K. and K.R. performed the analyses. N.B. developed the geometry module and trained RGN2 models. S.B. developed and trained the AminoBERT protein language model and helped integrate its embeddings within RGN2. C.R. trained several RGN2 models and performed RF predictions. C.F. prepared the docker image and helped package the standalone software along with a Python-based user interface (notebook) for generating RGN2 predictions. G.A. performed MSAs to identify orphans. J.Z. helped C.F. in preparation of the RGN2 prediction notebook. P.K.S. and G.M.C. supervised the research and provided funding. R.C., N.B., S.B., M.A. and P.K.S. wrote the manuscript, and all authors discussed the results and edited the final version.

Corresponding authors

Correspondence to Nazim Bouatta, Peter K. Sorger or Mohammed AlQuraishi.

Ethics declarations

Competing interests

M.A. is a member of the Scientific Advisory Board of FL2021-002, a Foresite Labs company, and consults for Interline Therapeutics. P.K.S. is a member of the Scientific Advisory Board or Board of Directors of Glencoe Software, Applied Biomath, RareCyte and NanoString and is an advisor to Merck and Montai Health. A full list of G.M.C.’s tech transfer, advisory roles and funding sources can be found on the lab’s website: http://arep.med.harvard.edu/gmc/tech.html. S.B. is employed by and holds equity in Nabla Bio, Inc. The remaining authors declare no competing interests.


Peer review information

Nature Biotechnology thanks James Fraser and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chowdhury, R., Bouatta, N., Biswas, S. et al. Single-sequence protein structure prediction using a language model and deep learning. Nat Biotechnol 40, 1617–1623 (2022). https://doi.org/10.1038/s41587-022-01432-w

