Date: 2022-10-03

Single-sequence protein structure prediction using a language model and deep learning

Nature Biotechnology (2022)

Abstract

AlphaFold2 and related computational systems predict protein structure using deep learning and co-evolutionary relationships encoded in multiple sequence alignments (MSAs). Despite high prediction accuracy achieved by these systems, challenges remain in (1) prediction of orphan and rapidly evolving proteins for which an MSA cannot be generated; (2) rapid exploration of designed structures; and (3) understanding the rules governing spontaneous polypeptide folding in solution. Here we report development of an end-to-end differentiable recurrent geometric network (RGN) that uses a protein language model (AminoBERT) to learn latent structural information from unaligned proteins. A linked geometric module compactly represents Cα backbone geometry in a translationally and rotationally invariant way. On average, RGN2 outperforms AlphaFold2 and RoseTTAFold on orphan proteins and classes of designed proteins while achieving up to a 10^6-fold reduction in compute time. These findings demonstrate the practical and theoretical strengths of protein language models relative to MSAs in structure prediction.
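To make the phrase "translationally and rotationally invariant" concrete, the following minimal sketch describes a Cα trace by internal coordinates (virtual bond lengths, bond angles and torsion angles) and checks that they do not change when the whole chain is rotated and shifted. It is an illustration only and is not RGN2's geometric module; the function names, the toy coordinates and the torsion sign convention are all assumptions made for this example.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def internal_coordinates(ca):
    """Bond lengths, bond angles and torsions of a C-alpha trace (list of [x, y, z])."""
    bonds = [sub(ca[i + 1], ca[i]) for i in range(len(ca) - 1)]
    lengths = [math.sqrt(dot(b, b)) for b in bonds]
    units = [[c / l for c in b] for b, l in zip(bonds, lengths)]
    # Angle at each interior atom between the two virtual bonds meeting there.
    angles = [math.acos(max(-1.0, min(1.0, -dot(u, v))))
              for u, v in zip(units, units[1:])]
    # Signed torsion about each interior bond (sign convention is a choice).
    torsions = []
    for u, v, w in zip(units, units[1:], units[2:]):
        n1, n2 = cross(u, v), cross(v, w)
        torsions.append(math.atan2(dot(cross(n1, v), n2), dot(n1, n2)))
    return lengths, angles, torsions

def rigid_transform(p, yaw, pitch, shift):
    """Rotate p about z by yaw, then about x by pitch, then translate by shift."""
    x, y, z = p
    x, y = (x * math.cos(yaw) - y * math.sin(yaw),
            x * math.sin(yaw) + y * math.cos(yaw))
    y, z = (y * math.cos(pitch) - z * math.sin(pitch),
            y * math.sin(pitch) + z * math.cos(pitch))
    return [x + shift[0], y + shift[1], z + shift[2]]

# Made-up coordinates (in angstroms) for five consecutive C-alpha atoms.
ca = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [5.0, 3.6, 0.0],
      [8.0, 4.5, 2.1], [9.1, 7.9, 3.5]]
moved = [rigid_transform(p, 0.7, -1.2, [10.0, -4.0, 2.5]) for p in ca]

before = internal_coordinates(ca)
after = internal_coordinates(moved)
unchanged = all(abs(x - y) < 1e-9
                for xs, ys in zip(before, after)
                for x, y in zip(xs, ys))
print("internal coordinates unchanged by rigid motion:", unchanged)
```

Raw Cartesian coordinates would differ between `ca` and `moved`, whereas the internal coordinates agree to numerical precision, which is the property the abstract refers to.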

Fig. 1: Organization and application of RGN2.
Fig. 2: Prediction performance on orphan proteins.
Fig. 3: Comparing RGN2 and AF2 structure predictions for orphan proteins.
Fig. 4: Prediction performance on designed proteins.
Fig. 5: Comparing RGN2 and AF2 structure predictions for designed proteins.

Data availability

The AminoBERT module was trained using the UniParc sequence database (https://www.uniprot.org/help/uniparc). Homologous sequence searches to determine orphan sequences were performed across UniRef90 (https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/), PDB70 (http://prodata.swmed.edu/procain/info/database.html) and MGnify (https://www.ebi.ac.uk/metagenomics/) metagenomic sequence alignment datasets. The six PDB structures discussed in detail in the article (5FKP, 2KWZ, 6E5N, 2L96, 5UP5 and 7KBQ) were all sourced from the Protein Data Bank.

Code availability

RGN2 is available freely as a standalone tool from https://github.com/aqlaboratory/rgn2. Users can make structure predictions using a Python3-based web user interface by uploading the protein sequence in FASTA format (https://colab.research.google.com/github/aqlaboratory/rgn2/blob/master/rgn2_prediction.ipynb).
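As a minimal sketch of preparing input for upload, the snippet below writes a single sequence as a standard FASTA record (a `>` header line followed by the sequence wrapped to fixed-width lines). The helper name `to_fasta`, the record name, the 60-character line width and the example sequence are assumptions for illustration; any header or length requirements specific to the RGN2 notebook are not stated here and should be checked in the repository.

```python
def to_fasta(name, sequence, width=60):
    """Return one FASTA record as text: '>name' then the wrapped sequence."""
    # Drop whitespace and normalise to upper-case one-letter codes.
    seq = "".join(sequence.split()).upper()
    lines = [f">{name}"]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"

# Hypothetical record name and made-up sequence, for illustration only.
record = to_fasta("example_protein", "mktayiak qrqisfvk shfsrqle")
with open("example_protein.fasta", "w") as handle:
    handle.write(record)
```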

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation for the donation of GPUs used for this research. This work is supported by DARPA PANACEA program grant HR0011-19-2-0022 and National Cancer Institute grant U54-CA225088 to P.K.S. We also acknowledge support from the TensorFlow Research Cloud for graciously providing the TPU resources used for training AminoBERT.

These authors contributed equally: Ratul Chowdhury, Nazim Bouatta, Surojit Biswas, Christina Floristean.

  1. Laboratory of Systems Pharmacology, Program in Therapeutic Science, Harvard Medical School, Boston, MA, USA

    Ratul Chowdhury, Nazim Bouatta, George M. Church & Peter K. Sorger

  2. Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA

    Surojit Biswas & George M. Church

  3. Nabla Bio, Inc., Boston, MA, USA

    Surojit Biswas

  4. Department of Computer Science, Columbia University, New York, NY, USA

    Christina Floristean, Anant Kharkar, Koushik Roy, Joanna Zhang & Mohammed AlQuraishi

  5. Integrated Program in Cellular, Molecular, and Biomedical Studies, Columbia University, New York, NY, USA

    Charlotte Rochereau

  6. Department of Systems Biology, Columbia University, New York, NY, USA

    Gustaf Ahdritz & Mohammed AlQuraishi

  7. Department of Systems Biology, Harvard Medical School, Boston, MA, USA

    Peter K. Sorger

R.C., N.B., S.B. and M.A. conceived of and designed the study. R.C. and C.F. developed the refinement module. R.C., C.F., A.K. and K.R. performed the analyses. N.B. developed the geometry module and trained RGN2 models. S.B. developed and trained the AminoBERT protein language model and helped integrate its embeddings within RGN2. C.R. trained several RGN2 models and performed RF predictions. C.F. prepared the docker image and helped package the standalone software along with a Python-based user interface (notebook) for generating RGN2 predictions. G.A. performed MSAs to identify orphans. J.Z. helped C.F. in preparation of the RGN2 prediction notebook. P.K.S. and G.M.C. supervised the research and provided funding. R.C., N.B., S.B., M.A. and P.K.S. wrote the manuscript, and all authors discussed the results and edited the final version.

Correspondence to Nazim Bouatta, Peter K. Sorger or Mohammed AlQuraishi.

Competing interests

M.A. is a member of the Scientific Advisory Board of FL2021-002, a Foresite Labs company, and consults for Interline Therapeutics. P.K.S. is a member of the Scientific Advisory Board or Board of Directors of Glencoe Software, Applied Biomath, RareCyte and NanoString and is an advisor to Merck and Montai Health. A full list of G.M.C.’s tech transfer, advisory roles and funding sources can be found on the lab’s website: http://arep.med.harvard.edu/gmc/tech.html. S.B. is employed by and holds equity in Nabla Bio, Inc. The remaining authors declare no competing interests.

Peer review information

Nature Biotechnology thanks James Fraser and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Chowdhury, R., Bouatta, N., Biswas, S. et al. Single-sequence protein structure prediction using a language model and deep learning. Nat Biotechnol (2022). https://doi.org/10.1038/s41587-022-01432-w

