Single-sequence protein structure prediction using supervised transformer protein language models

A preprint version of the article is available at bioRxiv.

Abstract

Significant progress has been made in protein structure prediction in recent years. However, it remains challenging for AlphaFold2 and other deep learning-based methods to predict protein structure with single-sequence input. Here we introduce trRosettaX-Single, an automated algorithm for single-sequence protein structure prediction. It incorporates the sequence embedding from a supervised transformer protein language model into a multi-scale network enhanced by knowledge distillation to predict inter-residue two-dimensional geometry, which is then used to reconstruct three-dimensional structures via energy minimization. Benchmark tests show that trRosettaX-Single outperforms AlphaFold2 and RoseTTAFold on orphan proteins and works well on human-designed proteins (with an average template modeling score (TM-score) of 0.79). A runtime test shows that the full trRosettaX-Single pipeline is about twice as fast as AlphaFold2 while using far fewer computing resources (<10% of those required by AlphaFold2). On 2,000 designed proteins from network hallucination, trRosettaX-Single generates structure models with high confidence. As a demonstration, trRosettaX-Single is applied to missense mutation analysis. These data suggest that trRosettaX-Single may find applications in protein design and related studies.
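
The pipeline summarized above (language-model embedding of the single sequence, a multi-scale two-dimensional network trained with knowledge distillation, and energy-minimization-based reconstruction) can be illustrated with the minimal Python sketch below. All function names, the embedding width and the distance binning are placeholders chosen for illustration; this is not the actual trRosettaX-Single code or API.

```python
# Minimal, hypothetical sketch of the single-sequence data flow described
# in the abstract: sequence -> language-model embedding -> predicted
# inter-residue 2D geometry -> 3D model. The stubs below only mimic the
# shapes of the data; they are not the trRosettaX-Single implementation.
import numpy as np

def embed_sequence(seq: str) -> np.ndarray:
    """Stand-in for the supervised transformer protein language model:
    one embedding vector per residue (random placeholders here)."""
    d_model = 768  # assumed embedding width
    return np.random.rand(len(seq), d_model)

def predict_2d_geometry(emb: np.ndarray) -> dict:
    """Stand-in for the multi-scale 2D network: builds pairwise features
    from the per-residue embeddings and predicts distance distributions."""
    L = emb.shape[0]
    pair = np.concatenate(
        [np.repeat(emb[:, None, :], L, axis=1),
         np.repeat(emb[None, :, :], L, axis=0)],
        axis=-1)  # (L, L, 2 * d_model) pairwise features
    n_bins = 37  # assumed number of distance bins
    dist = np.full((L, L, n_bins), 1.0 / n_bins)  # placeholder distributions
    return {"distance": dist, "pair_features": pair}

def fold_from_geometry(geometry: dict) -> str:
    """Stand-in for restraint-based energy minimization (the paper cites
    PyRosetta); here it only reports the size of the predicted geometry."""
    L = geometry["distance"].shape[0]
    return f"3D model reconstructed from {L}x{L} predicted 2D geometry"

toy_sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example
print(fold_from_geometry(predict_2d_geometry(embed_sequence(toy_sequence))))
```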


Fig. 1: The architecture and performance of trRosettaX-Single.
Fig. 2: Comparison between trRosettaX-Single, AlphaFold2 and RoseTTAFold on two example proteins.
Fig. 3: Comparison between trRosettaX-Single and SPOT-Contact-LM.
Fig. 4: Application to hallucinated proteins.
Fig. 5: Mutation analysis on human-designed proteins and three deep mutational scanning datasets.
Fig. 6: Ablation study and estimation of model accuracy.


Data availability

The data supporting the findings and conclusions of this study are available in this paper and its Supplementary Information. All of the training and test data used in this work are available at Zenodo (ref. 39) and on our website (https://yanglab.nankai.edu.cn/trRosetta/benchmark_single/). The experimental 3D structures can be downloaded from the PDB (https://www.rcsb.org/). The Orphan25 dataset includes 25 natural proteins that were published after May 2020 and have no sequence homologs in UniRef50_2018_03 according to an MMseqs2 search at an e-value cut-off of 0.05. The Design55 dataset includes 55 human-designed proteins, between 50 and 300 amino acids in length, that have no sequence homologs in UniRef50_2018_03. We removed designed proteins with simple topologies (for example, a single alpha helix) or with hits in the training sets at an e-value cut-off of 0.1 by PSI-BLAST. Source data for Figs. 1b,c and 2–6 are provided with this paper.
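
As an illustration of the homolog filters described above, the sketch below keeps only query proteins with no hit below a given e-value cut-off in a tabular search result (an MMseqs2 search against UniRef50_2018_03 at 0.05 for Orphan25; PSI-BLAST against the training sets at 0.1 for Design55). The file names and helper functions are hypothetical, and the sketch assumes results in the standard 12-column BLAST tabular (m8/outfmt 6) format with the e-value in the eleventh column; it is not the authors' script.

```python
# Hypothetical sketch of the dataset filtering criterion: retain proteins
# with no sequence homolog at the chosen e-value cut-off, given a tabular
# (BLAST m8 / outfmt 6) search result where column 11 is the e-value.
import csv
from typing import Iterable, List, Set

def queries_with_hits(m8_path: str, evalue_cutoff: float) -> Set[str]:
    """Query IDs that have at least one hit with e-value <= cutoff."""
    hits: Set[str] = set()
    with open(m8_path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            query, evalue = row[0], float(row[10])
            if evalue <= evalue_cutoff:
                hits.add(query)
    return hits

def keep_non_homologous(candidate_ids: Iterable[str],
                        m8_path: str,
                        evalue_cutoff: float = 0.05) -> List[str]:
    """Candidates with no homolog at the cut-off (0.05 for Orphan25;
    0.1 was used with PSI-BLAST against the training sets)."""
    return sorted(set(candidate_ids) - queries_with_hits(m8_path, evalue_cutoff))

# Example usage with hypothetical inputs:
# orphan_ids = keep_non_homologous(candidate_ids, "uniref50_search.m8", 0.05)
```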

Code availability

The source code is available at Zenodo (ref. 39) and on our website (https://yanglab.nankai.edu.cn/trRosetta/benchmark_single/).

References

  1. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).


  2. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).


  3. Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).


  4. Su, H. et al. Improved protein structure prediction using a new multi-scale network and homologous templates. Adv. Sci. 8, 2102592 (2021).


  5. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).


  6. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).


  7. Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).


  8. Madani, A. et al. ProGen: language modeling for protein generation. Preprint at bioRxiv https://doi.org/10.1101/2020.03.07.982272 (2020).

  9. Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).


  10. Rao, R., Meier, J., Sercu, T., Ovchinnikov, S. & Rives, A. Transformer protein language models are unsupervised structure learners. in International Conference on Learning Representations 2021 (OpenReview.net, 2021).

  11. Vaswani, A. et al. Attention is all you need. in Proc. 31st International Conference on Neural Information Processing Systems 6000–6010 (Curran Associates, 2017).

  12. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. in Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics 4171–4186 (Association for Computational Linguistics, 2019).

  13. Chen, M. C., Li, Y., Zhu, Y. H., Ge, F. & Yu, D. J. SSCpred: single-sequence-based protein contact prediction using deep fully convolutional network. J. Chem. Inf. Model. 60, 3295–3303 (2020).


  14. Singh, J., Litfin, T., Singh, J., Paliwal, K. & Zhou, Y. SPOT-Contact-LM: improving single-sequence-based prediction of protein contact map using a transformer language model. Bioinformatics 38, 1888–1894 (2022).


  15. Chowdhury, R. et al. Single-sequence protein structure prediction using language models from deep learning. Nat. Biotechnol. 40, 1617–1623 (2022).

  16. Du, Z., Peng, Z. & Yang, J. Toward the assessment of predicted inter-residue distance. Bioinformatics 38, 962–969 (2022).


  17. Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins Struct. Funct. Bioinf. 57, 702–710 (2004).


  18. Söding, J. Protein homology detection by HMM–HMM comparison. Bioinformatics 21, 951–960 (2005).


  19. Xu, J., McPartlon, M. & Li, J. Improved protein structure prediction by deep learning irrespective of co-evolution information. Nat. Mach. Intell. 3, 601–609 (2021).


  20. Graves, A. & Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18, 602–610 (2005).

  21. Zemla, A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 31, 3370–3374 (2003).


  22. Anishchenko, I. et al. De novo protein design by deep network hallucination. Nature 600, 547–552 (2021).


  23. Gelman, S., Fahlberg, S. A., Heinzelman, P., Romero, P. A. & Gitter, A. Neural networks to learn protein sequence–function relationships from deep mutational scanning data. Proc. Natl Acad. Sci. USA 118, e2104878118 (2021).


  24. Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).


  25. Melamed, D., Young, D. L., Gamble, C. E., Miller, C. R. & Fields, S. Deep mutational scanning of an RRM domain of the Saccharomyces cerevisiae poly(A)-binding protein. RNA 19, 1537–1551 (2013).


  26. Starita, L. M. et al. Activity-enhancing mutations in an E3 ubiquitin ligase identified by high-throughput mutagenesis. Proc. Natl Acad. Sci. USA 110, E1263–E1272 (2013).


  27. Zeng, H. et al. ComplexContact: a web server for inter-protein contact prediction using deep learning. Nucleic Acids Res. 46, W432–W437 (2018).


  28. Bryant, P., Pozzati, G. & Elofsson, A. Improved prediction of protein–protein interactions using AlphaFold2. Nat. Commun. 13, 1265 (2022).


  29. Baek, M., Anishchenko, I., Park, H., Humphreys, I. R. & Baker, D. Protein oligomer modeling guided by predicted interchain contacts in CASP14. Proteins Struct. Funct. Bioinf. 89, 1824–1833 (2021).


  30. Ovchinnikov, S., Kamisetty, H. & Baker, D. Robust and accurate prediction of residue–residue interactions across protein interfaces using evolutionary information. eLife 3, e02030 (2014).


  31. Basu, S. & Wallner, B. DockQ: a quality measure for protein–protein docking models. PLoS ONE 11, e0161879 (2016).


  32. Du, Z. et al. The trRosetta server for fast and accurate protein structure prediction. Nat. Protoc. 16, 5634–5651 (2021).


  33. Li, W. & Godzik, A. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).


  34. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).


  35. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinform. 10, 421 (2009).


  36. Gao, S. H. et al. Res2Net: a new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 43, 652–662 (2021).


  37. Chaudhury, S., Lyskov, S. & Gray, J. J. PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics 26, 689–691 (2010).


  38. Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. Preprint at https://arxiv.org/abs/1503.02531 (2015).

  39. Wang, W., Peng, Z. & Yang, J. Source code and data for the paper “Single-sequence protein structure prediction using supervised transformer protein language models”. Zenodo https://doi.org/10.5281/zenodo.7264646 (2022).


Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (NSFC T2225007, T2222012, 11871290 and 61873185), and the Foundation for Innovative Research Groups of State Key Laboratory of Microbial Technology (WZCX2021-03).

Author information


Contributions

J.Y. designed the research. W.W. developed the pipeline and carried out the experiments. J.Y. and P.Z. supervised the research. All authors analyzed the data and wrote and revised the manuscript.

Corresponding author

Correspondence to Jianyi Yang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Arne Elofsson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Jie Pan, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Tables 1–6 and Figs. 1–11.

Reporting Summary

Source data

Source Data Fig. 1

The distance precisions and TM scores in Fig. 1.

Source Data Fig. 2

The predicted distances and structures for 7JJV and 2LSE in Fig. 2.

Source Data Fig. 3

Contact precision data for trRosettaX-Single and SPOT-Contact-LM in Fig. 3.

Source Data Fig. 4

Estimated TM scores for 2,000 hallucinated proteins and PDB files for three examples in Fig. 4.

Source Data Fig. 5

Data for mutation effect analysis in Fig. 5.

Source Data Fig. 6

Data for ablation study and estimated TM score in Fig. 6.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, W., Peng, Z. & Yang, J. Single-sequence protein structure prediction using supervised transformer protein language models. Nat Comput Sci 2, 804–814 (2022). https://doi.org/10.1038/s43588-022-00373-3

