Single-sequence protein structure prediction using supervised transformer protein language models

A preprint version of the article is available at bioRxiv.

Abstract

Significant progress has been made in protein structure prediction in recent years. However, it remains challenging for AlphaFold2 and other deep learning-based methods to predict protein structure with single-sequence input. Here we introduce trRosettaX-Single, an automated algorithm for single-sequence protein structure prediction. It incorporates the sequence embedding from a supervised transformer protein language model into a multi-scale network enhanced by knowledge distillation to predict inter-residue two-dimensional geometry, which is then used to reconstruct three-dimensional structures via energy minimization. Benchmark tests show that trRosettaX-Single outperforms AlphaFold2 and RoseTTAFold on orphan proteins and works well on human-designed proteins (with an average template modeling score (TM-score) of 0.79). A runtime test shows that the full trRosettaX-Single pipeline is about twice as fast as AlphaFold2 while using far fewer computing resources (<10% of those required by AlphaFold2). On 2,000 designed proteins from network hallucination, trRosettaX-Single generates structure models with high confidence. As a demonstration, trRosettaX-Single is applied to missense mutation analysis. These data suggest that trRosettaX-Single may find applications in protein design and related studies.
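
The pipeline summarized above (language-model embedding of the single sequence, a multi-scale two-dimensional network trained with knowledge distillation, and energy-minimization-based reconstruction) can be illustrated with the minimal Python sketch below. All function names, the embedding width and the distance binning are placeholders chosen for illustration; this is not the actual trRosettaX-Single code or API.

```python
# Minimal, hypothetical sketch of the single-sequence data flow described
# in the abstract: sequence -> language-model embedding -> predicted
# inter-residue 2D geometry -> 3D model. The stubs below only mimic the
# shapes of the data; they are not the trRosettaX-Single implementation.
import numpy as np

def embed_sequence(seq: str) -> np.ndarray:
    """Stand-in for the supervised transformer protein language model:
    one embedding vector per residue (random placeholders here)."""
    d_model = 768  # assumed embedding width
    return np.random.rand(len(seq), d_model)

def predict_2d_geometry(emb: np.ndarray) -> dict:
    """Stand-in for the multi-scale 2D network: builds pairwise features
    from the per-residue embeddings and predicts distance distributions."""
    L = emb.shape[0]
    pair = np.concatenate(
        [np.repeat(emb[:, None, :], L, axis=1),
         np.repeat(emb[None, :, :], L, axis=0)],
        axis=-1)  # (L, L, 2 * d_model) pairwise features
    n_bins = 37  # assumed number of distance bins
    dist = np.full((L, L, n_bins), 1.0 / n_bins)  # placeholder distributions
    return {"distance": dist, "pair_features": pair}

def fold_from_geometry(geometry: dict) -> str:
    """Stand-in for restraint-based energy minimization (the paper cites
    PyRosetta); here it only reports the size of the predicted geometry."""
    L = geometry["distance"].shape[0]
    return f"3D model reconstructed from {L}x{L} predicted 2D geometry"

toy_sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example
print(fold_from_geometry(predict_2d_geometry(embed_sequence(toy_sequence))))
```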


Fig. 1: The architecture and performance of trRosettaX-Single.
Fig. 2: Comparison between trRosettaX-Single, AlphaFold2 and RoseTTAFold on two example proteins.
Fig. 3: Comparison between trRosettaX-Single and SPOT-Contact-LM.
Fig. 4: Application to hallucinated proteins.
Fig. 5: Mutation analysis on human-designed proteins and three deep mutational scanning datasets.
Fig. 6: Ablation study and estimation of model accuracy.


Data availability

The data supporting the findings and conclusions of this study are available in this paper and its Supplementary Information. All of the training and test data used in this work are available at Zenodo (ref. 39) and on our website (https://yanglab.nankai.edu.cn/trRosetta/benchmark_single/). The experimental 3D structures can be downloaded from the PDB (https://www.rcsb.org/). The Orphan25 dataset includes 25 natural proteins that were published after May 2020 and have no sequence homologs in UniRef50_2018_03 according to an MMseqs2 search at an e-value cut-off of 0.05. The Design55 dataset includes 55 human-designed proteins, between 50 and 300 amino acids in length, that have no sequence homologs in UniRef50_2018_03. We removed designed proteins with simple topologies (for example, a single alpha helix) or with hits in the training sets at an e-value cut-off of 0.1 by PSI-BLAST. Source data for Figs. 1b,c and 2–6 are provided with this paper.
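
As an illustration of the homolog filters described above, the sketch below keeps only query proteins with no hit below a given e-value cut-off in a tabular search result (an MMseqs2 search against UniRef50_2018_03 at 0.05 for Orphan25; PSI-BLAST against the training sets at 0.1 for Design55). The file names and helper functions are hypothetical, and the sketch assumes results in the standard 12-column BLAST tabular (m8/outfmt 6) format with the e-value in the eleventh column; it is not the authors' script.

```python
# Hypothetical sketch of the dataset filtering criterion: retain proteins
# with no sequence homolog at the chosen e-value cut-off, given a tabular
# (BLAST m8 / outfmt 6) search result where column 11 is the e-value.
import csv
from typing import Iterable, List, Set

def queries_with_hits(m8_path: str, evalue_cutoff: float) -> Set[str]:
    """Query IDs that have at least one hit with e-value <= cutoff."""
    hits: Set[str] = set()
    with open(m8_path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            query, evalue = row[0], float(row[10])
            if evalue <= evalue_cutoff:
                hits.add(query)
    return hits

def keep_non_homologous(candidate_ids: Iterable[str],
                        m8_path: str,
                        evalue_cutoff: float = 0.05) -> List[str]:
    """Candidates with no homolog at the cut-off (0.05 for Orphan25;
    0.1 was used with PSI-BLAST against the training sets)."""
    return sorted(set(candidate_ids) - queries_with_hits(m8_path, evalue_cutoff))

# Example usage with hypothetical inputs:
# orphan_ids = keep_non_homologous(candidate_ids, "uniref50_search.m8", 0.05)
```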

Code availability

The source code is available at Zenodo (ref. 39) and on our website (https://yanglab.nankai.edu.cn/trRosetta/benchmark_single/).

References

  1. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).


  2. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).


  3. Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).


  4. Su, H. et al. Improved protein structure prediction using a new multi-scale network and homologous templates. Adv. Sci. 8, 2102592 (2021).


  5. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).


  6. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).


  7. Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).


  8. Madani, A. et al. ProGen: language modeling for protein generation. Preprint at bioRxiv https://doi.org/10.1101/2020.03.07.982272 (2020).

  9. Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).


  10. Rao, R., Meier, J., Sercu, T., Ovchinnikov, S. & Rives, A. Transformer protein language models are unsupervised structure learners. in International Conference on Learning Representations 2021 (OpenReview.net, 2021).

  11. Vaswani, A. et al. Attention is all you need. in Proc. 31st International Conference on Neural Information Processing Systems 6000–6010 (Curran Associates, 2017).

  12. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. in Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics 4171–4186 (Association for Computational Linguistics, 2019).

  13. Chen, M. C., Li, Y., Zhu, Y. H., Ge, F. & Yu, D. J. SSCpred: single-sequence-based protein contact prediction using deep fully convolutional network. J. Chem. Inf. Model. 60, 3295–3303 (2020).


  14. Singh, J., Litfin, T., Singh, J., Paliwal, K. & Zhou, Y. SPOT-Contact-LM: improving single-sequence-based prediction of protein contact map using a transformer language model. Bioinformatics 38, 1888–1894 (2022).


  15. Chowdhury, R. et al. Single-sequence protein structure prediction using language models from deep learning. Nat. Biotechnol. 40, 1617–1623 (2022).

  16. Du, Z., Peng, Z. & Yang, J. Toward the assessment of predicted inter-residue distance. Bioinformatics 38, 962–969 (2022).


  17. Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins Struct. Funct. Bioinf. 57, 702–710 (2004).


  18. Söding, J. Protein homology detection by HMM–HMM comparison. Bioinformatics 21, 951–960 (2005).


  19. Xu, J., McPartlon, M. & Li, J. Improved protein structure prediction by deep learning irrespective of co-evolution information. Nat. Mach. Intell. 3, 601–609 (2021).


  20. Graves, A. & Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18, 602–610 (2005).

  21. Zemla, A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 31, 3370–3374 (2003).


  22. Anishchenko, I. et al. De novo protein design by deep network hallucination. Nature 600, 547–552 (2021).


  23. Gelman, S., Fahlberg, S. A., Heinzelman, P., Romero, P. A. & Gitter, A. Neural networks to learn protein sequence–function relationships from deep mutational scanning data. Proc. Natl Acad. Sci. USA 118, e2104878118 (2021).


  24. Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).


  25. Melamed, D., Young, D. L., Gamble, C. E., Miller, C. R. & Fields, S. Deep mutational scanning of an RRM domain of the Saccharomyces cerevisiae poly(A)-binding protein. RNA 19, 1537–1551 (2013).


  26. Starita, L. M. et al. Activity-enhancing mutations in an E3 ubiquitin ligase identified by high-throughput mutagenesis. Proc. Natl Acad. Sci. USA 110, E1263–E1272 (2013).


  27. Zeng, H. et al. ComplexContact: a web server for inter-protein contact prediction using deep learning. Nucleic Acids Res. 46, W432–W437 (2018).


  28. Bryant, P., Pozzati, G. & Elofsson, A. Improved prediction of protein–protein interactions using AlphaFold2. Nat. Commun. 13, 1265 (2022).


  29. Baek, M., Anishchenko, I., Park, H., Humphreys, I. R. & Baker, D. Protein oligomer modeling guided by predicted interchain contacts in CASP14. Proteins Struct. Funct. Bioinf. 89, 1824–1833 (2021).


  30. Ovchinnikov, S., Kamisetty, H. & Baker, D. Robust and accurate prediction of residue–residue interactions across protein interfaces using evolutionary information. eLife 3, e02030 (2014).


  31. Basu, S. & Wallner, B. DockQ: a quality measure for protein–protein docking models. PLoS ONE 11, e0161879 (2016).


  32. Du, Z. et al. The trRosetta server for fast and accurate protein structure prediction. Nat. Protoc. 16, 5634–5651 (2021).


  33. Li, W. & Godzik, A. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).


  34. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).


  35. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinform. 10, 421 (2009).


  36. Gao, S. H. et al. Res2Net: a new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 43, 652–662 (2021).


  37. Chaudhury, S., Lyskov, S. & Gray, J. J. PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics 26, 689–691 (2010).


  38. Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. Preprint at https://arxiv.org/abs/1503.02531 (2015).

  39. Wang, W., Peng, Z. & Yang, J. Source code and data for the paper “Single-sequence protein structure prediction using supervised transformer protein language models”. Zenodo https://doi.org/10.5281/zenodo.7264646 (2022).


Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (NSFC T2225007, T2222012, 11871290 and 61873185), and the Foundation for Innovative Research Groups of State Key Laboratory of Microbial Technology (WZCX2021-03).

Author information


Contributions

J.Y. designed the research. W.W. developed the pipeline and carried out the experiments. J.Y. and P.Z. supervised the research. All authors analyzed the data and wrote and revised the manuscript.

Corresponding author

Correspondence to Jianyi Yang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Arne Elofsson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Jie Pan, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Tables 1–6 and Figs. 1–11.

Reporting Summary

Source data

Source Data Fig. 1

The distance precisions and TM scores in Fig. 1.

Source Data Fig. 2

The predicted distances and structures for 7JJV and 2LSE in Fig. 2.

Source Data Fig. 3

Contact precision data for trRosettaX-Single and SPOT-Contact-LM in Fig. 3.

Source Data Fig. 4

Estimated TM scores for 2,000 hallucinated proteins and PDB files for three examples in Fig. 4.

Source Data Fig. 5

Data for mutation effect analysis in Fig. 5.

Source Data Fig. 6

Data for ablation study and estimated TM score in Fig. 6.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, W., Peng, Z. & Yang, J. Single-sequence protein structure prediction using supervised transformer protein language models. Nat Comput Sci 2, 804–814 (2022). https://doi.org/10.1038/s43588-022-00373-3

