
Single-sequence protein structure prediction using supervised transformer protein language models

A preprint version of the article is available at bioRxiv.


Significant progress has been made in protein structure prediction in recent years. However, it remains challenging for AlphaFold2 and other deep learning-based methods to predict protein structure with single-sequence input. Here we introduce trRosettaX-Single, an automated algorithm for single-sequence protein structure prediction. It incorporates the sequence embedding from a supervised transformer protein language model into a multi-scale network enhanced by knowledge distillation to predict inter-residue two-dimensional geometry, which is then used to reconstruct three-dimensional structures via energy minimization. Benchmark tests show that trRosettaX-Single outperforms AlphaFold2 and RoseTTAFold on orphan proteins and works well on human-designed proteins (with an average template modeling score (TM-score) of 0.79). An experimental test shows that the full trRosettaX-Single pipeline is twice as fast as AlphaFold2 while using far fewer computing resources (<10%). On 2,000 designed proteins from network hallucination, trRosettaX-Single generates structure models with high confidence. As a demonstration, trRosettaX-Single is applied to missense mutation analysis. These data suggest that trRosettaX-Single may find potential applications in protein design and related studies.
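The TM-score used to benchmark the models above (Zhang & Skolnick, ref. 17) scores a model against the native structure on a 0–1 scale with a length-dependent normalization. As a minimal sketch of the published formula, assuming the model and native Cα coordinates are already matched and superposed (the optimal-superposition search of the full TM-score program is omitted here):

```python
import numpy as np

def tm_score(coords_model, coords_native):
    """TM-score for pre-aligned structures (Zhang & Skolnick, 2004).

    Assumes coords are (L, 3) arrays of matched, superposed Ca atoms;
    the rotation/translation search of the real TM-score program is
    omitted for brevity.
    """
    L = len(coords_native)
    d0 = 1.24 * (L - 15) ** (1.0 / 3.0) - 1.8  # length-dependent distance scale
    d0 = max(d0, 0.5)                          # floor applied for short chains
    d = np.linalg.norm(coords_model - coords_native, axis=1)  # per-residue deviation
    return float(np.mean(1.0 / (1.0 + (d / d0) ** 2)))
```

A perfect model (zero deviation at every residue) scores exactly 1.0, and large deviations contribute almost nothing, which is why TM-score is less sensitive to local errors than r.m.s.d.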


Fig. 1: The architecture and performance of trRosettaX-Single.
Fig. 2: Comparison between trRosettaX-Single, AlphaFold2 and RoseTTAFold on two example proteins.
Fig. 3: Comparison between trRosettaX-Single and SPOT-Contact-LM.
Fig. 4: Application to hallucinated proteins.
Fig. 5: Mutation analysis on human-designed proteins and three deep mutational scanning datasets.
Fig. 6: Ablation study and estimation of model accuracy.

Data availability

The data supporting the findings and conclusions of this study are available in this paper and its Supplementary Information. All of the training and test data used in this work are available at Zenodo39 and on our website. The experimental 3D structures can be downloaded from the PDB. The Orphan25 dataset includes 25 natural proteins that were published after May 2020 and have no sequence homologs in UniRef50_2018_03 according to an MMseqs2 search at an e-value cut-off of 0.05. The Design55 dataset includes 55 human-designed proteins that have no sequence homologs in UniRef50_2018_03. The designed proteins range from 50 to 300 amino acids in length. We removed proteins with simple topologies (for example, a single alpha helix) or with hits in the training sets at an e-value cut-off of 0.1 by PSI-BLAST. Source data for Figs. 1b,c and 2–6 are provided with this paper.
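The homolog filtering described above can be sketched as a post-processing step on the search output: MMseqs2 (and PSI-BLAST with tabular output) report hits in the 12-column BLAST-style .m8 format, where column 11 is the e-value. A minimal example, assuming that format (the function names are illustrative, not from the paper's code):

```python
# Filter search hits in BLAST-tab (.m8) format by e-value; a query is
# treated as an "orphan" only if it has no hit at or below the cut-off.
# In the standard 12-column format, field 11 (0-based index 10) is the e-value.

def queries_with_homologs(m8_lines, e_cutoff=0.05):
    """Return the set of query IDs with at least one hit passing e_cutoff."""
    hits = set()
    for line in m8_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated lines
        query, evalue = fields[0], float(fields[10])
        if evalue <= e_cutoff:
            hits.add(query)
    return hits

def orphan_queries(all_queries, m8_lines, e_cutoff=0.05):
    """Queries with no homolog found in the searched database."""
    return set(all_queries) - queries_with_homologs(m8_lines, e_cutoff)
```

Queries absent from the hit table entirely, or whose best hit is weaker than the cut-off (0.05 for Orphan25, 0.1 for the training-set screen), are retained as non-homologous.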

Code availability

The source code is available at Zenodo39 and on our website.


  1. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).


  2. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).


  3. Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496 (2020).


  4. Su, H. et al. Improved protein structure prediction using a new multi-scale network and homologous templates. Adv. Sci. 8, 2102592 (2021).


  5. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).


  6. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).


  7. Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process Syst. 32, 9689–9701 (2019).


  8. Madani, A. et al. ProGen: language modeling for protein generation. Preprint at bioRxiv (2020).

  9. Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).


  10. Rao, R., Meier, J., Sercu, T., Ovchinnikov, S. & Rives, A. Transformer protein language models are unsupervised structure learners. in International Conference on Learning Representations (2021).

  11. Vaswani, A. et al. Attention is all you need. in Proc. 31st International Conference on Neural Information Processing Systems 6000–6010 (Curran Associates, 2017).

  12. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. in Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics 4171–4186 (Association for Computational Linguistics, 2019).

  13. Chen, M. C., Li, Y., Zhu, Y. H., Ge, F. & Yu, D. J. SSCpred: single-sequence-based protein contact prediction using deep fully convolutional network. J. Chem. Inf. Model. 60, 3295–3303 (2020).


  14. Singh, J., Litfin, T., Singh, J., Paliwal, K. & Zhou, Y. SPOT-Contact-LM: improving single-sequence-based prediction of protein contact map using a transformer language model. Bioinformatics 38, 1888–1894 (2022).


  15. Chowdhury, R. et al. Single-sequence protein structure prediction using language models from deep learning. Nat. Biotechnol. 40, 1617–1623 (2022).

  16. Du, Z., Peng, Z. & Yang, J. Toward the assessment of predicted inter-residue distance. Bioinformatics 38, 962–969 (2022).


  17. Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins Struct. Funct. Bioinf. 57, 702–710 (2004).


  18. Söding, J. Protein homology detection by HMM–HMM comparison. Bioinformatics 21, 951–960 (2005).


  19. Xu, J., McPartlon, M. & Li, J. Improved protein structure prediction by deep learning irrespective of co-evolution information. Nat. Mach. Intell. 3, 601–609 (2021).


  20. Graves, A. & Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18, 602–610 (2005).

  21. Zemla, A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 31, 3370–3374 (2003).


  22. Anishchenko, I. et al. De novo protein design by deep network hallucination. Nature 600, 547–552 (2021).


  23. Gelman, S., Fahlberg, S. A., Heinzelman, P., Romero, P. A. & Gitter, A. Neural networks to learn protein sequence–function relationships from deep mutational scanning data. Proc. Natl Acad. Sci. USA 118, e2104878118 (2021).


  24. Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).


  25. Melamed, D., Young, D. L., Gamble, C. E., Miller, C. R. & Fields, S. Deep mutational scanning of an RRM domain of the Saccharomyces cerevisiae poly(A)-binding protein. RNA 19, 1537–1551 (2013).


  26. Starita, L. M. et al. Activity-enhancing mutations in an E3 ubiquitin ligase identified by high-throughput mutagenesis. Proc. Natl Acad. Sci. USA 110, E1263–E1272 (2013).


  27. Zeng, H. et al. ComplexContact: a web server for inter-protein contact prediction using deep learning. Nucleic Acids Res. 46, W432–W437 (2018).


  28. Bryant, P., Pozzati, G. & Elofsson, A. Improved prediction of protein–protein interactions using AlphaFold2. Nat. Commun. 13, 1265 (2022).


  29. Baek, M., Anishchenko, I., Park, H., Humphreys, I. R. & Baker, D. Protein oligomer modeling guided by predicted interchain contacts in CASP14. Proteins Struct. Funct. Bioinf. 89, 1824–1833 (2021).


  30. Ovchinnikov, S., Kamisetty, H. & Baker, D. Robust and accurate prediction of residue–residue interactions across protein interfaces using evolutionary information. eLife 3, e02030 (2014).


  31. Basu, S. & Wallner, B. DockQ: a quality measure for protein–protein docking models. PLoS ONE 11, e0161879 (2016).


  32. Du, Z. et al. The trRosetta server for fast and accurate protein structure prediction. Nat. Protoc. 16, 5634–5651 (2021).


  33. Li, W. & Godzik, A. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).


  34. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).


  35. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinform. 10, 421 (2009).


  36. Gao, S. H. et al. Res2Net: a new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 43, 652–662 (2021).


  37. Chaudhury, S., Lyskov, S. & Gray, J. J. PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics 26, 689–691 (2010).


  38. Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. Preprint at arXiv (2015).

  39. Wang, W., Peng, Z. & Yang, J. Source code and data for the paper “Single-sequence protein structure prediction using supervised transformer protein language models”. Zenodo (2022).



This work is supported in part by the National Natural Science Foundation of China (NSFC T2225007, T2222012, 11871290 and 61873185), and the Foundation for Innovative Research Groups of State Key Laboratory of Microbial Technology (WZCX2021-03).

Author information

J.Y. designed the research. W.W. developed the pipeline and carried out the experiments. J.Y. and P.Z. supervised the research. All authors analyzed the data and wrote and revised the manuscript.

Corresponding author

Correspondence to Jianyi Yang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Arne Elofsson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Jie Pan, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Tables 1–6 and Figs. 1–11.

Reporting Summary

Source data

Source Data Fig. 1

The distance precisions and TM scores in Fig. 1.

Source Data Fig. 2

The predicted distances and structures for 7JJV and 2LSE in Fig. 2.

Source Data Fig. 3

Contact precision data for trRosettaX-Single and SPOT-Contact-LM in Fig. 3.

Source Data Fig. 4

Estimated TM scores for 2,000 hallucinated proteins and PDB files for three examples in Fig. 4.

Source Data Fig. 5

Data for mutation effect analysis in Fig. 5.

Source Data Fig. 6

Data for ablation study and estimated TM score in Fig. 6.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, W., Peng, Z. & Yang, J. Single-sequence protein structure prediction using supervised transformer protein language models. Nat Comput Sci 2, 804–814 (2022).

