Abstract
Several previously proposed deep learning methods to design amino acid sequences that autonomously fold into a given protein backbone yielded promising results in computational tests but did not outperform conventional energy function-based methods in wet experiments. Here we present the ABACUS-R method, which uses an encoder–decoder network trained using a multitask learning strategy to predict the sidechain type of a central residue from its three-dimensional local environment, which includes, besides other features, the types but not the conformations of the surrounding sidechains. This eliminates the need to reconstruct and optimize sidechain structures and drastically simplifies the sequence design process. Iteratively applying the encoder–decoder to different central residues thus produces self-consistent overall sequences for a target backbone. Results of wet experiments, including five structures solved by X-ray crystallography, show that ABACUS-R outperforms state-of-the-art energy function-based methods in success rate and design precision.
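The self-consistent iteration described above can be illustrated with a minimal Python sketch. It is not the ABACUS-R implementation: `toy_predict` is a hypothetical, hand-written scoring rule standing in for the trained encoder–decoder, and the burial flags, the neighbour lists and the fixed sweep order are assumptions made only for illustration. What the sketch keeps from the method is the central idea that each residue type is predicted from the types (not the conformations) of its neighbours, and that the prediction is repeated over positions until no residue changes.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AFILMVW")
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}


def toy_predict(buried, neighbour_types):
    """Hypothetical stand-in for the trained encoder-decoder (an assumption).

    Scores every residue type for one central position from a burial flag
    and the *types* of the neighbouring residues, then returns the
    best-scoring type. Ties go to the first type in AMINO_ACIDS order.
    """
    def score(aa):
        # Hydrophobic types are favoured when buried, polar types when exposed.
        s = 1.0 if (aa in HYDROPHOBIC) == buried else 0.0
        for nb in neighbour_types:
            # Reward opposite charges, penalise like charges.
            s -= 0.5 * CHARGE.get(aa, 0) * CHARGE.get(nb, 0)
        return s

    return max(AMINO_ACIDS, key=score)


def design_sequence(buried, neighbours, max_sweeps=50, seed=0):
    """Re-predict each position from its current neighbours until stable."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in buried]  # random starting sequence
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(seq)):
            new = toy_predict(buried[i], [seq[j] for j in neighbours[i]])
            if new != seq[i]:
                seq[i] = new
                changed = True
        if not changed:
            # Self-consistent: every residue equals its own prediction.
            return "".join(seq), True
    return "".join(seq), False


# Toy "backbone": eight positions on a chain, each neighbouring i-1 and i+1.
buried = [True, False, False, True, True, False, False, True]
neighbours = [
    [j for j in (i - 1, i + 1) if 0 <= j < len(buried)]
    for i in range(len(buried))
]
sequence, converged = design_sequence(buried, neighbours)
print(sequence, converged)
```

Because the toy predictor never looks at sidechain conformations, no rotamer reconstruction or packing step appears anywhere in the loop, which is the simplification the abstract refers to. A real predictor would replace `toy_predict` with a learned model, and the update schedule used here should be read as one simple choice among several.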
Data availability
The following data are available from Zenodo (ref. 63): complete lists of proteins for training and testing the models; the amino acid sequences designed for the 100 targets by Modeleval; and the amino acid sequences and DNA sequences of the experimentally examined proteins. The experimentally solved protein structures have been deposited in the PDB under accession codes 7VQL (1r26-A3, 10.2210/pdb7VQL/pdb); 7VQV (1r26-A6, 10.2210/pdb7VQV/pdb); 7VQW (1r26-A7, 10.2210/pdb7VQW/pdb); 7VTY (1cy5-A7, 10.2210/pdb7VTY/pdb); and 7VU4 (1r26-B4, 10.2210/pdb7VU4/pdb). Source Data are provided with this paper.
Code availability
The source code is available from Code Ocean (ref. 64) at https://doi.org/10.24433/CO.3351944.v1.
Change history
02 August 2022
A Correction to this paper has been published: https://doi.org/10.1038/s43588-022-00305-1
References
Kuhlman, B. & Bradley, P. Advances in protein structure prediction and design. Nat. Rev. Mol. Cell Biol. 20, 681–697 (2019).
Huang, P.-S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320–327 (2016).
Silva, D.-A. et al. De novo design of potent and selective mimics of IL-2 and IL-15. Nature 565, 186–191 (2019).
Cao, L. et al. De novo design of picomolar SARS-CoV-2 miniprotein inhibitors. Science 370, 426–431 (2020).
Siegel, J. B. et al. Computational design of an enzyme catalyst for a stereoselective bimolecular Diels–Alder reaction. Science 329, 309–313 (2010).
Cui, Y. et al. Development of a versatile and efficient C–N lyase platform for asymmetric hydroamination via computational enzyme redesign. Nat. Catal. 4, 364–373 (2021).
Kuhlman, B. et al. Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364–1368 (2003).
Leaver-Fay, A. et al. ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol. 487, 545–574 (2011).
Xiong, P. et al. Protein design with a comprehensive statistical energy function and boosted by experimental selection for foldability. Nat. Commun. 5, 1–9 (2014).
Xiong, P. et al. Increasing the efficiency and accuracy of the ABACUS protein sequence design method. Bioinformatics 36, 136–144 (2020).
Rocklin, G. J. et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357, 168–175 (2017).
Chevalier, A. et al. Massively parallel de novo protein design for targeted therapeutics. Nature 550, 74–79 (2017).
Johansson, K. E. et al. Computational redesign of thioredoxin is hypersensitive toward minor conformational changes in the backbone template. J. Mol. Biol. 428, 4361–4377 (2016).
Marin, F. I., Johansson, K. E., O’Shea, C., Lindorff-Larsen, K. & Winther, J. R. Computational and experimental assessment of backbone templates for computational redesign of the thioredoxin fold. J. Phys. Chem. B 125, 11141–11149 (2021).
Murphy, G. S. et al. Increasing sequence diversity with flexible backbone protein design: the complete redesign of a protein hydrophobic core. Structure 20, 1086–1096 (2012).
Zhou, J., Panaitiu, A. E. & Grigoryan, G. A general-purpose protein design framework based on mining sequence-structure relationships in known protein structures. Proc. Natl Acad. Sci. USA 117, 1059–1068 (2020).
Anand, N. et al. Protein sequence design with a learned potential. Nat. Commun. 13, 1–11 (2022).
Dahiyat, B. I. & Mayo, S. L. De novo protein design: fully automated sequence selection. Science 278, 82–87 (1997).
Simonson, T. et al. Computational protein design: the Proteus software and selected applications. J. Comput. Chem. 34, 2472–2484 (2013).
Huang, X., Pearce, R. & Zhang, Y. EvoEF2: accurate and fast energy function for computational protein design. Bioinformatics 36, 1135–1142 (2020).
Liang, S., Li, Z., Zhan, J. & Zhou, Y. De novo protein design by an energy function based on series expansion in distance and orientation dependence. Bioinformatics 38, 86–93 (2021).
Zhou, X. et al. Proteins of well-defined structures can be designed without backbone readjustment by a statistical model. J. Struct. Biol. 196, 350–357 (2016).
Han, M. et al. Selection and analyses of variants of a designed protein suggest importance of hydrophobicity of partially buried sidechains for protein stability at high temperatures. Protein Sci. 28, 1437–1447 (2019).
Liu, R., Wang, J., Xiong, P., Chen, Q. & Liu, H. De novo sequence redesign of a functional Ras-binding domain globally inverted the surface charge distribution and led to extreme thermostability. Biotechnol. Bioeng. 118, 2031–2042 (2021).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Wang, S., Li, W., Liu, S. & Xu, J. RaptorX-property: a web server for protein structure property prediction. Nucl. Acids Res. 44, W430–W435 (2016).
Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Ingraham, J., Garg, V. K., Barzilay, R. & Jaakkola, T. Generative models for graph-based protein design. In Advances in Neural Information Processing Systems Vol 32 (NeurIPS, 2019).
Strokach, A., Becerra, D., Corbi-Verge, C., Perez-Riba, A. & Kim, P. M. Fast and flexible protein design using deep graph neural networks. Cell Syst. 11, 402–411 (2020).
Qi, Y. & Zhang, J. Z. H. DenseCPD: improving the accuracy of neural-network-based computational protein sequence design with DenseNet. J. Chem. Inform. Model. 60, 1245–1252 (2020).
Zhang, Y. et al. ProDCoNN: protein design using a convolutional neural network. Proteins Struct. Funct. Bioinform. 88, 819–829 (2020).
Torng, W. & Altman, R. B. 3D deep convolutional neural networks for amino acid environment similarity analysis. BMC Bioinform. 18, 1–23 (2017).
Chen, S. et al. To improve protein sequence profile prediction through image captioning on pairwise residue distance map. J. Chem. Inform. Model. 60, 391–399 (2019).
Ovchinnikov, S. & Huang, P.-S. Structure-based protein design with deep learning. Curr. Opin. Chem. Biol. 65, 136–144 (2021).
Sillitoe, I. et al. CATH: increased structural coverage of functional space. Nucl. Acids Res. 49, D266–D273 (2021).
Ruder, S. An overview of multi-task learning in deep neural networks. Preprint at https://arxiv.org/abs/1706.05098 (2017).
Bava, K. A., Gromiha, M. M., Uedaira, H., Kitajima, K. & Sarai, A. ProTherm, version 4.0: thermodynamic database for proteins and mutants. Nucl. Acids Res. 32, D120–D121 (2004).
Jing, B., Eismann, S., Suriana, P., Townshend, R. J. L. & Dror, R. Learning from protein structure with geometric vector perceptrons. In International Conference on Learning Representations (ICLR, 2021).
Li, A. J., Sundar, V., Grigoryan, G. & Keating, A. E. TERMinator: a neural framework for structure-based protein design using tertiary repeating motifs. Preprint at https://arxiv.org/abs/2204.13048 (2022).
Conway, P., Tyka, M. D., DiMaio, F., Konerding, D. E. & Baker, D. Relaxation of backbone bond geometry improves protein energy landscape modeling. Protein Sci. 23, 47–55 (2014).
Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucl. Acids Res. 33, 2302–2309 (2005).
Huang, B. et al. A backbone-centred energy function of neural networks for protein design. Nature 602, 523–528 (2022).
Remmert, M., Biegert, A., Hauser, A. & Söding, J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods 9, 173–175 (2012).
Buel, G. R. & Walters, K. J. Can AlphaFold2 predict the impact of missense mutations on structure? Nat. Struct. Mol. Biol. 29, 1–2 (2022).
Morcos, F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl Acad. Sci. USA 108, E1293–E1301 (2011).
Krivov, G. G., Shapovalov, M. V. & Dunbrack, R. L. Jr. Improved prediction of protein side-chain conformations with SCWRL4. Proteins Struct. Funct. Bioinform. 77, 778–795 (2009).
Bengio, S., Vinyals, O., Jaitly, N. & Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems Vol. 28 (NeurIPS, 2015).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems Vol. 30 (NeurIPS, 2017).
Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. Preprint at https://arxiv.org/abs/1607.06450 (2016).
Frishman, D. & Argos, P. Knowledge-based protein secondary structure assignment. Proteins Struct. Funct. Bioinform. 23, 566–579 (1995).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems Vol. 32 (NeurIPS, 2019).
Kingma, D. & Ba, J. Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR, 2015).
Wang, G. & Dunbrack, R. L. Jr. PISCES: a protein sequence culling server. Bioinformatics 19, 1589–1591 (2003).
The PyMOL Molecular Graphics System v.1.8 (Schrödinger, LLC, 2015).
Delaglio, F. et al. NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J. Biomol. NMR 6, 277–293 (1995).
Lee, W., Westler, W. M., Bahrami, A., Eghbalnia, H. R. & Markley, J. L. PINE-SPARKY: graphical interface for evaluating automated probabilistic peak assignments in protein NMR spectroscopy. Bioinformatics 25, 2085–2087 (2009).
Zhang, W.-Z. et al. The protein complex crystallography beamline (BL19U1) at the Shanghai Synchrotron Radiation Facility. Nucl. Sci. Tech. 30, 1–11 (2019).
Otwinowski, Z. & Minor, W. Processing of X-ray diffraction data collected in oscillation mode. Methods Enzymol. 276, 307–326 (1997).
Kabsch, W. Integration, scaling, space-group assignment and post-refinement. Acta Crystallogr. D 66, 133–144 (2010).
Vagin, A. & Teplyakov, A. Molecular replacement with MOLREP. Acta Crystallogr. D 66, 22–25 (2010).
Adams, P. D. et al. PHENIX: building new software for automated crystallographic structure determination. Acta Crystallogr. D 58, 1948–1954 (2002).
Liu, Y. Rotamer-Free Protein Sequence Design Based on Deep Learning and Self-Consistency (Zenodo, 2022); https://doi.org/10.5281/zenodo.6592054.
Liu, Y. et al. ABACUS-R: Rotamer-free protein sequence design based on deep learning and self-consistency (Code Ocean, 2022); https://doi.org/10.24433/CO.3351944.v1.
Acknowledgements
This work was supported by the National Key R&D Program of China (grant nos. 2018YFA0900703 to H.Y.L. and 2018YFA0901600 to Q.C.), the National Natural Science Foundation of China (grant nos. 21773220 to H.Y.L., and 31971175 and 32171411 to Q.C.), the Tianjin Synthetic Biotechnology Innovation Capacity Improvement Project (grant no. TSBICIP-PTJS-001 to H.Y.L.) and the Youth Innovation Promotion Association, Chinese Academy of Sciences (grant no. 2017494 to Q.C.). We thank the staff of the BL19U1 and BL02U1 beamlines of the National Facility for Protein Science in Shanghai (NFPS) and of the Shanghai Synchrotron Radiation Facility for assistance during crystallographic data collection. We thank M. Lv and Y. Yun for their assistance with X-ray diffraction data collection and processing.
Author information
Contributions
H.Y.L., H.Q.L., Y.F.L. and W.L.W. conceived the computational framework. Q.C., L.Z. and Y.F.L. designed the experimental study. Y.F.L. and W.L.W. wrote the computer programs and performed the calculations under the supervision of H.Y.L. and H.Q.L. L.Z. performed experimental analyses under the supervision of Q.C. and H.Y.L. M.Z., C.C.W. and F.D.L. analyzed the crystallographic data. J.H.Z. collected and processed NMR data. Y.F.L., W.L.W. and H.Y.L. wrote the paper with input from all of the other authors.
Ethics declarations
Competing interests
H.Y.L., Q.C., H.Q.L., Y.F.L. and W.L.W. have filed a patent application (no. 202210091553.7) relating to rotamer-free protein sequence design in the name of the University of Science and Technology of China. The other authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Jue Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Jie Pan, in collaboration with the Nature Computational Science team. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–21, Tables 1–14 and references.
Supplementary Data 1
Raw data for Supplementary Figs. 1a–d, 2a,b, 3a,c, 5c, 6a, 7 and 11, and protein lists.
Source data
Source Data Fig. 2
Raw data of recovery rate of Modeleval for single residues in a test set.
Source Data Fig. 3
Intermediate results for designing the overall sequences for 100 targets during self-consistency iteration. Overall sequence recovery rate and Rosetta energy for ABACUS-R-designed sequences. ABACUS-R-designed sequences for 100 targets.
Source Data Fig. 4
Raw data for three fast protein liquid chromatography experiments, three 1H NMR spectra, three DSC spectra, one HSQC spectrum and two crystal structures for three ABACUS-R designed sequences.
Source Data Fig. 5
PDB files and validation reports of X-ray structures, including those shown in this figure.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, Y., Zhang, L., Wang, W. et al. Rotamer-free protein sequence design based on deep learning and self-consistency. Nat Comput Sci 2, 451–462 (2022). https://doi.org/10.1038/s43588-022-00273-6
DOI: https://doi.org/10.1038/s43588-022-00273-6