A backbone-centred energy function of neural networks for protein design

Huang, Bin; Xu, Yang; Hu, Xiuhong; Liu, Yongrui; Liao, Shanhui; Zhang, Jiahai; Huang, Chengdong; Hong, Jingjun; Chen, Quan; Liu, Haiyan

doi:10.1038/s41586-021-04383-5

Article
Published: 09 February 2022

Nature volume 602, pages 523–528 (2022)

Abstract

A protein backbone structure is designable if a substantial number of amino acid sequences exist that autonomously fold into it^1,2. It has been suggested that the designability of backbones is governed mainly by side chain-independent or side chain type-insensitive molecular interactions^3,4,5, indicating an approach for designing new backbones (ready for amino acid selection) based on continuous sampling and optimization of the backbone-centred energy surface. However, a sufficiently comprehensive and precise energy function has yet to be established for this purpose. Here we show that this goal is met by a statistical model named SCUBA (for Side Chain-Unknown Backbone Arrangement) that uses neural network-form energy terms. These terms are learned with a two-step approach that comprises kernel density estimation followed by neural network training and can analytically represent multidimensional, high-order correlations in known protein structures. We report the crystal structures of nine de novo proteins whose backbones were designed to high precision using SCUBA, four of which have novel, non-natural overall architectures. By eschewing use of fragments from existing protein structures, SCUBA-driven structure design facilitates far-reaching exploration of the designable backbone space, thus extending the novelty and diversity of the proteins amenable to de novo design.
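The two-step learning scheme mentioned in the abstract can be illustrated with a toy example. The sketch below is not the SCUBA implementation. It assumes a single one-dimensional variable, a Gaussian kernel, an energy label defined as the negative log ratio of an observed density to a reference density, and a tiny tanh network trained by plain gradient descent. All function names, the bandwidth, the network size and the synthetic data are hypothetical choices made for illustration.

```python
import math
import random


def kde(samples, x, bandwidth):
    """Gaussian kernel density estimate of `samples` evaluated at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)


def statistical_energy(observed, reference, x, bandwidth=0.3):
    """Step 1 (assumed form): energy = -ln(observed density / reference density)."""
    eps = 1e-12  # guards against log(0) in sparsely sampled regions
    p_obs = kde(observed, x, bandwidth) + eps
    p_ref = kde(reference, x, bandwidth) + eps
    return -math.log(p_obs / p_ref)


def train_mlp(xs, ys, hidden=8, lr=0.01, epochs=300, seed=0):
    """Step 2: fit a one-hidden-layer tanh network to the energy labels.

    Returns (predict, initial_mse, final_mse).
    """
    rng = random.Random(seed)
    w1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = [0.0]  # kept in a list so the closures below see in-place updates

    def predict(x):
        return b2[0] + sum(w2[j] * math.tanh(w1[j] * x + b1[j]) for j in range(hidden))

    def mse():
        return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

    initial = mse()
    n = len(xs)
    for _ in range(epochs):
        g_w1, g_b1, g_w2, g_b2 = [0.0] * hidden, [0.0] * hidden, [0.0] * hidden, 0.0
        for x, y in zip(xs, ys):
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
            err = b2[0] + sum(w2[j] * h[j] for j in range(hidden)) - y
            g_b2 += 2.0 * err / n
            for j in range(hidden):
                g_w2[j] += 2.0 * err * h[j] / n
                back = 2.0 * err * w2[j] * (1.0 - h[j] ** 2) / n
                g_w1[j] += back * x
                g_b1[j] += back
        for j in range(hidden):
            w1[j] -= lr * g_w1[j]
            b1[j] -= lr * g_b1[j]
            w2[j] -= lr * g_w2[j]
        b2[0] -= lr * g_b2
    return predict, initial, mse()


rng = random.Random(1)
observed = [rng.gauss(0.0, 0.7) for _ in range(500)]      # stand-in for observed data
reference = [rng.uniform(-2.5, 2.5) for _ in range(500)]  # stand-in for reference draws
grid = [-2.0 + 0.1 * i for i in range(41)]
labels = [statistical_energy(observed, reference, x) for x in grid]
energy_net, loss_before, loss_after = train_mlp(grid, labels)
```

The point of the second step is that the fitted network is a smooth, differentiable function of its input, so it can supply forces for continuous sampling, whereas the kernel estimate alone must revisit the training samples at every evaluation.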

**Fig. 1: Template-free protein design facilitated by explicit representation of the backbone-centred energy landscape.**

**Fig. 2: The de novo protein EXTD-3 integrates pre-existing and newly designed parts to form a single rigid architecture not yet observed in nature.**

**Fig. 3: Successfully designed two-layered α/β proteins and four-helix bundle proteins.**

**Fig. 4: Structures of successfully designed de novo proteins that fold into novel architectures.**

Data availability

Coordinates and structure files for designed proteins have been deposited to PDB under the following accession codes: 7DMF (EXTD-3), 7DKK (XM2H), 7DKO (AM2M), 7DGU (H4A1R), 7DGW (H4A2S), 7DGY (H4C2R), 7FBB (D12), 7FBC (D22) and 7FBD (D53). Other relevant data are available in the main text or the Supplementary Information.

Code availability

Executable computer programs, source code and model parameters for SCUBA and ABACUS2 are available for public download and free non-commercial use from https://doi.org/10.5281/zenodo.4533424.

References

1. Li, H., Helling, R., Tang, C. & Wingreen, N. Emergence of preferred structures in a simple model of protein folding. Science 273, 666–669 (1996).
2. England, J. L. & Shakhnovich, E. I. Structural determinant of protein designability. Phys. Rev. Lett. 90, 218101 (2003).
3. Hoang, T. X., Trovato, A., Seno, F., Banavar, J. R. & Maritan, A. Geometry and symmetry presculpt the free-energy landscape of proteins. Proc. Natl Acad. Sci. USA 101, 7960–7964 (2004).
4. Rose, G. D., Fleming, P. J., Banavar, J. R. & Maritan, A. A backbone-based theory of protein folding. Proc. Natl Acad. Sci. USA 103, 16623–16633 (2006).
5. Skolnick, J. & Gao, M. The role of local versus nonlocal physicochemical restraints in determining protein native structure. Curr. Opin. Struct. Biol. 68, 1–8 (2021).
6. Kuhlman, B. et al. Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364–1368 (2003).
7. Jiang, L. et al. De novo computational design of retro-aldol enzymes. Science 319, 1387–1391 (2008).
8. Koga, N. et al. Principles for designing ideal protein structures. Nature 491, 222–227 (2012).
9. Marcos, E. et al. Principles for designing proteins with cavities formed by curved β sheets. Science 355, 201–206 (2017).
10. Dou, J. et al. De novo design of a fluorescence-activating β-barrel. Nature 561, 485–491 (2018).
11. Lu, P. et al. Accurate computational design of multipass transmembrane proteins. Science 359, 1042–1046 (2018).
12. Glasgow, A. A. et al. Computational design of a modular protein sense–response system. Science 366, 1024–1028 (2019).
13. Huang, P. S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320–327 (2016).
14. Alford, R. F. et al. The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 13, 3031–3048 (2017).
15. Grigoryan, G. & DeGrado, W. F. Probing designability via a generalized model of helical bundle geometry. J. Mol. Biol. 405, 1079–1100 (2011).
16. Thomson, A. R. et al. Computational design of water-soluble α-helical barrels. Science 346, 485–488 (2014).
17. Brunette, T. J. et al. Exploring the repeat protein universe through computational protein design. Nature 528, 580–584 (2015).
18. Jacobs, T. et al. Design of structurally distinct proteins using strategies inspired by evolution. Science 352, 687–690 (2016).
19. Pan, X. et al. Expanding the space of protein geometries by computational design of de novo fold families. Science 369, 1132–1136 (2020).
20. Baker, D. An exciting but challenging road ahead for computational enzyme design. Protein Sci. 19, 1817–1819 (2010).
21. Otten, R. et al. How directed evolution reshapes the energy landscape in an enzyme to boost catalysis. Science 370, 1442–1446 (2020).
22. Zhang, Y., Hubner, I. A., Arakaki, A. K., Shakhnovich, E. & Skolnick, J. On the origin and highly likely completeness of single-domain protein structures. Proc. Natl Acad. Sci. USA 103, 2605–2610 (2006).
23. Kukic, P. et al. Mapping the protein fold universe using the CamTube force field in molecular dynamics simulations. PLoS Comput. Biol. 11, e1004435 (2015).
24. MacDonald, J. T., Maksimiak, K., Sadowski, M. I. & Taylor, W. R. De novo backbone scaffolds for protein design. Proteins Struct. Funct. Bioinf. 78, 1311–1325 (2010).
25. MacDonald, J. T. et al. Synthetic β-solenoid proteins with the fragment-free computational design of a β-hairpin extension. Proc. Natl Acad. Sci. USA 113, 10346–10351 (2016).
26. Van Gunsteren, W. F., Berendsen, H. J. C. & Rullmann, J. A. C. Stochastic dynamics for molecules with constraints: Brownian dynamics of n-alkanes. Mol. Phys. 44, 69–95 (1981).
27. Xiong, P. et al. Protein design with a comprehensive statistical energy function and boosted by experimental selection for foldability. Nat. Commun. 5, 5330 (2014).
28. Xiong, P. et al. Increasing the efficiency and accuracy of the ABACUS protein sequence design method. Bioinformatics 36, 136–144 (2020).
29. Behler, J. & Parrinello, M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett. 98, 146401 (2007).
30. Wang, G. & Dunbrack, R. L., Jr. PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res. 33, W94–W98 (2005).
31. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
32. Taylor, W. R. A ‘periodic table’ for protein structures. Nature 416, 657–662 (2002).
33. Baker, D. What has de novo protein design taught us about protein folding and biophysics? Protein Sci. 28, 678–683 (2019).
34. Liu, R., Wang, J., Xiong, P., Chen, Q. & Liu, H. De novo sequence redesign of a functional Ras-binding domain globally inverted the surface charge distribution and led to extreme thermostability. Biotechnol. Bioeng. 118, 2031–2042 (2021).

Acknowledgements

This work was supported by the National Key R&D Program of China (2018YFA0900703 to H.L. and 2018YFA0901600 to Q.C.), the National Natural Science Foundation of China (21773220 to H.L., 31971175 to Q.C. and 32090040 to J.Z.) and the Youth Innovation Promotion Association, Chinese Academy of Sciences (2017494 to Q.C.). We thank the staff from the Core Facility Centre for Life Sciences, USTC, and from the BL18U1, BL19U1 and BL02U1 beamlines of the National Facility for Protein Science in Shanghai (NFPS) and the Shanghai Synchrotron Radiation Facility for assistance during crystallographic data collection. We thank the USTC Supercomputing Center for computing resources. We thank Z. Zhu, F. Li, Y. Wang, M. Lv and Y. Yun for assistance with X-ray diffraction data collection and processing and T. Jin for sharing MBP expression plasmids.

Author information

These authors contributed equally: Bin Huang, Yang Xu, Xiuhong Hu

Authors and Affiliations

MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, Hefei National Laboratory for Physical Sciences at the Microscale, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
Bin Huang, Yang Xu, Xiuhong Hu, Yongrui Liu, Shanhui Liao, Jiahai Zhang, Chengdong Huang, Jingjun Hong, Quan Chen & Haiyan Liu
Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, China
Chengdong Huang, Quan Chen & Haiyan Liu
School of Data Science, University of Science and Technology of China, Hefei, China
Haiyan Liu

Contributions

H.L., B.H. and Y.X. developed computational models and code, and B.H., Y.X., X.H. and Q.C. performed protein design and experimental characterization. S.L. and Y.L. collected and analysed crystallographic data. J.H., J.Z. and C.H. collected and helped process NMR data. H.L. and Q.C. supervised the project. H.L., Q.C. and B.H. wrote the manuscript, and other authors were involved in discussion.

Corresponding authors

Correspondence to Quan Chen or Haiyan Liu.

Ethics declarations

Competing interests

H.L., Q.C., B.H., Y.X. and X.H. have filed a patent application (202111197820.0) relating to the template-free protein design method in the name of the University of Science and Technology of China. The other authors declare no competing interests.

Peer review

Peer review information

Nature thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Statistical energies learned by NC-NN capture correlations in high-dimensional space.

a–d, Scatter graphs showing projections of an NC-NN-learned term for the through-space interactions between two backbone positions. In total, the term depends on 14 variables. The variables used for projections include the Cα-Cα distance (a, b) between the two positions, and additionally the mainchain torsional angles φ₁ and ψ₁ at one position (c, d). Points are colored according to statistical energy values (in arbitrary units) as indicated by the color bar. Points in a and c correspond to observed configurations, while those in b and d correspond to configurations randomly drawn according to the reference distribution.

Extended Data Fig. 2 NC-NN-learned components in SCUBA and simulations of natural protein structures by SCUBA.

a, Types of NC-NN-learned statistical energy terms in SCUBA. b, The deviations of conformations sampled in SCUBA-driven SD simulations from native conformations for 33 natural proteins. Each protein was simulated for 900 ps at reduced temperature T_r = 1.0 and the r.m.s.d. values (noted as RMSD in the figure) are for mainchain atoms in secondary structures averaged over the last 50 ps. Simulations were carried out either with or without a radius of gyration (R_g) restraint, which, as described in the Supplementary Methods, was optionally applied in later backbone design simulations both to bias the sampling towards more compact structures and to compensate for thermal expansion in simulated annealing simulations involving higher temperatures. The restraint energy took the form \(E_{R_g\text{-restraint}}(R_g) = -k_{\text{res}} \ln\left(\frac{R_g}{R_g^0}\right)\) when \(R_g > R_g^0\) and \(E_{R_g\text{-restraint}}(R_g) = 0\) for \(R_g \le R_g^0\) (k_res = 300 in reduced energy units and \(R_g^0 = 5\ \text{Å}\)). This energy term leads to only weak compressing forces in comparison with the strong inter-atomic steric repulsions, and does not distort the tightly packed native-like minimum structures. The median r.m.s.d. values across the 33 proteins are 1.60 Å (native sequences, without R_g restraint, red bars), 1.25 Å (native sequences, with R_g restraint, orange bars), 2.78 Å (LVG sequences, without R_g restraint, blue bars), and 2.23 Å (LVG sequences, with R_g restraint, violet bars).
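As a rough illustration of the quantities in this caption, the sketch below computes a radius of gyration and evaluates the piecewise restraint expression. It reproduces the expression as printed in the caption, including its sign, and uses the stated values k_res = 300 and R_g^0 = 5 Å as defaults. The function names are hypothetical, and the equal weighting of all atoms in the radius of gyration is an assumption, since the caption does not say how R_g is weighted.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration (Å) of a list of (x, y, z) points.

    Equal weights for all points are an assumption of this sketch.
    """
    n = len(coords)
    centre = [sum(p[i] for p in coords) / n for i in range(3)]
    mean_sq = sum(sum((p[i] - centre[i]) ** 2 for i in range(3)) for p in coords) / n
    return math.sqrt(mean_sq)


def rg_restraint_energy(rg, k_res=300.0, rg0=5.0):
    """Piecewise restraint expression, with the sign exactly as printed in the caption."""
    if rg <= rg0:
        return 0.0  # no contribution at or below the threshold R_g^0
    return -k_res * math.log(rg / rg0)
```

For example, `rg_restraint_energy(radius_of_gyration(ca_coords))` would evaluate the term for a hypothetical list `ca_coords` of Cα coordinates.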

Extended Data Fig. 3 Generating initial backbone for a given sketch or topological architecture.

A sketch is represented as an abstracted architecture comprising regularly arranged layers of secondary structures, with the layers lying in parallel planes. From the abstraction, coordinates of starting or ending positions (indicated by “×”) of secondary structure segments are determined as regular grid points on parallel straight lines in different planes. The N to C directions of the segments are perpendicular to the lines. The approximate lengths of the segments may also be pre-specified. Peptide segments of the corresponding local conformations are then generated geometrically, using the coordinates of their terminal positions and the directions determined from the sketch. Connecting the segments with closed loops leads to the initial backbone structure to be used by SCUBA-driven SASD.
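The geometric placement described in this caption can be sketched roughly as follows. This is an illustrative reading of the caption, not the authors' code: the class and function names, the 10 Å spacings and the per-residue rise values (approximate textbook figures for helices and strands) are all assumptions. Each layer is taken to be the plane z = layer × spacing, the grid points lie on a line parallel to x in that plane, and each segment runs along y, perpendicular to that line.

```python
from dataclasses import dataclass

# Approximate rise per residue in Å; assumed values for this sketch only.
RISE_PER_RESIDUE = {"H": 1.5, "E": 3.3}


@dataclass
class Segment:
    ss_type: str    # "H" for helix, "E" for strand
    length: int     # approximate number of residues
    layer: int      # index of the plane the segment lies in
    slot: int       # index of the grid point along the line in that plane
    direction: int  # +1 or -1: N-to-C direction along the y axis


def segment_endpoints(seg, slot_spacing=10.0, layer_spacing=10.0):
    """Return (start, end) coordinates for one segment.

    One terminus sits on the grid point (y = 0). It is the start for
    direction +1 and the end for direction -1.
    """
    x = seg.slot * slot_spacing
    z = seg.layer * layer_spacing
    span = seg.length * RISE_PER_RESIDUE[seg.ss_type]
    on_line = (x, 0.0, z)
    off_line = (x, span, z)
    return (on_line, off_line) if seg.direction > 0 else (off_line, on_line)


# Hypothetical two-layer sketch: two antiparallel strands below one helix.
sketch = [
    Segment("E", 7, layer=0, slot=0, direction=+1),
    Segment("E", 7, layer=0, slot=1, direction=-1),
    Segment("H", 16, layer=1, slot=0, direction=+1),
]
endpoints = [segment_endpoints(s) for s in sketch]
```

Building the residues between each pair of end points and closing the connecting loops are separate steps that this sketch does not attempt.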

Extended Data Fig. 4 SCUBA-driven SASD produced backbones similar to natural proteins.

Different boxes correspond to different design sketches. From left to right in each box: initial backbone, optimized backbone, a stereo view of the optimized backbone superposed with the closest natural structure, and deviations of Cα atom positions between the designed and the closest natural backbones. In each box, the text string indicates the type, approximate size, and order of secondary structure segments of the corresponding sketch (“H” for helix, “E” for strand, and the subscripts indicate lengths). The closest natural structures with given PDB IDs and chain IDs were identified using Dali searches. The r.m.s.d. values (noted as RMSD in the figures) are of Cα atoms in aligned secondary structure elements.

Extended Data Fig. 5 Examples of backbone changes at different design stages.

a, Initial and optimized backbones for the H2E4 sketch, whose secondary structure sequence is E₇H₁₆E₇H₁₆E₇E₇ (“H” for helix, “E” for strand, and the subscripts indicate approximate lengths). The top row shows artificially constructed initial structures, while the bottom row shows substage-1 backbones optimized without sidechains (yellow) superimposed on substage-2 backbones optimized with LVG-simplified sidechains (violet). b, The r.m.s.d. values of mainchain atoms (in Å) between the successively generated structures at different design stages of backbone optimization or relaxation. The results have been averaged over the H2E4 designs (standard deviations are given in parentheses). The meanings of the notations are: “Init” for the initial structure, “Substage-1” for substage-1 backbones optimized without sidechains, “Substage-2” for substage-2 backbones optimized with LVG-simplified sidechains, and “Iter1” to “Iter3” for backbones relaxed with the designed sidechains in the sequence design-backbone relaxation iterations. c, ABACUS2 and Rosetta energies of ABACUS2-selected sequences for initial and SCUBA-optimized backbones of different topological architectures. The secondary structure compositions of the architectures are: 1: E₁₀E₁₀E₁₀E₁₀, 2: E₇H₁₆E₇E₇, 3: E₇E₇H₁₆E₇, 4: E₇H₁₆E₇E₇H₁₆E₇, 5: E₁₀H₂₀E₁₀H₂₀E₁₀E₁₀, 6: E₇H₁₆E₇H₁₆E₇E₇, 7: E₁₀H₂₀E₁₀E₁₀H₂₀E₁₀, 8: E₇H₁₆E₇E₇H₁₆E₇, 9: H₁₅H₁₅H₁₅, 10: H₂₁H₂₁H₂₁H₂₁. For each sketch, 10 initial backbones have been optimized to generate 10 optimized backbones. Sketch 10 led to optimized backbones of both left-handed and right-handed twists, as shown in two boxes in Extended Data Fig. 4. Each energy value has been averaged over 100 sequences selected on a group of 10 initial or optimized backbones (10 sequences selected using ABACUS2 for each backbone), with standard deviations between 0.08 and 0.38. Rosetta energies have been calculated on relaxed structures with selected sequences.
d, Amino acid usage frequencies in sequences selected with ABACUS2 on the H2E4 backbones at different optimization stages. Averaged values are shown separately for sequences designed using the substage-1 backbones optimized without any explicit sidechain (blue bars), using substage-2 backbones optimized with LVG-simplified sidechains (orange bars), and using backbones relaxed with the first-round ABACUS2-selected sidechains (gray bars); the sidechain atom radius parameters had been downscaled by a factor of 0.9 to introduce larger sidechains in the first round of sequence selection. The green bars correspond to the distribution in the training proteins.
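The sketch strings used in these captions (for example E₇H₁₆E₇H₁₆E₇E₇ for H2E4) are compact enough to parse mechanically. The helper below is a hypothetical convenience, not part of SCUBA or ABACUS2: it converts Unicode subscript digits to plain digits, splits the string into (type, approximate length) pairs, and counts helices, strands and residues in the secondary structure segments. Loop residues are not encoded in the string and so are not counted.

```python
import re

# Map Unicode subscript digits to ASCII digits so the regex can read them.
_SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")


def parse_sketch(sketch):
    """Split a sketch string into (type, approximate length) pairs."""
    plain = sketch.translate(_SUBSCRIPTS)
    segments = [(ss, int(n)) for ss, n in re.findall(r"([HE])(\d+)", plain)]
    # Reject strings containing anything the pattern did not consume.
    if "".join(f"{ss}{n}" for ss, n in segments) != plain:
        raise ValueError(f"unrecognised sketch string: {sketch!r}")
    return segments


def summarise(segments):
    """Return (number of helices, number of strands, residues in segments)."""
    helices = sum(1 for ss, _ in segments if ss == "H")
    strands = sum(1 for ss, _ in segments if ss == "E")
    residues = sum(n for _, n in segments)
    return helices, strands, residues


h2e4 = parse_sketch("E₇H₁₆E₇H₁₆E₇E₇")
```

Here `summarise(h2e4)` gives two helices and four strands, which matches the name H2E4.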

Extended Data Fig. 6 Effects of loop resampling and optimization.

a, The distribution of the per-residue SCUBA energy changes of loop residues caused by loop resampling and optimization. For the H2E4 backbone structures, the changes were calculated as the energies after loop re-optimization minus the energies before loop re-optimization. b, The distribution of the lowest r.m.s.d. values (noted as RMSD in the figure) of predicted structures from designed structures. c, The distribution of per-residue Rosetta energy of the lowest-r.m.s.d. predicted structures. For b and c, the predictions were carried out using Rosetta biased forward folding for sequences designed from the loop re-optimized H2E4 backbone structures (thinner blue lines), or for sequences designed from H2E4 backbone structures not yet subjected to loop re-optimization (thicker red lines).

Extended Data Fig. 7 Experimental characterizations of designed proteins.

a, X-ray data collection and refinement of crystal structure models. b, NMR ¹⁵N-¹H HSQC spectra of ten designed H2E4 proteins and three novel helical proteins. c, Size exclusion chromatography results of the designed H2E4 proteins XM2H (left) and AM2M (right) in solution. The chromatograms were obtained for samples purified by gel filtration, and the molecular weights were estimated from the peak positions. d, Circular dichroism spectroscopy of the designed proteins XM2H (top) and H4A1R (bottom) at different temperatures. The slowly varying temperature-dependent curves shown on the right suggest that there are only small changes in the secondary structure contents of these proteins over the temperature range from 25 to 95 °C. For XM2H, the helical content (calculated from the CD curves) decreased from 54.9% at 20 °C to 48.2% at 95 °C, while the β-sheet content changed from 9% to 11%. For H4A1R, the helical content changed from 85.2% at 20 °C to 71.8% at 95 °C.

Extended Data Fig. 8 The structures of the loops in the H2E4 and H4 proteins.

a–e, Superimpositions of experimentally determined structures (cyan) with corresponding designed structures (green) for loops in the designed proteins XM2H (a), AM2M (b), H4A1R (c), H4A2S (d), and H4C2R (e). The 2Fo-Fc (at 1.0 σ level) electron density surfaces are also shown. The r.m.s.d. values for mainchain atoms are displayed. f, The experimentally determined structures of the two H2E4 proteins are superimposed (XM2H in cyan and AM2M in orange) to show their different loop structures connecting similarly arranged secondary structure segments.

Extended Data Fig. 9 Designed backbone structures of the experimentally examined all-helical proteins in Batch 3.

We note that the average per-residue Rosetta energy of the proteins with experimentally solved structures (D12, D22 and D53) is −3.32 ± 0.07 (in arbitrary units), while for the remaining ten Batch-3 proteins, the same average value is −3.22 ± 0.14.

Extended Data Table 1 Summary of experimentally examined designs

Supplementary information

Supplementary Information

This file contains Supplementary Methods, references and Tables 1 and 2.

Reporting Summary

Peer Review File

About this article

Cite this article

Huang, B., Xu, Y., Hu, X. et al. A backbone-centred energy function of neural networks for protein design. Nature 602, 523–528 (2022). https://doi.org/10.1038/s41586-021-04383-5

Received: 05 March 2021
Accepted: 23 December 2021
Published: 09 February 2022
Issue Date: 17 February 2022
DOI: https://doi.org/10.1038/s41586-021-04383-5

This article is cited by

Tpgen: a language model for stable protein design with a specific topology structure
- Xiaoping Min
- Chongzhou Yang
- Ningshao Xia
BMC Bioinformatics (2024)
Deep learning for protein structure prediction and design—progress and applications
- Jürgen Jänes
- Pedro Beltrao
Molecular Systems Biology (2024)
Programmable synthetic receptors: the next-generation of cell and gene therapies
- Fei Teng
- Tongtong Cui
- Wei Li
Signal Transduction and Targeted Therapy (2024)
Opportunities and challenges in design and optimization of protein function
- Dina Listov
- Casper A. Goverde
- Sarel Jacob Fleishman
Nature Reviews Molecular Cell Biology (2024)
Sparks of function by de novo protein design
- Alexander E. Chu
- Tianyu Lu
- Po-Ssu Huang
Nature Biotechnology (2024)
