A backbone-centred energy function of neural networks for protein design

Abstract

A protein backbone structure is designable if a substantial number of amino acid sequences exist that autonomously fold into it (refs. 1,2). It has been suggested that the designability of backbones is governed mainly by side chain-independent or side chain type-insensitive molecular interactions (refs. 3–5), indicating an approach for designing new backbones (ready for amino acid selection) based on continuous sampling and optimization of the backbone-centred energy surface. However, a sufficiently comprehensive and precise energy function has yet to be established for this purpose. Here we show that this goal is met by a statistical model named SCUBA (for Side Chain-Unknown Backbone Arrangement) that uses neural network-form energy terms. These terms are learned with a two-step approach that comprises kernel density estimation followed by neural network training and can analytically represent multidimensional, high-order correlations in known protein structures. We report the crystal structures of nine de novo proteins whose backbones were designed to high precision using SCUBA, four of which have novel, non-natural overall architectures. By eschewing use of fragments from existing protein structures, SCUBA-driven structure design facilitates far-reaching exploration of the designable backbone space, thus extending the novelty and diversity of the proteins amenable to de novo design.

Fig. 1: Template-free protein design facilitated by explicit representation of the backbone-centred energy landscape.
Fig. 2: The de novo protein EXTD-3 integrates pre-existing and newly designed parts to form a single rigid architecture not yet observed in nature.
Fig. 3: Successfully designed two-layered α/β proteins and four-helix bundle proteins.
Fig. 4: Structures of successfully designed de novo proteins that fold into novel architectures.

Data availability

Coordinates and structure files for designed proteins have been deposited to PDB under the following accession codes: 7DMF (EXTD-3), 7DKK (XM2H), 7DKO (AM2M), 7DGU (H4A1R), 7DGW (H4A2S), 7DGY (H4C2R), 7FBB (D12), 7FBC (D22) and 7FBD (D53). Other relevant data are available in the main text or the Supplementary Information.

Code availability

Executable computer programs, source code and model parameters for SCUBA and ABACUS2 are available for public download and free non-commercial use from https://doi.org/10.5281/zenodo.4533424.

References

  1. Li, H., Helling, R., Tang, C. & Wingreen, N. Emergence of preferred structures in a simple model of protein folding. Science 273, 666–669 (1996).

  2. England, J. L. & Shakhnovich, E. I. Structural determinant of protein designability. Phys. Rev. Lett. 90, 218101 (2003).

  3. Hoang, T. X., Trovato, A., Seno, F., Banavar, J. R. & Maritan, A. Geometry and symmetry presculpt the free-energy landscape of proteins. Proc. Natl Acad. Sci. USA 101, 7960–7964 (2004).

  4. Rose, G. D., Fleming, P. J., Banavar, J. R. & Maritan, A. A backbone-based theory of protein folding. Proc. Natl Acad. Sci. USA 103, 16623–16633 (2006).

  5. Skolnick, J. & Gao, M. The role of local versus nonlocal physicochemical restraints in determining protein native structure. Curr. Opin. Struct. Biol. 68, 1–8 (2021).

  6. Kuhlman, B. et al. Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364–1368 (2003).

  7. Jiang, L. et al. De novo computational design of retro-aldol enzymes. Science 319, 1387–1391 (2008).

  8. Koga, N. et al. Principles for designing ideal protein structures. Nature 491, 222–227 (2012).

  9. Marcos, E. et al. Principles for designing proteins with cavities formed by curved β sheets. Science 355, 201–206 (2017).

  10. Dou, J. et al. De novo design of a fluorescence-activating β-barrel. Nature 561, 485–491 (2018).

  11. Lu, P. et al. Accurate computational design of multipass transmembrane proteins. Science 359, 1042–1046 (2018).

  12. Glasgow, A. A. et al. Computational design of a modular protein sense–response system. Science 366, 1024–1028 (2019).

  13. Huang, P. S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320–327 (2016).

  14. Alford, R. F. et al. The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 13, 3031–3048 (2017).

  15. Grigoryan, G. & DeGrado, W. F. Probing designability via a generalized model of helical bundle geometry. J. Mol. Biol. 405, 1079–1100 (2011).

  16. Thomson, A. R. et al. Computational design of water-soluble α-helical barrels. Science 346, 485–488 (2014).

  17. Brunette, T. J. et al. Exploring the repeat protein universe through computational protein design. Nature 528, 580–584 (2015).

  18. Jacobs, T. et al. Design of structurally distinct proteins using strategies inspired by evolution. Science 352, 687–690 (2016).

  19. Pan, X. et al. Expanding the space of protein geometries by computational design of de novo fold families. Science 369, 1132–1136 (2020).

  20. Baker, D. An exciting but challenging road ahead for computational enzyme design. Protein Sci. 19, 1817–1819 (2010).

  21. Otten, R. et al. How directed evolution reshapes the energy landscape in an enzyme to boost catalysis. Science 370, 1442–1446 (2020).

  22. Zhang, Y., Hubner, I. A., Arakaki, A. K., Shakhnovich, E. & Skolnick, J. On the origin and highly likely completeness of single-domain protein structures. Proc. Natl Acad. Sci. USA 103, 2605–2610 (2006).

  23. Kukic, P. et al. Mapping the protein fold universe using the CamTube force field in molecular dynamics simulations. PLoS Comput. Biol. 11, e1004435 (2015).

  24. MacDonald, J. T., Maksimiak, K., Sadowski, M. I. & Taylor, W. R. De novo backbone scaffolds for protein design. Proteins Struct. Funct. Bioinf. 78, 1311–1325 (2010).

  25. MacDonald, J. T. et al. Synthetic β-solenoid proteins with the fragment-free computational design of a β-hairpin extension. Proc. Natl Acad. Sci. USA 113, 10346–10351 (2016).

  26. Van Gunsteren, W. F., Berendsen, H. J. C. & Rullmann, J. A. C. Stochastic dynamics for molecules with constraints: Brownian dynamics of n-alkanes. Mol. Phys. 44, 69–95 (1981).

  27. Xiong, P. et al. Protein design with a comprehensive statistical energy function and boosted by experimental selection for foldability. Nat. Commun. 5, 5330 (2014).

  28. Xiong, P. et al. Increasing the efficiency and accuracy of the ABACUS protein sequence design method. Bioinformatics 36, 136–144 (2020).

  29. Behler, J. & Parrinello, M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett. 98, 146401 (2007).

  30. Wang, G. & Dunbrack, R. L., Jr. PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res. 33, W94–W98 (2005).

  31. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).

  32. Taylor, W. R. A ‘periodic table’ for protein structures. Nature 416, 657–662 (2002).

  33. Baker, D. What has de novo protein design taught us about protein folding and biophysics? Protein Sci. 28, 678–683 (2019).

  34. Liu, R., Wang, J., Xiong, P., Chen, Q. & Liu, H. De novo sequence redesign of a functional Ras-binding domain globally inverted the surface charge distribution and led to extreme thermostability. Biotechnol. Bioeng. 118, 2031–2042 (2021).


Acknowledgements

This work was supported by the National Key R&D Program of China (2018YFA0900703 to H.L. and 2018YFA0901600 to Q.C.), the National Natural Science Foundation of China (21773220 to H.L., 31971175 to Q.C. and 32090040 to J.Z.) and the Youth Innovation Promotion Association, Chinese Academy of Sciences (2017494 to Q.C.). We thank the staff from the Core Facility Centre for Life Sciences, USTC, and from the BL18U1, BL19U1 and BL02U1 beamlines of the National Facility for Protein Science in Shanghai (NFPS) and the Shanghai Synchrotron Radiation Facility for assistance during crystallographic data collection. We thank the USTC Supercomputing Center for computing resources. We thank Z. Zhu, F. Li, Y. Wang, M. Lv and Y. Yun for assistance with X-ray diffraction data collection and processing and T. Jin for sharing MBP expression plasmids.

Author information

Contributions

H.L., B.H. and Y.X. developed computational models and code, and B.H., Y.X., X.H. and Q.C. performed protein design and experimental characterization. S.L. and Y.L. collected and analysed crystallographic data. J.H., J.Z. and C.H. collected and helped process NMR data. H.L. and Q.C. supervised the project. H.L., Q.C. and B.H. wrote the manuscript, and other authors were involved in discussion.

Corresponding authors

Correspondence to Quan Chen or Haiyan Liu.

Ethics declarations

Competing interests

H.L., Q.C., B.H., Y.X. and X.H. have filed a patent application (202111197820.0) relating to the template-free protein design method in the name of the University of Science and Technology of China. The other authors declare no competing interests.

Peer review

Peer review information

Nature thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Statistical energies learned by NC-NN capture correlations in high-dimensional space.

a–d, Scatter graphs showing projections of an NC-NN-learned term for the through-space interactions between two backbone positions. In total, the term depends on 14 variables. The variables used for projections include the Cα-Cα distance (a, b) between the two positions, and additionally the mainchain torsional angles φ1 and ψ1 at one position (c, d). Points are colored according to statistical energy values (in arbitrary units) as indicated by the color bar. Points in a and c correspond to observed configurations, while those in b and d correspond to configurations randomly drawn according to the reference distribution.
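
As a rough illustration of how a statistical energy can contrast observed configurations with a reference distribution, the sketch below estimates two one-dimensional densities with Gaussian kernels and takes the negative log of their ratio. It is a toy under stated assumptions: the negative-log-ratio form, the single variable, the fixed bandwidth and the sample values are all hypothetical and chosen for illustration. The SCUBA terms themselves are multidimensional and are subsequently fitted by neural networks, which is not shown here.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` with Gaussian kernels."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

def statistical_energy(x, p_obs, p_ref, floor=1e-12):
    """Negative log ratio of observed to reference density (arbitrary units).

    The floor avoids log(0) where a density estimate vanishes (an assumption of this sketch).
    """
    return -math.log(max(p_obs(x), floor) / max(p_ref(x), floor))

# Made-up distances in Å: observed values cluster near 5.5, the reference is spread out.
observed = [5.2, 5.4, 5.5, 5.5, 5.6, 5.8]
reference = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
p_obs = gaussian_kde(observed, bandwidth=0.5)
p_ref = gaussian_kde(reference, bandwidth=0.5)

e_near = statistical_energy(5.5, p_obs, p_ref)  # region enriched in observed data
e_far = statistical_energy(8.0, p_obs, p_ref)   # region depleted in observed data
```

In this toy, regions enriched in observed configurations relative to the reference receive negative energies and depleted regions receive positive ones.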

Extended Data Fig. 2 NC-NN-learned components in SCUBA and simulations of natural protein structures by SCUBA.

a, Types of NC-NN-learned statistical energy terms in SCUBA. b, The deviations of conformations sampled in SCUBA-driven SD simulations from native conformations for 33 natural proteins. Each protein was simulated for 900 ps at reduced temperature Tr = 1.0 and the r.m.s.d. values (noted as RMSD in the figure) are for mainchain atoms in secondary structures averaged over the last 50 ps. Simulations were carried out either with or without a radius of gyration (Rg) restraint, which, as described in the Supplementary Methods, was optionally applied in later backbone design simulations both to bias the sampling towards more compact structures and to compensate for thermal expansion in simulated annealing simulations involving higher temperatures. The restraint energy took the form \(E_{R_g\text{-restraint}}(R_g) = k_{\mathrm{res}} \ln\left(\frac{R_g}{R_g^0}\right)\) when \(R_g > R_g^0\) and \(E_{R_g\text{-restraint}}(R_g) = 0\) for \(R_g \le R_g^0\) (\(k_{\mathrm{res}} = 300\) in reduced energy units and \(R_g^0 = 5\,\text{Å}\)). This energy term leads to only weak compressing forces in comparison with the strong inter-atomic steric repulsions, and does not distort the tightly packed native-like minimum structures. The median r.m.s.d. values across the 33 proteins are 1.60 Å (native sequences, without Rg restraint, red bars), 1.25 Å (native sequences, with Rg restraint, orange bars), 2.78 Å (LVG sequences, without Rg restraint, blue bars), and 2.23 Å (LVG sequences, with Rg restraint, violet bars).
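
The logarithmic restraint described in this caption can be sketched in a few lines. The values of kres and the threshold Rg are taken from the caption; everything else is an assumption of this sketch. In particular, the sign is chosen so that the energy increases with Rg above the threshold, which is what a compressing force requires, and Rg is computed with equal weights for all atoms. The function names are hypothetical and are not taken from the SCUBA code.

```python
import math

K_RES = 300.0  # restraint strength in reduced energy units (value from the caption)
RG0 = 5.0      # threshold radius of gyration in Å (value from the caption)

def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) points (equal weights assumed)."""
    n = len(coords)
    centre = [sum(p[i] for p in coords) / n for i in range(3)]
    mean_sq = sum(sum((p[i] - centre[i]) ** 2 for i in range(3)) for p in coords) / n
    return math.sqrt(mean_sq)

def rg_restraint_energy(rg, k_res=K_RES, rg0=RG0):
    """Zero at or below the threshold, growing logarithmically above it."""
    if rg <= rg0:
        return 0.0
    return k_res * math.log(rg / rg0)

def rg_restraint_force(rg, k_res=K_RES, rg0=RG0):
    """Generalized force -dE/dRg; negative values push towards smaller Rg."""
    if rg <= rg0:
        return 0.0
    return -k_res / rg

# Example: a structure with Rg = 10 Å feels a compressing force of magnitude k_res / Rg.
energy_at_10 = rg_restraint_energy(10.0)
force_at_10 = rg_restraint_force(10.0)
```

Because the force magnitude decays as 1/Rg, the restraint stays gentle for expanded structures, in line with the caption's remark that it produces only weak compressing forces.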

Extended Data Fig. 3 Generating initial backbone for a given sketch or topological architecture.

A sketch is represented as an abstracted architecture comprising regularly arranged layers of secondary structures, the layers in parallel planes. From the abstraction, coordinates of starting or ending positions (indicated by “×”) of secondary structure segments are determined as regular grid points on parallel straight lines in different planes. The N to C directions of the segments are perpendicular to the lines. The approximate lengths of the segments may also be pre-specified. Then peptide segments of corresponding local conformations are geometrically generated using coordinates of their terminal positions and directions determined from the sketch. Connecting the segments with closed loops leads to the initial backbone structure to be used by SCUBA-driven SASD.
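
The placement of segment termini on regular grid points can be illustrated with a minimal sketch. It assumes that each layer lies in a plane of constant z, that neighbouring segments in a layer are spaced regularly along x, and that each segment runs along +y or -y. The rise per residue, the spacing between segments and the layer gap are textbook-style illustrative values, not parameters taken from SCUBA. Building the peptide segments and closing the loops are not shown.

```python
from dataclasses import dataclass

RISE = {"H": 1.5, "E": 3.3}      # Å per residue along the segment axis (assumed values)
SPACING = {"H": 10.0, "E": 4.8}  # Å between neighbouring segments in a layer (assumed)
LAYER_GAP = 10.0                 # Å between parallel layer planes (assumed)

@dataclass
class Segment:
    kind: str     # "H" for helix, "E" for strand
    length: int   # number of residues
    start: tuple  # (x, y, z) of the N-terminal end
    end: tuple    # (x, y, z) of the C-terminal end

def place_layers(layers):
    """Place segment termini on grid points.

    `layers` is a list of layers; each layer is a list of
    (kind, n_residues, direction) with direction +1 or -1 for the
    N-to-C sense along y.
    """
    segments = []
    for layer_index, layer in enumerate(layers):
        z = layer_index * LAYER_GAP
        x = 0.0
        for kind, n_res, direction in layer:
            extent = RISE[kind] * (n_res - 1)
            y_start, y_end = (0.0, extent) if direction > 0 else (extent, 0.0)
            segments.append(Segment(kind, n_res, (x, y_start, z), (x, y_end, z)))
            x += SPACING[kind]
    return segments

# Hypothetical two-layer sketch: four antiparallel strands below two antiparallel helices.
sketch = [
    [("E", 7, +1), ("E", 7, -1), ("E", 7, +1), ("E", 7, -1)],
    [("H", 16, +1), ("H", 16, -1)],
]
segments = place_layers(sketch)
```

Each `Segment` then carries the terminal positions and direction from which a peptide segment of the corresponding local conformation could be generated.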

Extended Data Fig. 4 SCUBA-driven SASD produced backbones similar to natural proteins.

Different boxes correspond to different design sketches. From left to right in each box: initial backbone, optimized backbone, a stereo view of the optimized backbone superposed with the closest natural structure, and deviations of Cα atom positions between the designed and the closest natural backbones. In each box, the text string indicates the type, approximate size, and order of secondary structure segments of the corresponding sketch (“H” for helix, “E” for strand, and the subscripts indicate lengths). The closest natural structures with given PDB IDs and chain IDs were identified using Dali searches. The r.m.s.d. values (noted as RMSD in the figures) are of Cα atoms in aligned secondary structure elements.

Extended Data Fig. 5 Examples of backbone changes at different design stages.

a, Initial and optimized backbones for the H2E4 sketch, whose secondary structure sequence is E7H16E7H16E7E7 (“H” for helix, “E” for strand, and the subscripts indicate approximate lengths). The top row shows artificially constructed initial structures, while the bottom row shows substage-1 backbones optimized without sidechain (yellow) superimposed with substage-2 backbones optimized with LVG-simplified sidechains (violet). b, The r.m.s.d. values of mainchain atoms (in Å) between the successively generated structures at different design stages of backbone optimization or relaxation. The results have been averaged over the H2E4 designs (standard deviations are given in parentheses). The meanings of the notations are: “Init” for the initial structure, “Substage-1” for substage-1 backbones optimized without sidechains, “Substage-2” for substage-2 backbones optimized with LVG-simplified sidechains, and “Iter1” to “Iter3” for backbones relaxed with the designed sidechains in the sequence design-backbone relaxation iterations. c, ABACUS2 and Rosetta energies of ABACUS2-selected sequences for initial and SCUBA-optimized backbones of different topological architectures. The secondary structure compositions of the architectures are: 1: E10E10E10E10, 2: E7H16E7E7, 3: E7E7H16E7, 4: E7H16E7E7H16E7, 5: E10H20E10H20E10E10, 6: E7H16E7H16E7E7, 7: E10H20E10E10H20E10, 8: E7H16E7E7H16E7, 9: H15H15H15, 10: H21H21H21H21. For each sketch, 10 initial backbones have been optimized to generate 10 optimized backbones. Sketch 10 led to optimized backbones of both left-handed and right-handed twists, as shown in two boxes in Extended Data Fig. 4. Each energy value has been averaged over 100 sequences selected on a group of 10 initial or optimized backbones (10 sequences selected using ABACUS2 for each backbone), with standard deviations between 0.08 and 0.38. Rosetta energies have been calculated on relaxed structures with selected sequences. 
d, Amino acid usage frequencies in sequences selected with ABACUS2 on the H2E4 backbones at different optimization stages. Averaged values are shown separately for sequences designed using the substage-1 backbones optimized without any explicit sidechain (blue bars), using the substage-2 backbones optimized with LVG-simplified sidechains (orange bars), and using backbones relaxed with the first-round ABACUS2-selected sidechains (gray bars). In the first round of sequence selection, the sidechain atom radius parameters were scaled down by a factor of 0.9 to introduce larger sidechains. The green bars correspond to the distribution in the training proteins.

Extended Data Fig. 6 Effects of loop resampling and optimization.

a, The distribution of the per-residue SCUBA energy changes of loop residues caused by loop resampling and optimization. For the H2E4 backbone structures, the changes were calculated as the energies after loop re-optimization minus the energies before loop re-optimization. b, The distribution of the lowest r.m.s.d. values (noted as RMSD in the figure) of predicted structures from designed structures. c, The distribution of per-residue Rosetta energy of the lowest-r.m.s.d. predicted structures. For b and c, the predictions were carried out using Rosetta biased forward folding for sequences designed from the loop re-optimized H2E4 backbone structures (thinner blue lines), or for sequences designed from H2E4 backbone structures not yet subjected to loop re-optimization (thicker red lines).

Extended Data Fig. 7 Experimental characterizations of designed proteins.

a, X-ray data collection and refinement of crystal structure models. b, NMR 15N-1H HSQC spectra of ten designed H2E4 proteins and three novel helical proteins. c, Size exclusion chromatography results of the designed H2E4 proteins XM2H (left) and AM2M (right) in solution. The chromatograms were obtained for samples purified by gel filtration, and the molecular weights were estimated from the peak positions. d, Circular dichroism spectroscopy of the designed proteins XM2H (top) and H4A1R (bottom) at different temperatures. The slowly varying temperature-dependent curves shown on the right suggest that there are only small changes in the secondary structure contents of these proteins over the temperature range from 20 to 95 °C. For XM2H, the helical content (calculated from the CD curves) decreased from 54.9% at 20 °C to 48.2% at 95 °C, while the β-sheet content changed from 9% to 11%. For H4A1R, the helical content changed from 85.2% at 20 °C to 71.8% at 95 °C.

Extended Data Fig. 8 The structures of the loops in the H2E4 and H4 proteins.

a–e, Superimpositions of experimentally determined structures (cyan) with corresponding designed structures (green) for loops in the designed proteins XM2H (a), AM2M (b), H4A1R (c), H4A2S (d), and H4C2R (e). The 2Fo-Fc (at 1.0 σ level) electron density surfaces are also shown. The r.m.s.d. values for main chain atoms are displayed. f, The experimentally determined structures of the two H2E4 proteins are superimposed (XM2H in cyan and AM2M in orange) to show their different loop structures connecting similarly arranged secondary structure segments.

Extended Data Fig. 9 Designed backbone structures of the experimentally examined all-helical proteins in Batch 3.

We note that the average per-residue Rosetta energy of the proteins with experimentally solved structures (D12, D22 and D53) is −3.32 ± 0.07 (in arbitrary units), while for the remaining ten Batch-3 proteins, the same average value is −3.22 ± 0.14.

Extended Data Table 1 Summary of experimentally examined designs

Supplementary information

Supplementary Information

This file contains Supplementary Methods, references and Tables 1 and 2.

Reporting Summary

Peer Review File

About this article

Cite this article

Huang, B., Xu, Y., Hu, X. et al. A backbone-centred energy function of neural networks for protein design. Nature 602, 523–528 (2022). https://doi.org/10.1038/s41586-021-04383-5

  • DOI: https://doi.org/10.1038/s41586-021-04383-5
