
# A backbone-centred energy function of neural networks for protein design

## Abstract

A protein backbone structure is designable if a substantial number of amino acid sequences exist that autonomously fold into it (refs. 1,2). It has been suggested that the designability of backbones is governed mainly by side chain-independent or side chain type-insensitive molecular interactions (refs. 3–5), indicating an approach for designing new backbones (ready for amino acid selection) based on continuous sampling and optimization of the backbone-centred energy surface. However, a sufficiently comprehensive and precise energy function has yet to be established for this purpose. Here we show that this goal is met by a statistical model named SCUBA (for Side Chain-Unknown Backbone Arrangement) that uses neural network-form energy terms. These terms are learned with a two-step approach that comprises kernel density estimation followed by neural network training and can analytically represent multidimensional, high-order correlations in known protein structures. We report the crystal structures of nine de novo proteins whose backbones were designed to high precision using SCUBA, four of which have novel, non-natural overall architectures. By eschewing use of fragments from existing protein structures, SCUBA-driven structure design facilitates far-reaching exploration of the designable backbone space, thus extending the novelty and diversity of the proteins amenable to de novo design.

## Data availability

Coordinates and structure files for designed proteins have been deposited to PDB under the following accession codes: 7DMF (EXTD-3), 7DKK (XM2H), 7DKO (AM2M), 7DGU (H4A1R), 7DGW (H4A2S), 7DGY (H4C2R), 7FBB (D12), 7FBC (D22) and 7FBD (D53). Other relevant data are available in the main text or the Supplementary Information.

## Code availability

Executable computer programs, source code and model parameters for SCUBA and ABACUS2 are available for public download and free non-commercial use from https://doi.org/10.5281/zenodo.4533424.

## References

1. Li, H., Helling, R., Tang, C. & Wingreen, N. Emergence of preferred structures in a simple model of protein folding. Science 273, 666–669 (1996).

2. England, J. L. & Shakhnovich, E. I. Structural determinant of protein designability. Phys. Rev. Lett. 90, 218101 (2003).

3. Hoang, T. X., Trovato, A., Seno, F., Banavar, J. R. & Maritan, A. Geometry and symmetry presculpt the free-energy landscape of proteins. Proc. Natl Acad. Sci. USA 101, 7960–7964 (2004).

4. Rose, G. D., Fleming, P. J., Banavar, J. R. & Maritan, A. A backbone-based theory of protein folding. Proc. Natl Acad. Sci. USA 103, 16623–16633 (2006).

5. Skolnick, J. & Gao, M. The role of local versus nonlocal physicochemical restraints in determining protein native structure. Curr. Opin. Struct. Biol. 68, 1–8 (2021).

6. Kuhlman, B. et al. Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364–1368 (2003).

7. Jiang, L. et al. De novo computational design of retro-aldol enzymes. Science 319, 1387–1391 (2008).

8. Koga, N. et al. Principles for designing ideal protein structures. Nature 491, 222–227 (2012).

9. Marcos, E. et al. Principles for designing proteins with cavities formed by curved β sheets. Science 355, 201–206 (2017).

10. Dou, J. et al. De novo design of a fluorescence-activating β-barrel. Nature 561, 485–491 (2018).

11. Lu, P. et al. Accurate computational design of multipass transmembrane proteins. Science 359, 1042–1046 (2018).

12. Glasgow, A. A. et al. Computational design of a modular protein sense–response system. Science 366, 1024–1028 (2019).

13. Huang, P. S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320–327 (2016).

14. Alford, R. F. et al. The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 13, 3031–3048 (2017).

15. Grigoryan, G. & DeGrado, W. F. Probing designability via a generalized model of helical bundle geometry. J. Mol. Biol. 405, 1079–1100 (2011).

16. Thomson, A. R. et al. Computational design of water-soluble α-helical barrels. Science 346, 485–488 (2014).

17. Brunette, T. J. et al. Exploring the repeat protein universe through computational protein design. Nature 528, 580–584 (2015).

18. Jacobs, T. et al. Design of structurally distinct proteins using strategies inspired by evolution. Science 352, 687–690 (2016).

19. Pan, X. et al. Expanding the space of protein geometries by computational design of de novo fold families. Science 369, 1132–1136 (2020).

20. Baker, D. An exciting but challenging road ahead for computational enzyme design. Protein Sci. 19, 1817–1819 (2010).

21. Otten, R. et al. How directed evolution reshapes the energy landscape in an enzyme to boost catalysis. Science 370, 1442–1446 (2020).

22. Zhang, Y., Hubner, I. A., Arakaki, A. K., Shakhnovich, E. & Skolnick, J. On the origin and highly likely completeness of single-domain protein structures. Proc. Natl Acad. Sci. USA 103, 2605–2610 (2006).

23. Kukic, P. et al. Mapping the protein fold universe using the CamTube force field in molecular dynamics simulations. PLoS Comput. Biol. 11, e1004435 (2015).

24. MacDonald, J. T., Maksimiak, K., Sadowski, M. I. & Taylor, W. R. De novo backbone scaffolds for protein design. Proteins Struct. Funct. Bioinf. 78, 1311–1325 (2010).

25. MacDonald, J. T. et al. Synthetic β-solenoid proteins with the fragment-free computational design of a β-hairpin extension. Proc. Natl Acad. Sci. USA 113, 10346–10351 (2016).

26. Van Gunsteren, W. F., Berendsen, H. J. C. & Rullmann, J. A. C. Stochastic dynamics for molecules with constraints: Brownian dynamics of n-alkanes. Mol. Phys. 44, 69–95 (1981).

27. Xiong, P. et al. Protein design with a comprehensive statistical energy function and boosted by experimental selection for foldability. Nat. Commun. 5, 5330 (2014).

28. Xiong, P. et al. Increasing the efficiency and accuracy of the ABACUS protein sequence design method. Bioinformatics 36, 136–144 (2020).

29. Behler, J. & Parrinello, M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett. 98, 146401 (2007).

30. Wang, G. & Dunbrack, R. L., Jr. PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res. 33, W94–W98 (2005).

31. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).

32. Taylor, W. R. A ‘periodic table’ for protein structures. Nature 416, 657–662 (2002).

33. Baker, D. What has de novo protein design taught us about protein folding and biophysics? Protein Sci. 28, 678–683 (2019).

34. Liu, R., Wang, J., Xiong, P., Chen, Q. & Liu, H. De novo sequence redesign of a functional Ras-binding domain globally inverted the surface charge distribution and led to extreme thermostability. Biotechnol. Bioeng. 118, 2031–2042 (2021).

## Acknowledgements

This work was supported by the National Key R&D Program of China (2018YFA0900703 to H.L. and 2018YFA0901600 to Q.C.), the National Natural Science Foundation of China (21773220 to H.L., 31971175 to Q.C. and 32090040 to J.Z.) and the Youth Innovation Promotion Association, Chinese Academy of Sciences (2017494 to Q.C.). We thank the staff from the Core Facility Centre for Life Sciences, USTC, and from the BL18U1, BL19U1 and BL02U1 beamlines of the National Facility for Protein Science in Shanghai (NFPS) and the Shanghai Synchrotron Radiation Facility for assistance during crystallographic data collection. We thank the USTC Supercomputing Center for computing resources. We thank Z. Zhu, F. Li, Y. Wang, M. Lv and Y. Yun for assistance with X-ray diffraction data collection and processing and T. Jin for sharing MBP expression plasmids.

## Author information

### Contributions

H.L., B.H. and Y.X. developed computational models and code, and B.H., Y.X., X.H. and Q.C. performed protein design and experimental characterization. S.L. and Y.L. collected and analysed crystallographic data. J.H., J.Z. and C.H. collected and helped process NMR data. H.L. and Q.C. supervised the project. H.L., Q.C. and B.H. wrote the manuscript, and other authors were involved in discussion.

### Corresponding authors

Correspondence to Quan Chen or Haiyan Liu.

## Ethics declarations

### Competing interests

H.L., Q.C., B.H., Y.X. and X.H. have filed a patent application (202111197820.0) relating to the template-free protein design method in the name of the University of Science and Technology of China. The other authors declare no competing interests.

## Peer review

### Peer review information

Nature thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Extended data figures and tables

### Extended Data Fig. 1 Statistical energies learned by NC-NN capture correlations in high-dimensional space.

a–d, Scatter graphs showing projections of an NC-NN-learned term for the through-space interactions between two backbone positions. In total, the term depends on 14 variables. The variables used for projections include the Cα-Cα distance (a, b) between the two positions, and additionally the mainchain torsional angles φ1 and ψ1 at one position (c, d). Points are colored according to statistical energy values (in arbitrary units) as indicated by the color bar. Points in a and c correspond to observed configurations, while those in b and d correspond to configurations randomly drawn according to the reference distribution.

### Extended Data Fig. 2 NC-NN-learned components in SCUBA and simulations of natural protein structures by SCUBA.

a, Types of NC-NN-learned statistical energy terms in SCUBA. b, The deviations of conformations sampled in SCUBA-driven SD simulations from native conformations for 33 natural proteins. Each protein was simulated for 900 ps at reduced temperature Tr = 1.0 and the r.m.s.d. values (noted as RMSD in the figure) are for mainchain atoms in secondary structures averaged over the last 50 ps. Simulations were carried out either with or without a radius of gyration (Rg) restraint, which, as described in the Supplementary Methods, was optionally applied in later backbone design simulations both to bias the sampling towards more compact structures and to compensate for thermal expansion in simulated annealing simulations involving higher temperatures. The restraint energy took the form $$E_{Rg\text{-restraint}}(R_g) = -k_{\mathrm{res}} \ln\left(\frac{R_g}{R_g^0}\right)$$ when $$R_g > R_g^0$$ and $$E_{Rg\text{-restraint}}(R_g) = 0$$ for $$R_g \le R_g^0$$ ($$k_{\mathrm{res}} = 300$$ in reduced energy units and $$R_g^0 = 5\,\text{Å}$$). This energy term leads to only weak compressing forces in comparison with the strong inter-atomic steric repulsions, and does not distort the tightly packed native-like minimum structures. The median r.m.s.d. values across the 33 proteins are 1.60 Å (native sequences, without Rg restraint, red bars), 1.25 Å (native sequences, with Rg restraint, orange bars), 2.78 Å (LVG sequences, without Rg restraint, blue bars), and 2.23 Å (LVG sequences, with Rg restraint, violet bars).
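As an illustration of the piecewise restraint quoted in this caption, the following minimal Python sketch evaluates the radius of gyration of a set of points and the corresponding restraint energy. The values of kres and Rg0 are taken from the caption; the function names and the unweighted (not mass-weighted) definition of Rg are assumptions made here for illustration. The sketch transcribes the expression exactly as printed, sign included; the Supplementary Methods remain the authority on the sign convention and on how Rg is actually computed in SCUBA.

```python
import math

K_RES = 300.0  # restraint strength in reduced energy units (value from the caption)
RG0 = 5.0      # reference radius of gyration in angstroms (value from the caption)


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) points.

    Assumption: every atom carries equal weight; the caption does not say
    whether SCUBA uses mass weighting.
    """
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    mean_sq = sum(
        (p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2 for p in coords
    ) / n
    return math.sqrt(mean_sq)


def rg_restraint_energy(rg, k_res=K_RES, rg0=RG0):
    """Piecewise restraint energy, transcribed as printed in the caption.

    Returns 0 for rg <= rg0 and -k_res * ln(rg / rg0) otherwise.
    """
    if rg <= rg0:
        return 0.0
    return -k_res * math.log(rg / rg0)


if __name__ == "__main__":
    points = [(0.0, 0.0, 0.0), (12.0, 0.0, 0.0), (0.0, 12.0, 0.0), (0.0, 0.0, 12.0)]
    rg = radius_of_gyration(points)
    print(f"Rg = {rg:.3f} A, restraint energy = {rg_restraint_energy(rg):.3f}")
```

The restraint is inactive for any structure at or below the reference radius, so it only contributes once a backbone expands beyond Rg0.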

### Extended Data Fig. 3 Generating initial backbone for a given sketch or topological architecture.

A sketch is represented as an abstracted architecture comprising regularly arranged layers of secondary structures, the layers in parallel planes. From the abstraction, coordinates of starting or ending positions (indicated by “×”) of secondary structure segments are determined as regular grid points on parallel straight lines in different planes. The N to C directions of the segments are perpendicular to the lines. The approximate lengths of the segments may also be pre-specified. Then peptide segments of corresponding local conformations are geometrically generated using coordinates of their terminal positions and directions determined from the sketch. Connecting the segments with closed loops leads to the initial backbone structure to be used by SCUBA-driven SASD.
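The geometric idea in this caption (terminal positions placed on regular grid points along parallel lines in parallel planes, with segment axes perpendicular to those lines) can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' code: the `Segment` class, the spacing constants and the slot assignments are all assumptions, and only the approximate rise per residue (about 1.5 Å for an α-helix and about 3.3 Å for a β-strand) is standard structural knowledge.

```python
from dataclasses import dataclass

# Assumed geometric constants, chosen only for illustration.
RISE = {"H": 1.5, "E": 3.3}               # approximate rise per residue (angstroms)
IN_LAYER_SPACING = {"H": 10.0, "E": 4.8}  # assumed spacing between neighbours in a layer
LAYER_SPACING = 10.0                      # assumed distance between parallel planes


@dataclass
class Segment:
    ss_type: str  # "H" for helix, "E" for strand
    length: int   # approximate number of residues
    layer: int    # index of the plane that holds the segment
    slot: int     # index of the grid point along the line in that plane
    up: bool      # True if the N-to-C direction points along +y


def segment_endpoints(seg):
    """Return (start, end) coordinates of a segment axis.

    The grid line of each layer runs along x at y = 0; segment axes run
    along y, perpendicular to that line. An "up" segment starts on the
    grid point, a "down" segment ends on it.
    """
    x = seg.slot * IN_LAYER_SPACING[seg.ss_type]
    z = seg.layer * LAYER_SPACING
    extent = seg.length * RISE[seg.ss_type]
    on_line = (x, 0.0, z)
    off_line = (x, extent, z)
    return (on_line, off_line) if seg.up else (off_line, on_line)


if __name__ == "__main__":
    # Hypothetical two-layer sketch: four strands in layer 0, two helices in layer 1.
    sketch = [
        Segment("E", 7, 0, 0, True),
        Segment("H", 16, 1, 0, False),
        Segment("E", 7, 0, 1, True),
        Segment("H", 16, 1, 1, False),
        Segment("E", 7, 0, 2, True),
        Segment("E", 7, 0, 3, False),
    ]
    for seg in sketch:
        print(seg.ss_type, seg.length, segment_endpoints(seg))
```

The resulting endpoints and directions would then seed the geometric construction of each peptide segment; loop closure between segments is a separate step not covered here.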

### Extended Data Fig. 4 SCUBA-driven SASD produced backbones similar to natural proteins.

Different boxes correspond to different design sketches. From left to right in each box: initial backbone, optimized backbone, a stereo view of the optimized backbone superposed with the closest natural structure, and deviations of Cα atom positions between the designed and the closest natural backbones. In each box, the text string indicates the type, approximate size, and order of secondary structure segments of the corresponding sketch (“H” for helix, “E” for strand, and the subscripts indicate lengths). The closest natural structures with given PDB IDs and chain IDs were identified using Dali searches. The r.m.s.d. values (noted as RMSD in the figures) are of Cα atoms in aligned secondary structure elements.
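For readers who want to reproduce an r.m.s.d. of this kind, the sketch below computes the r.m.s.d. between two sets of paired Cα coordinates after optimal rigid-body superposition using the standard Kabsch algorithm. It assumes that the residue pairing has already been obtained elsewhere (for example from a structural alignment such as the Dali searches mentioned in the caption); the function name `kabsch_rmsd` and the use of NumPy are choices made here, not part of the original work.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after optimal superposition.

    Row i of P is assumed to be paired with row i of Q.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Remove the translational component by centring both sets.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # Kabsch: SVD of the 3x3 covariance matrix gives the optimal rotation.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    diff = (R @ Pc.T).T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))


if __name__ == "__main__":
    P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    Q = P @ Rz.T + np.array([1.0, 2.0, 3.0])  # rotated and translated copy of P
    print(kabsch_rmsd(P, Q))
```

The determinant check matters for proteins: without it, a mirror-image backbone could be superposed with zero deviation, which would hide a handedness error.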

### Extended Data Fig. 5 Examples of backbone changes at different design stages.

a, Initial and optimized backbones for the H2E4 sketch, whose secondary structure sequence is E7H16E7H16E7E7 (“H” for helix, “E” for strand, and the subscripts indicate approximate lengths). The top row shows artificially constructed initial structures, while the bottom row shows substage-1 backbones optimized without sidechain (yellow) superimposed with substage-2 backbones optimized with LVG-simplified sidechains (violet). b, The r.m.s.d. values of mainchain atoms (in Å) between the successively generated structures at different design stages of backbone optimization or relaxation. The results have been averaged over the H2E4 designs (standard deviations are given in parentheses). The meanings of the notations are: “Init” for the initial structure, “Substage-1” for substage-1 backbones optimized without sidechains, “Substage-2” for substage-2 backbones optimized with LVG-simplified sidechains, and “Iter1” to “Iter3” for backbones relaxed with the designed sidechains in the sequence design-backbone relaxation iterations. c, ABACUS2 and Rosetta energies of ABACUS2-selected sequences for initial and SCUBA-optimized backbones of different topological architectures. The secondary structure compositions of the architectures are: 1: E10E10E10E10, 2: E7H16E7E7, 3: E7E7H16E7, 4: E7H16E7E7H16E7, 5: E10H20E10H20E10E10, 6: E7H16E7H16E7E7, 7: E10H20E10E10H20E10, 8: E7H16E7E7H16E7, 9: H15H15H15, 10: H21H21H21H21. For each sketch, 10 initial backbones have been optimized to generate 10 optimized backbones. Sketch 10 led to optimized backbones of both left-handed and right-handed twists, as shown in two boxes in Extended Data Fig. 4. Each energy value has been averaged over 100 sequences selected on a group of 10 initial or optimized backbones (10 sequences selected using ABACUS2 for each backbone), with standard deviations between 0.08 and 0.38. Rosetta energies have been calculated on relaxed structures with selected sequences. 
d, Amino acid usage frequencies in sequences selected with ABACUS2 on the H2E4 backbones at different optimization stages. Averaged values are shown separately for sequences designed using the substage-1 backbones optimized without any explicit sidechain (blue bars), using substage-2 backbones optimized with LVG-simplified sidechains (orange bars), and using backbones relaxed with the first-round ABACUS2-selected sidechains (gray bars) (the sidechain atom radius parameters had been scaled down by a factor of 0.9 to introduce larger sidechains in the first round of sequence selection). The green bars correspond to the distribution in the training proteins.

### Extended Data Fig. 6 Effects of loop resampling and optimization.

a, The distribution of the per-residue SCUBA energy changes of loop residues caused by loop resampling and optimization. For the H2E4 backbone structures, the changes were calculated as the energies after loop re-optimization minus the energies before loop re-optimization. b, The distribution of the lowest r.m.s.d. values (noted as RMSD in the figure) of predicted structures from designed structures. c, The distribution of per-residue Rosetta energy of the lowest-r.m.s.d. predicted structures. For b and c, the predictions were carried out using Rosetta biased forward folding for sequences designed from the loop re-optimized H2E4 backbone structures (thinner blue lines), or for sequences designed from H2E4 backbone structures not yet subjected to loop re-optimization (thicker red lines).

### Extended Data Fig. 7 Experimental characterizations of designed proteins.

a, X-ray data collection and refinement of crystal structure models. b, NMR 15N-1H HSQC spectra of ten designed H2E4 proteins and three novel helical proteins. c, Size exclusion chromatography results of the designed H2E4 proteins XM2H (left) and AM2M (right) in solution. The chromatograms were obtained for samples purified by gel filtration, and the molecular weights were estimated from the peak positions. d, Circular dichroism spectroscopy of the designed proteins XM2H (top) and H4A1R (bottom) at different temperatures. The slowly varying temperature-dependent curves shown on the right suggest that there are only small changes in the secondary structure contents of these proteins over the temperature range from 25 to 95 °C. For XM2H, its helical content (calculated from the CD curves) decreased from 54.9% at 20 °C to 48.2% at 95 °C, while its β-sheet content changed from 9% to 11%. For H4A1R, its helical content changed from 85.2% at 20 °C to 71.8% at 95 °C.

### Extended Data Fig. 8 The structures of the loops in the H2E4 and H4 proteins.

a–e, Superimpositions of experimentally determined structures (cyan) with corresponding designed structures (green) for loops in the designed proteins XM2H (a), AM2M (b), H4A1R (c), H4A2S (d), and H4C2R (e). The 2Fo-Fc (at 1.0 σ level) electron density surfaces are also shown. The r.m.s.d. values for main chain atoms are displayed. f, The experimentally determined structures of the two H2E4 proteins are superimposed (XM2H in cyan and AM2M in orange) to show their different loop structures connecting similarly arranged secondary structure segments.

### Extended Data Fig. 9 Designed backbone structures of the experimentally examined all-helical proteins in Batch 3.

We note that the average per-residue Rosetta energy of the proteins with experimentally solved structures (D12, D22 and D53) is −3.32 ± 0.07 (in arbitrary units), while for the remaining ten Batch-3 proteins, the same average value is −3.22 ± 0.14.

## Supplementary information

### Supplementary Information

This file contains Supplementary Methods, references and Tables 1 and 2.

## Rights and permissions

Huang, B., Xu, Y., Hu, X. et al. A backbone-centred energy function of neural networks for protein design. Nature 602, 523–528 (2022). https://doi.org/10.1038/s41586-021-04383-5
