Abstract
There has been considerable recent progress in protein structure prediction using deep neural networks to predict inter-residue distances from amino acid sequences (refs. 1–3). Here we investigate whether the information captured by such networks is sufficiently rich to generate new folded proteins with sequences unrelated to those of the naturally occurring proteins used in training the models. We generate random amino acid sequences, and input them into the trRosetta structure prediction network to predict starting residue–residue distance maps, which, as expected, are quite featureless. We then carry out Monte Carlo sampling in amino acid sequence space, optimizing the contrast (Kullback–Leibler divergence) between the inter-residue distance distributions predicted by the network and background distributions averaged over all proteins. Optimization from different random starting points resulted in novel proteins spanning a wide range of sequences and predicted structures. We obtained synthetic genes encoding 129 of the network-‘hallucinated’ sequences, and expressed and purified the proteins in Escherichia coli; 27 of the proteins yielded monodisperse species with circular dichroism spectra consistent with the hallucinated structures. We determined the three-dimensional structures of three of the hallucinated proteins, two by X-ray crystallography and one by NMR, and these closely matched the hallucinated models. Thus, deep networks trained to predict native protein structures from their sequences can be inverted to design new proteins, and such networks and methods should contribute alongside traditional physics-based models to the de novo design of proteins with new functions.
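The sampling loop summarized in the abstract can be illustrated with a minimal Python sketch. It is not the trDesign implementation: `toy_predictor` is a hypothetical stand-in for the trRosetta network, the background is assumed to be uniform over distance bins rather than averaged over proteins, the Metropolis acceptance rule is one assumed choice of Monte Carlo scheme, and the sequence length, number of steps and fixed temperature are arbitrary illustrative values. Only the overall idea (mutate the sequence, score the Kullback–Leibler divergence between predicted and background distance distributions, keep changes that increase it) follows the text.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
N_BINS = 8                   # number of distance bins (arbitrary for this sketch)

# Fixed random table used by the toy predictor below (hypothetical, not trRosetta).
_W = np.random.default_rng(0).normal(size=(20, 20, N_BINS))


def toy_predictor(idx):
    """Hypothetical stand-in for the network.

    Maps an integer-encoded sequence of length L to an (L, L, N_BINS) array
    in which every [i, j] entry is a probability distribution over bins.
    """
    logits = _W[idx[:, None], idx[None, :]]
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)


def kl_contrast(pred, bkg, eps=1e-8):
    """Mean over residue pairs of D_KL(pred || bkg)."""
    kl = np.sum(pred * (np.log(pred + eps) - np.log(bkg + eps)), axis=-1)
    return float(kl.mean())


def hallucinate(length=30, steps=500, temperature=0.05, seed=1):
    """Metropolis Monte Carlo in sequence space, maximizing the KL contrast.

    Returns (best sequence, starting score, best score).
    """
    rng = np.random.default_rng(seed)
    seq = rng.integers(0, 20, size=length)            # random starting sequence
    bkg = np.full((length, length, N_BINS), 1.0 / N_BINS)  # assumed uniform background
    score = kl_contrast(toy_predictor(seq), bkg)
    start = score
    best_seq, best = seq.copy(), score
    for _ in range(steps):
        trial = seq.copy()
        trial[rng.integers(length)] = rng.integers(20)  # single point mutation
        new = kl_contrast(toy_predictor(trial), bkg)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new >= score or rng.random() < np.exp((new - score) / temperature):
            seq, score = trial, new
            if score > best:
                best_seq, best = seq.copy(), score
    return "".join(AA[i] for i in best_seq), start, best


if __name__ == "__main__":
    sequence, start, best = hallucinate()
    print(sequence)
    print(f"KL contrast: {start:.3f} -> {best:.3f}")
```

Running the sketch from several seeds gives different final sequences, which mirrors, in a very reduced form, the observation that optimization from different random starting points yields diverse designs.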
Data availability
The atomic coordinates of the crystal structures for designs 0217 and 0738_mod, as well as the NMR structure for design 0515, have been deposited in the RCSB Protein Data Bank with the accession numbers 7K3H, 7M0Q and 7M5T, respectively. NMR chemical shifts, NOESY peak lists and spectral data have been deposited in the BioMagResDB (BMRB ID 30890). Amino acid sequences and structure models for all 2,000 designs described in the manuscript are freely available for download at https://files.ipd.uw.edu/pub/trRosetta/hallucinations2K.tar.gz. Amino acid sequences and 3D structures of the generated designs were compared to known protein sequences and structures in UniProt (https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2017_12/uniref/) and the Protein Data Bank (11 March 2020), respectively.
Code availability
The computer code used to generate the hallucinated proteins described in the manuscript has been made publicly available as part of the trDesign GitHub package (https://github.com/gjoni/trDesign); the corresponding structural models were generated by the trRosetta structure modelling script, available for free download at https://yanglab.nankai.edu.cn/trRosetta/download/. The Rosetta software suite was used to perform ab initio prediction calculations. Rosetta is freely available for academic users on GitHub, and can be licensed for commercial use through the University of Washington CoMotion Express License Program.
References
Xu, J. Distance-based protein folding powered by deep learning. Proc. Natl Acad. Sci. USA 116, 16856–16865 (2019).
Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).
Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).
Madani, A. et al. ProGen: language modeling for protein generation. Preprint at https://arxiv.org/abs/2004.03497 (2020).
Anand, N., Eguchi, R. & Huang, P. S. Fully differentiable full-atom protein backbone generation. In ICLR 2019 Workshop https://openreview.net/forum?id=SJxnVL8YOV (2019).
Wang, J., Cao, H., Zhang, J. Z. H. & Qi, Y. Computational protein design with deep learning neural networks. Sci. Rep. 8, 6349 (2018).
Ingraham, J., Garg, V. K., Barzilay, R. & Jaakkola, T. Generative models for graph-based protein design. In ICLR 2019 Workshop https://openreview.net/forum?id=SJgxrLLKOE (2019).
Anand, N., Eguchi, R. R., Derry, A., Altman, R. B. & Huang, P.-S. Protein sequence design with a learned potential. Preprint at https://doi.org/10.1101/2020.01.06.895466 (2020).
Strokach, A., Becerra, D., Corbi-Verge, C., Perez-Riba, A. & Kim, P. M. Fast and flexible protein design using deep graph neural networks. Cell Syst. 11, 402–411.e4 (2020).
Karimi, M., Zhu, S., Cao, Y. & Shen, Y. De novo protein design for novel folds using guided conditional Wasserstein generative adversarial networks. J. Chem. Inf. Model. 60, 5667–5681 (2020).
Davidsen, K. et al. Deep generative models for T cell receptor protein sequences. eLife 8, e46935 (2019).
Costello, Z. & Martin, H. G. How to hallucinate functional proteins. Preprint at https://arxiv.org/abs/1903.00458 (2019).
Eguchi, R. R., Anand, N., Choe, C. A. & Huang, P.-S. IG-VAE: generative modeling of immunoglobulin proteins by direct 3D coordinate generation. Preprint at https://doi.org/10.1101/2020.08.07.242347 (2020).
Repecka, D. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat. Mach. Intell. 3, 324–333 (2021).
Hawkins-Hooker, A. et al. Generating functional protein variants with variational autoencoders. PLoS Comput. Biol. 17, e1008736 (2021).
Senior, A. W. et al. Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins 87, 1141–1148 (2019).
Mordvintsev, A., Olah, C. & Tyka, M. Inceptionism: going deeper into neural networks. Google AI Blog https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html (2015).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Rohl, C. A., Strauss, C. E. M., Misura, K. M. S. & Baker, D. Protein structure prediction using Rosetta. Methods Enzymol. 383, 66–93 (2004).
Park, H. et al. Simultaneous optimization of biomolecular energy functions on features from small molecules and macromolecules. J. Chem. Theory Comput. 12, 6201–6212 (2016).
Rossi, P. et al. A microscale protein NMR sample screening pipeline. J. Biomol. NMR 46, 11–22 (2010).
Koga, N. et al. Principles for designing ideal protein structures. Nature 491, 222–227 (2012).
Dou, J. et al. De novo design of a fluorescence-activating β-barrel. Nature 561, 485–491 (2018).
Norn, C. et al. Protein sequence design by conformational landscape optimization. Proc. Natl Acad. Sci. USA 118, e2017228118 (2021).
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Wang, J. et al. Deep learning methods for designing proteins scaffolding functional sites. Preprint at https://doi.org/10.1101/2021.11.10.468128 (2021).
Jendrusch, M., Korbel, J. O. & Sadiq, S. K. AlphaDesign: A de novo protein design framework based on AlphaFold. Preprint at https://doi.org/10.1101/2021.10.11.463937 (2021).
Tischer, D. et al. Design of proteins presenting discontinuous functional sites using deep learning. Preprint at https://doi.org/10.1101/2020.11.29.402743 (2020).
Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
Studier, F. W. Protein production by auto-induction in high density shaking cultures. Protein Expr. Purif. 41, 207–234 (2005).
Pace, C. N., Vajdos, F., Fee, L., Grimsley, G. & Gray, T. How to measure and predict the molar absorption coefficient of a protein. Protein Sci. 4, 2411–2423 (1995).
Acton, T. B. et al. Preparation of protein samples for NMR structure, function, and small-molecule screening studies. Methods Enzymol. 493, 21–60 (2011).
Xiao, R. et al. The high-throughput protein sample production platform of the Northeast Structural Genomics Consortium. J. Struct. Biol. 172, 21–33 (2010).
Jansson, M. et al. High-level production of uniformly 15N-and 13C-enriched fusion proteins in Escherichia coli. J. Biomol. NMR 7, 131–141 (1996).
Ottiger, M., Delaglio, F. & Bax, A. Measurement of J and dipolar couplings from simplified two-dimensional NMR spectra. J. Magn. Reson. 131, 373–378 (1998).
Delaglio, F. et al. NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J. Biomol. NMR 6, 277–293 (1995).
Lee, W., Tonelli, M. & Markley, J. L. NMRFAM-SPARKY: enhanced software for biomolecular NMR spectroscopy. Bioinformatics 31, 1325–1327 (2015).
Favier, A. & Brutscher, B. NMRlib: user-friendly pulse sequence tools for Bruker NMR spectrometers. J. Biomol. NMR 73, 199–211 (2019).
Hyberts, S. G., Milbradt, A. G., Wagner, A. B., Arthanari, H. & Wagner, G. Application of iterative soft thresholding for fast reconstruction of NMR data non-uniformly sampled with multidimensional Poisson gap scheduling. J. Biomol. NMR 52, 315–327 (2012).
Ying, J., Delaglio, F., Torchia, D. A. & Bax, A. Sparse multidimensional iterative lineshape-enhanced (SMILE) reconstruction of both non-uniformly sampled and conventional NMR data. J. Biomol. NMR 68, 101–118 (2017).
Lee, W. et al. I-PINE web server: an integrative probabilistic NMR assignment system for proteins. J. Biomol. NMR 73, 213–222 (2019).
Moseley, H. N. B., Sahota, G. & Montelione, G. T. Assignment validation software suite for the evaluation and presentation of protein resonance assignment data. J. Biomol. NMR 28, 341–355 (2004).
Shen, Y. & Bax, A. Protein backbone and sidechain torsion angles predicted from NMR chemical shifts using artificial neural networks. J. Biomol. NMR 56, 227–241 (2013).
Güntert, P., Mumenthaler, C. & Wüthrich, K. Torsion angle dynamics for NMR structure calculation with the new program DYANA. J. Mol. Biol. 273, 283–298 (1997).
Herrmann, T., Güntert, P. & Wüthrich, K. Protein NMR structure determination with automated NOE-identification in the NOESY spectra using the new software ATNOS. J. Biomol. NMR 24, 171–189 (2002).
Huang, Y. J., Powers, R. & Montelione, G. T. Protein NMR recall, precision, and F-measure scores (RPF scores): structure quality assessment measures based on information retrieval statistics. J. Am. Chem. Soc. 127, 1665–1674 (2005).
Huang, Y. J., Tejero, R., Powers, R. & Montelione, G. T. A topology-constrained distance network algorithm for protein structure determination from NOESY data. Proteins 62, 587–603 (2006).
Brünger, A. T. et al. Crystallography & NMR system: A new software suite for macromolecular structure determination. Acta Crystallogr. D 54, 905–921 (1998).
Bhattacharya, A., Tejero, R. & Montelione, G. T. Evaluating protein structures determined by structural genomics consortia. Proteins 66, 778–795 (2007).
Otwinowski, Z. & Minor, W. Processing of X-ray diffraction data collected in oscillation mode. Methods Enzymol. 276, 307–326 (1997).
McCoy, A. J. et al. Phaser crystallographic software. J. Appl. Crystallogr. 40, 658–674 (2007).
DiMaio, F. et al. Improved low-resolution crystallographic refinement with Phenix and Rosetta. Nat. Methods 10, 1102–1104 (2013).
Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. Features and development of Coot. Acta Crystallogr. D 66, 486–501 (2010).
Liebschner, D. et al. Macromolecular structure determination using X-rays, neutrons and electrons: recent developments in Phenix. Acta Crystallogr. D 75, 861–877 (2019).
Theobald, D. L. & Wuttke, D. S. Accurate structural correlations from maximum likelihood superpositions. PLoS Comput. Biol. 4, e43 (2008).
The PyMOL Molecular Graphics System version 2.4 (Schrödinger, 2021).
Zweckstetter, M. NMR: prediction of molecular alignment from structure using the PALES software. Nat. Protoc. 3, 679–690 (2008).
Montelione, G. T. & Wagner, G. 2D Chemical exchange NMR spectroscopy by proton-detected heteronuclear correlation. J. Am. Chem. Soc. 111, 3096–3098 (1989).
Acknowledgements
We thank R. Xiao, G. Liu and A. Wu (Nexomics Biosciences) for assistance in initial NMR protein production; J. Aramini for assistance with NMR data collection for initial HSQC screening; R. Ballard and X. Li for mass spectrometry assistance; and R. Divine and R. Kibler for AKTA scripting. This work was funded by grants from the NSF (DBI 1937533 to D.B. and I.A., and MCB 2032259 to S.O.), the NIH (DP5OD026389 to S.O.), Open Philanthropy (C.C. and A.B.), Eric and Wendy Schmidt by recommendation of the Schmidt Futures program (F.D. and L.C.), the Audacious Project (A.K.), the Washington Research Foundation (S.J.P.) and Novo Nordisk Foundation grant NNF17OC0030446 (C.N.). This work was also supported in part by NIH grants R01 GM120574 (G.T.M.) and R35GM141818 (G.T.M.), and the Howard Hughes Medical Institute (D.B. and T.M.C.). We also acknowledge computing resources provided by the Hyak supercomputer system funded by the STF at the University of Washington, and Rosetta@Home volunteers for ab initio structure prediction calculations, and thank staff at the Northeastern Collaborative Access Team at the Advanced Photon Source for the beamline, supported by NIH grants P30GM124165 and S10OD021527, and DOE contract DE-AC02-06CH11357. We acknowledge the NMR Core Facility resources at Rensselaer Polytechnic Institute and thank S. McCallum for providing valuable support.
Ethics declarations
Competing interests
G.T.M. is a founder of Nexomics Biosciences. The other authors declare no competing interests.
Additional information
Peer review information Nature thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Comparison of the hallucinated designs to proteins with known structure and of similar length (100 ± 10 aa) from the trRosetta training set.
a,b) Multidimensional scaling plots of the sequence (a) and structure (b) spaces covered by the 2,000 hallucinated proteins (blue dots) along with 1,110 proteins of similar length from the trRosetta training set (red dots). These scatter plots show that the subspaces spanned by hallucinated proteins and natural proteins of similar size (100 ± 10 aa) are quite distinct; the network is not simply recapitulating native proteins of the same length. Soluble and structurally characterized hallucinations are marked by black and magenta dots, respectively. c,d) Distributions of pairwise structure (c) and sequence (d) similarities for hallucinated and natural proteins. The hallucinated proteins are more similar to each other (blue lines) than they are to natural proteins (grey lines). e) Sequence comparisons (gapless threading) of fragments of various sizes (15, 20, ..., 60 aa) from the hallucinated designs (blue) and natural 100 (± 10) aa-long proteins (red) to other proteins from the trRosetta training set. There is no apparent tendency for the trRosetta-based design procedure to “copy over” sequence fragments from the proteins in the training set into the hallucinated designs. f,g) Secondary structure content of the hallucinated designs and natural 100 aa-long proteins from the training set. Hallucinations are more ideal than natural proteins in having fewer loops but longer secondary structure elements.
Extended Data Fig. 2 Additional data on the experimentally characterized all-α and mixed α–β network-hallucinated proteins.
a,e) Dendrograms showing representative hallucinated protein designs clustered by TM-score; thermostable designs with CD spectra consistent with the target structure are labelled by their IDs. b,f) Three-dimensional models of the hallucinated designs. c,g) Predicted distance maps at the end of the hallucination trajectory. d,h) Temperature dependence of CD signal at 220 nm in the 25–95 °C temperature range.
Extended Data Fig. 3 Additional examples of thermostable hallucinations with CD spectra consistent with the target structure.
a,g) 3D structure models of the hallucinated designs. b,h) Predicted distance maps at the end of the hallucination trajectory. c,i) Ab initio folding funnels from Rosetta. d,j) Size-exclusion chromatography traces. e,k) Circular dichroism spectra at 25 °C (blue) and 95 °C (red). f,l) Temperature dependence of circular dichroism signal at 220 nm in the 25–95 °C temperature range.
Extended Data Fig. 4 Comparison of 0515 NMR structure to hallucinated model.
a) Superposition of hallucinated model (blue) and NMR medoid structure (gray) of 0515 reveals a 1.82 Å backbone r.m.s.d. over 100 residues. b) Hallucinated model of 0515 colored by distance between Cα–Cα pairs between model and NMR medoid structure after structural superposition and c) corresponding plot of per-residue Cα–Cα distance difference between model and NMR medoid structure.
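As an illustration of the quantities reported in this caption, the following Python sketch computes per-residue Cα–Cα distances and the overall r.m.s.d. after superposing a model onto a reference structure. It assumes an ordinary least-squares (Kabsch) superposition over matched Cα coordinates; this is an assumption for illustration only, since the reference list cites a maximum-likelihood superposition method and PyMOL, and the exact procedure used for the figure is not stated here. The function names and the synthetic coordinates are hypothetical.

```python
import numpy as np


def kabsch_superpose(mobile, target):
    """Least-squares superposition of `mobile` onto `target`.

    Both arrays have shape (N, 3) and list matched atoms in the same order.
    Returns the transformed copy of `mobile`.
    """
    mob_center = mobile.mean(axis=0)
    tgt_center = target.mean(axis=0)
    P = mobile - mob_center
    Q = target - tgt_center
    H = P.T @ Q                      # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that the result is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return P @ R.T + tgt_center


def ca_deviation(model_ca, reference_ca):
    """Per-residue distances and overall r.m.s.d. after superposition."""
    fitted = kabsch_superpose(model_ca, reference_ca)
    per_residue = np.linalg.norm(fitted - reference_ca, axis=1)
    rmsd = float(np.sqrt(np.mean(per_residue ** 2)))
    return per_residue, rmsd


if __name__ == "__main__":
    # Synthetic example: a rotated, translated and slightly perturbed copy.
    rng = np.random.default_rng(0)
    reference = rng.normal(scale=10.0, size=(100, 3))
    theta = 0.7
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta), np.cos(theta), 0.0],
                    [0.0, 0.0, 1.0]])
    model = reference @ rot.T + np.array([5.0, -3.0, 2.0])
    model += rng.normal(scale=0.5, size=model.shape)
    per_residue, rmsd = ca_deviation(model, reference)
    print(f"r.m.s.d. over {len(per_residue)} residues: {rmsd:.2f}")
```

The per-residue array is what a plot such as the one described in the caption would display, and the r.m.s.d. is its root-mean-square summary.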
Extended Data Fig. 5 Structural analysis of 0217 and comparison to hallucinated model.
a) Representative electron density (2Fo-Fc, 1σ) over entire asymmetric unit (left) and core packing regions (right) of hallucination 0217. b) Both chains of the crystal structure colored by B-factor. c) Structural superposition of chains observed in the asymmetric unit reveals a 2.8 Å backbone r.m.s.d. over 91 residues. d) Crystal lattice contacts for chain A (green) and chain B (yellow) may explain structural differences observed between chains. Circled regions highlight where chain A is an ordered helix-loop-helix and chain B is disordered. e) Hallucinated model of 0217 colored by distance between Cα–Cα pairs between model and crystal structure after structural superposition and corresponding plot of per-residue Cα–Cα distance difference between model and crystal structure. f) Structural superposition of the hallucinated model and chain B of the 0217 crystal structure (left), 0217 model colored by Cα–Cα distance between hallucination and crystal structure (middle), and per-residue Cα–Cα distance between hallucination and crystal structure (right).
Extended Data Fig. 6 Structural analysis, NMR characterization, and SEC analysis of hallucinated sequence 0417.
a) Hallucinated model with surface hydrophobics shown as sticks and b) [1H-15N]-SOFAST-HMQC spectra of hallucinated sequence 0417 before (red) and after (blue) buffer optimization. The spectrum before optimization (red) was obtained using a protein concentration of ~0.3 mM at 298 K in 20 mM Tris-HCl, pH 7.2, 100 mM NaCl, and the spectrum after optimization (blue) was obtained using a protein concentration of ~0.3 mM at a temperature of 323 K in a buffer of 20 mM sodium phosphate at pH 6.5, 50 mM NaCl and 20% glycerol. The NMR data are consistent with a folded structure containing a mix of alpha and beta secondary structure. Even under optimized conditions, there is still evidence of exchange broadening (e.g. Trp side chain NεHs are weak), resonances that appear only at high temperature and high glycerol concentrations, and some resonances that are doubled; all are indications of transient self-association. c) Size-exclusion chromatography trace of 0417 displays a small additional peak corresponding to a larger oligomeric species, which corroborates the NMR analysis.
Extended Data Fig. 7 Structural analysis of 0738_mod and comparison to hallucinated model 0738.
a) Representative electron density (2Fo-Fc, 1σ) over entire asymmetric unit (left) and core packing regions (right) of hallucination 0738_mod. b) Both chains of the crystal structure colored by B-factor. c) Structural superposition of the hallucinated model and chain A of the 0738_mod crystal structure (left), 0738_mod model colored by Cα–Cα distance between hallucination and crystal structure (middle), and per-residue Cα–Cα distance between hallucination and crystal structure (right). d) Hallucinated model of 0738_mod colored by distance between Cα–Cα pairs between model and crystal structure after structural superposition and corresponding plot of per-residue Cα–Cα distance difference between model and crystal structure.
Extended Data Fig. 8 NMR and biochemical analysis of hallucinated sequences 0515, 0738_mod, and 0217.
a) 1H-15N heteronuclear NOE (hetNOE) histograms for 0515 (82 non-overlapped peaks), 0738_mod (144 peaks), and 0217 (47 peaks), together with their average values. 1H-15N steady-state heteronuclear NOEs were obtained from the ratio of cross-peak intensities (Isaturated/Iequilibrium) measured with (Isaturated) and without (Iequilibrium) 3 s of proton saturation during the presat delay. The spectra were recorded in an interleaved manner, split in TopSpin, processed identically using NMRPipe, and peak picked in SPARKY to obtain peak intensities. b) 1H-15N HSQC spectra of the corresponding proteins collected at 800 MHz at 298 K in 25 mM HEPES, pH 7.4, 50 mM NaCl buffer; samples were prepared in 5-mm Shigemi NMR tubes with addition of 5% D2O (v/v). These 15N-enriched protein samples were prepared at concentrations of 0.4 mM, 0.15 mM, and 0.2 mM, respectively. c) SEC data demonstrating monodispersity of these proteins in solution, with predominantly monomer for 0515 and 0738_mod and predominantly dimer for 0217. SDS-PAGE data (not shown) show that each is > 95% homogeneous, which together with MALDI-TOF mass spectrometry indicates that the spectral heterogeneity observed is not due to chemical heterogeneity. d) Ribbon diagrams of the corresponding monomeric or dimeric protein structures. These results show that the three designs have characteristic dynamics in solution. The average hetNOE for the homodimer 0217 is lower than for 0515 and 0738_mod, and it has fewer peaks than expected due to exchange broadening. Although 0738_mod has a hetNOE distribution similar to that of monomeric 0515, it has more than double the expected number of peaks, indicating at least two folded conformations (for all or parts of the protein) in solution that are in slow conformational exchange on the NMR time-scale.
This was further validated by the appearance of new peaks in spectra at lower temperature (288 K), and different peaks at higher temperatures (308 and 318 K), and confirmed by detection of 15N ZZ-exchange cross peaks at 318 K with 600 and 750 ms mixing times (Bruker pulse sequence hsqcetexf3gp; data not shown; ref. 60).
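The hetNOE values described in this caption are simple per-peak intensity ratios, which the following Python sketch makes explicit. It is an illustration only: the function names are hypothetical, the intensities in the usage example are made-up numbers rather than data from the paper, and the histogram binning is an arbitrary choice.

```python
import numpy as np


def het_noe(i_saturated, i_equilibrium):
    """Per-peak steady-state hetNOE, computed as Isaturated / Iequilibrium."""
    i_sat = np.asarray(i_saturated, dtype=float)
    i_eq = np.asarray(i_equilibrium, dtype=float)
    if i_sat.shape != i_eq.shape:
        raise ValueError("both spectra must list the same peaks in the same order")
    return i_sat / i_eq


def summarize(ratios, n_bins=10):
    """Average hetNOE and histogram counts over the range 0 to 1."""
    counts, edges = np.histogram(ratios, bins=np.linspace(0.0, 1.0, n_bins + 1))
    return float(np.mean(ratios)), counts, edges


if __name__ == "__main__":
    # Made-up peak intensities for illustration only.
    with_saturation = [8.0, 6.0, 7.5, 4.2]
    without_saturation = [10.0, 10.0, 10.0, 10.0]
    ratios = het_noe(with_saturation, without_saturation)
    mean, counts, _ = summarize(ratios)
    print("per-peak hetNOE:", ratios)
    print("average hetNOE:", mean)
    print("histogram counts:", counts)
```

Lower ratios indicate more flexible backbone amides, which is why the average value and the shape of the histogram are compared across the three designs.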
Supplementary information
Supplementary Information
This file contains a Supplementary Discussion, Supplementary Table 1 and Supplementary Figs 1–7.
Cite this article
Anishchenko, I., Pellock, S.J., Chidyausiku, T.M. et al. De novo protein design by deep network hallucination. Nature 600, 547–552 (2021). https://doi.org/10.1038/s41586-021-04184-w
DOI: https://doi.org/10.1038/s41586-021-04184-w