De novo protein design by deep network hallucination

Anishchenko, Ivan; Pellock, Samuel J.; Chidyausiku, Tamuka M.; Ramelot, Theresa A.; Ovchinnikov, Sergey; Hao, Jingzhou; Bafna, Khushboo; Norn, Christoffer; Kang, Alex; Bera, Asim K.; DiMaio, Frank; Carter, Lauren; Chow, Cameron M.; Montelione, Gaetano T.; Baker, David

doi:10.1038/s41586-021-04184-w

Article
Published: 01 December 2021

Nature volume 600, pages 547–552 (2021)

Abstract

There has been considerable recent progress in protein structure prediction using deep neural networks to predict inter-residue distances from amino acid sequences^1,2,3. Here we investigate whether the information captured by such networks is sufficiently rich to generate new folded proteins with sequences unrelated to those of the naturally occurring proteins used in training the models. We generate random amino acid sequences, and input them into the trRosetta structure prediction network to predict starting residue–residue distance maps, which, as expected, are quite featureless. We then carry out Monte Carlo sampling in amino acid sequence space, optimizing the contrast (Kullback–Leibler divergence) between the inter-residue distance distributions predicted by the network and background distributions averaged over all proteins. Optimization from different random starting points resulted in novel proteins spanning a wide range of sequences and predicted structures. We obtained synthetic genes encoding 129 of the network-‘hallucinated’ sequences, and expressed and purified the proteins in Escherichia coli; 27 of the proteins yielded monodisperse species with circular dichroism spectra consistent with the hallucinated structures. We determined the three-dimensional structures of three of the hallucinated proteins, two by X-ray crystallography and one by NMR, and these closely matched the hallucinated models. Thus, deep networks trained to predict native protein structures from their sequences can be inverted to design new proteins, and such networks and methods should contribute alongside traditional physics-based models to the de novo design of proteins with new functions.
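The sampling loop described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' trDesign code: `toy_predictor` is a hypothetical stand-in for the trRosetta network, the flat `background` distribution is an assumption made here for simplicity (the abstract describes a background averaged over all proteins), and the bin count, step count and temperature are arbitrary.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
N_BINS = 8  # hypothetical number of distance bins


def toy_predictor(seq_idx, weights):
    """Stand-in for a structure-prediction network (NOT trRosetta).

    Returns an (L, L, N_BINS) array of per-pair distance-bin probabilities.
    """
    emb = weights[seq_idx]                       # (L, N_BINS)
    logits = emb[:, None, :] * emb[None, :, :]   # (L, L, N_BINS)
    logits = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)


def mean_kl(pred, background, eps=1e-9):
    """Average KL(pred || background) over all residue pairs."""
    kl = np.sum(pred * (np.log(pred + eps) - np.log(background + eps)), axis=-1)
    return float(kl.mean())


def hallucinate(length=30, steps=500, temperature=0.05, seed=0):
    """Monte Carlo search in sequence space that maximizes the mean KL divergence."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(len(AA), N_BINS))   # toy "network" parameters
    background = np.full(N_BINS, 1.0 / N_BINS)     # assumed flat background
    seq = rng.integers(len(AA), size=length)       # random starting sequence
    score = mean_kl(toy_predictor(seq, weights), background)
    start_score = score
    best_seq, best_score = seq.copy(), score
    for _ in range(steps):
        trial = seq.copy()
        trial[rng.integers(length)] = rng.integers(len(AA))  # single-site mutation
        trial_score = mean_kl(toy_predictor(trial, weights), background)
        # Metropolis criterion: always accept improvements, sometimes accept worse moves.
        accept = trial_score >= score or rng.random() < np.exp(
            (trial_score - score) / temperature
        )
        if accept:
            seq, score = trial, trial_score
            if score > best_score:
                best_seq, best_score = seq.copy(), score
    return "".join(AA[i] for i in best_seq), start_score, best_score
```

Calling `hallucinate()` returns the best sequence found together with the starting and best scores. In a real setting the predictor call would be replaced by a forward pass of the trained network, which dominates the cost of each step.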

**Fig. 1: Overview of protein hallucination approach.**

**Fig. 2: Overview of computational results.**

**Fig. 3: Experimental characterization of α-helical network-hallucinated proteins.**

**Fig. 4: Experimental characterization of network-hallucinated proteins with mixed α–β structures.**

**Fig. 5: Structural analysis of network-hallucinated proteins.**

Data availability

The atomic coordinates of the crystal structures for designs 0217 and 0738_mod, as well as the NMR structure for design 0515 have been deposited in the RCSB Protein Data Bank with the accession numbers 7K3H, 7M0Q and 7M5T, respectively. NMR chemical shifts, NOESY peak lists, and spectral data have been deposited in the BioMagResDB, BMRB ID 30890. Amino acid sequences and structure models for all 2K designs described in the manuscript are freely available for download at https://files.ipd.uw.edu/pub/trRosetta/hallucinations2K.tar.gz. Amino acid sequences and 3D structures of the generated designs were compared to known protein sequences and structures in UniProt (https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2017_12/uniref/) and the Protein Data Bank (11 March 2020), respectively.

Code availability

The computer code used to generate the hallucinated proteins described in the manuscript was made publicly available as part of the trDesign GitHub package (https://github.com/gjoni/trDesign); corresponding structural models were generated by the trRosetta structure modelling script available for free download at https://yanglab.nankai.edu.cn/trRosetta/download/. The Rosetta software suite was used to perform ab initio prediction calculations. Rosetta is freely available for academic users on GitHub, and can be licensed for commercial use through the University of Washington CoMotion Express License Program.

References

Xu, J. Distance-based protein folding powered by deep learning. Proc. Natl Acad. Sci. USA 116, 16856–16865 (2019).
Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).
Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).
Madani, A. et al. ProGen: language modeling for protein generation. Preprint at https://arxiv.org/abs/2004.03497 (2020).
Anand, N., Eguchi, R. & Huang, P. S. Fully differentiable full-atom protein backbone generation. In ICLR 2019 Workshop https://openreview.net/forum?id=SJxnVL8YOV (2019).
Wang, J., Cao, H., Zhang, J. Z. H. & Qi, Y. Computational protein design with deep learning neural networks. Sci. Rep. 8, 6349 (2018).
Ingraham, J., Garg, V. K., Barzilay, R. & Jaakkola, T. Generative models for graph-based protein design. In ICLR 2019 Workshop https://openreview.net/forum?id=SJgxrLLKOE (2019).
Anand, N., Eguchi, R. R., Derry, A., Altman, R. B. & Huang, P.-S. Protein sequence design with a learned potential. Preprint at https://doi.org/10.1101/2020.01.06.895466 (2020).
Strokach, A., Becerra, D., Corbi-Verge, C., Perez-Riba, A. & Kim, P. M. Fast and flexible protein design using deep graph neural networks. Cell Syst. 11, 402–411.e4 (2020).
Karimi, M., Zhu, S., Cao, Y. & Shen, Y. De novo protein design for novel folds using guided conditional Wasserstein generative adversarial networks. J. Chem. Inf. Model. 60, 5667–5681 (2020).
Davidsen, K. et al. Deep generative models for T cell receptor protein sequences. eLife 8, e46935 (2019).
Costello, Z. & Martin, H. G. How to hallucinate functional proteins. Preprint at https://arxiv.org/abs/1903.00458 (2019).
Eguchi, R. R., Anand, N., Choe, C. A. & Huang, P.-S. IG-VAE: generative modeling of immunoglobulin proteins by direct 3D coordinate generation. Preprint at https://doi.org/10.1101/2020.08.07.242347 (2020).
Repecka, D. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat. Mach. Intell. 3, 324–333 (2021).
Hawkins-Hooker, A. et al. Generating functional protein variants with variational autoencoders. PLoS Comput. Biol. 17, e1008736 (2021).
Senior, A. W. et al. Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins 87, 1141–1148 (2019).
Mordvintsev, A., Olah, C. & Tyka, M. Inceptionism: going deeper into neural networks. Google AI Blog https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html (2015).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Rohl, C. A., Strauss, C. E. M., Misura, K. M. S. & Baker, D. Protein structure prediction using Rosetta. Methods Enzymol. 383, 66–93 (2004).
Park, H. et al. Simultaneous optimization of biomolecular energy functions on features from small molecules and macromolecules. J. Chem. Theory Comput. 12, 6201–6212 (2016).
Rossi, P. et al. A microscale protein NMR sample screening pipeline. J. Biomol. NMR 46, 11–22 (2010).
Koga, N. et al. Principles for designing ideal protein structures. Nature 491, 222–227 (2012).
Dou, J. et al. De novo design of a fluorescence-activating β-barrel. Nature 561, 485–491 (2018).
Norn, C. et al. Protein sequence design by conformational landscape optimization. Proc. Natl Acad. Sci. USA 118, e2017228118 (2021).
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Wang, J. et al. Deep learning methods for designing proteins scaffolding functional sites. Preprint at https://doi.org/10.1101/2021.11.10.468128 (2021).
Jendrusch, M., Korbel, J. O. & Sadiq, S. K. AlphaDesign: A de novo protein design framework based on AlphaFold. Preprint at https://doi.org/10.1101/2021.10.11.463937 (2021).
Tischer, D. et al. Design of proteins presenting discontinuous functional sites using deep learning. Preprint at https://doi.org/10.1101/2020.11.29.402743 (2020).
Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
Studier, F. W. Protein production by auto-induction in high density shaking cultures. Protein Expr. Purif. 41, 207–234 (2005).
Pace, C. N., Vajdos, F., Fee, L., Grimsley, G. & Gray, T. How to measure and predict the molar absorption coefficient of a protein. Protein Sci. 4, 2411–2423 (1995).
Acton, T. B. et al. Preparation of protein samples for NMR structure, function, and small-molecule screening studies. Methods Enzymol. 493, 21–60 (2011).
Xiao, R. et al. The high-throughput protein sample production platform of the Northeast Structural Genomics Consortium. J. Struct. Biol. 172, 21–33 (2010).
Jansson, M. et al. High-level production of uniformly 15N-and 13C-enriched fusion proteins in Escherichia coli. J. Biomol. NMR 7, 131–141 (1996).
Ottiger, M., Delaglio, F. & Bax, A. Measurement of J and dipolar couplings from simplified two-dimensional NMR spectra. J. Magn. Reson. 131, 373–378 (1998).
Delaglio, F. et al. NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J. Biomol. NMR 6, 277–293 (1995).
Lee, W., Tonelli, M. & Markley, J. L. NMRFAM-SPARKY: enhanced software for biomolecular NMR spectroscopy. Bioinformatics 31, 1325–1327 (2015).
Favier, A. & Brutscher, B. NMRlib: user-friendly pulse sequence tools for Bruker NMR spectrometers. J. Biomol. NMR 73, 199–211 (2019).
Hyberts, S. G., Milbradt, A. G., Wagner, A. B., Arthanari, H. & Wagner, G. Application of iterative soft thresholding for fast reconstruction of NMR data non-uniformly sampled with multidimensional Poisson gap scheduling. J. Biomol. NMR 52, 315–327 (2012).
Ying, J., Delaglio, F., Torchia, D. A. & Bax, A. Sparse multidimensional iterative lineshape-enhanced (SMILE) reconstruction of both non-uniformly sampled and conventional NMR data. J. Biomol. NMR 68, 101–118 (2017).
Lee, W. et al. I-PINE web server: an integrative probabilistic NMR assignment system for proteins. J. Biomol. NMR 73, 213–222 (2019).
Moseley, H. N. B., Sahota, G. & Montelione, G. T. Assignment validation software suite for the evaluation and presentation of protein resonance assignment data. J. Biomol. NMR 28, 341–355 (2004).
Shen, Y. & Bax, A. Protein backbone and sidechain torsion angles predicted from NMR chemical shifts using artificial neural networks. J. Biomol. NMR 56, 227–241 (2013).
Güntert, P., Mumenthaler, C. & Wüthrich, K. Torsion angle dynamics for NMR structure calculation with the new program DYANA. J. Mol. Biol. 273, 283–298 (1997).
Herrmann, T., Güntert, P. & Wüthrich, K. Protein NMR structure determination with automated NOE-identification in the NOESY spectra using the new software ATNOS. J. Biomol. NMR 24, 171–189 (2002).
Huang, Y. J., Powers, R. & Montelione, G. T. Protein NMR recall, precision, and F-measure scores (RPF scores): structure quality assessment measures based on information retrieval statistics. J. Am. Chem. Soc. 127, 1665–1674 (2005).
Huang, Y. J., Tejero, R., Powers, R. & Montelione, G. T. A topology-constrained distance network algorithm for protein structure determination from NOESY data. Proteins 62, 587–603 (2006).
Brünger, A. T. et al. Crystallography & NMR system: A new software suite for macromolecular structure determination. Acta Crystallogr. D 54, 905–921 (1998).
Bhattacharya, A., Tejero, R. & Montelione, G. T. Evaluating protein structures determined by structural genomics consortia. Proteins 66, 778–795 (2007).
Otwinowski, Z. & Minor, W. Processing of X-ray diffraction data collected in oscillation mode. Methods Enzymol. 276, 307–326 (1997).
McCoy, A. J. et al. Phaser crystallographic software. J. Appl. Crystallogr. 40, 658–674 (2007).
DiMaio, F. et al. Improved low-resolution crystallographic refinement with Phenix and Rosetta. Nat. Methods 10, 1102–1104 (2013).
Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. Features and development of Coot. Acta Crystallogr. D 66, 486–501 (2010).
Liebschner, D. et al. Macromolecular structure determination using X-rays, neutrons and electrons: recent developments in Phenix. Acta Crystallogr. D 75, 861–877 (2019).
Theobald, D. L. & Wuttke, D. S. Accurate structural correlations from maximum likelihood superpositions. PLoS Comput. Biol. 4, e43 (2008).
The PyMOL Molecular Graphics System version 2.4 (Schrödinger, 2021).
Zweckstetter, M. NMR: prediction of molecular alignment from structure using the PALES software. Nat. Protoc. 3, 679–690 (2008).
Montelione, G. T. & Wagner, G. 2D Chemical exchange NMR spectroscopy by proton-detected heteronuclear correlation. J. Am. Chem. Soc. 111, 3096–3098 (1989).

Acknowledgements

We thank R. Xiao, G. Liu and A. Wu (Nexomics Biosciences) for assistance in initial NMR protein production; J. Aramini for assistance with NMR data collection for initial HSQC screening; R. Ballard and X. Li for mass spectrometry assistance; and R. Divine and R. Kibler for AKTA scripting. This work was funded by grants from the NSF (DBI 1937533 to D.B. and I.A., and MCB 2032259 to S.O.), the NIH (DP5OD026389 to S.O.), Open Philanthropy (C.C. and A.B.), Eric and Wendy Schmidt by recommendation of the Schmidt Futures program (F.D. and L.C.), the Audacious project (A.K.), the Washington Research Foundation (S.J.P.) and Novo Nordisk Foundation Grant NNF17OC0030446 (C.N.). This work was also supported in part by NIH grants R01 GM120574 (G.T.M.) and R35GM141818 (G.T.M.), and the Howard Hughes Medical Institute (D.B. and T.M.C.). We also acknowledge computing resources provided by the Hyak supercomputer system funded by the STF at the University of Washington, and Rosetta@Home volunteers in ab initio structure prediction calculations, and thank staff at Northeastern Collaborative Access Team at Advanced Photon Source for the beamline, supported by NIH grants P30GM124165 and S10OD021527, and DOE contract DE-AC02-06CH11357. We acknowledge the NMR Core Facility resources at Rensselaer Polytechnic Institute and thank S. McCallum for providing valuable support.

Author information

These authors contributed equally: Ivan Anishchenko, Samuel J. Pellock

Authors and Affiliations

Department of Biochemistry, University of Washington, Seattle, WA, USA
Ivan Anishchenko, Samuel J. Pellock, Tamuka M. Chidyausiku, Christoffer Norn, Alex Kang, Asim K. Bera, Frank DiMaio, Lauren Carter, Cameron M. Chow & David Baker
Institute for Protein Design, University of Washington, Seattle, WA, USA
Ivan Anishchenko, Samuel J. Pellock, Tamuka M. Chidyausiku, Christoffer Norn, Alex Kang, Asim K. Bera, Frank DiMaio, Lauren Carter, Cameron M. Chow & David Baker
Department of Chemistry and Chemical Biology, Rensselaer Polytechnic Institute, Troy, NY, USA
Theresa A. Ramelot, Jingzhou Hao, Khushboo Bafna & Gaetano T. Montelione
Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY, USA
Theresa A. Ramelot, Jingzhou Hao, Khushboo Bafna & Gaetano T. Montelione
John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, USA
Sergey Ovchinnikov
Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
David Baker

Authors

Ivan Anishchenko
Samuel J. Pellock
Tamuka M. Chidyausiku
Theresa A. Ramelot
Sergey Ovchinnikov
Jingzhou Hao
Khushboo Bafna
Christoffer Norn
Alex Kang
Asim K. Bera
Frank DiMaio
Lauren Carter
Cameron M. Chow
Gaetano T. Montelione
David Baker

Corresponding author

Correspondence to David Baker.

Ethics declarations

Competing interests

G.T.M. is a founder of Nexomics Biosciences. The other authors declare no competing interests.

Additional information

Peer review information Nature thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Comparison of the hallucinated designs to proteins with known structure and of similar length (100 +/− 10 aa) from the trRosetta training set.

a,b) Multidimensional scaling plots of the sequence (a) and structure (b) spaces covered by the 2,000 hallucinated proteins (blue dots) along with 1,110 proteins of similar length from the trRosetta training set (red dots). These scatter plots show that subspaces spanned by hallucinated proteins and natural proteins of similar size (100 +/− 10 aa) are quite distinct; the network is not simply recapitulating native proteins of the same length. Soluble and structurally characterized hallucinations are marked by black and magenta dots, respectively. c,d) Distributions of pairwise structure (c) and sequence (d) similarities for hallucinated and natural proteins. The hallucinated proteins are more similar to each other (blue lines) than they are to natural proteins (grey lines). e) Sequence comparisons (gapless threading) of fragments of various sizes (15, 20, ..., 60 aa) from the hallucinated designs (blue) and natural 100 (+/− 10) aa-long proteins (red) to other proteins from the trRosetta training set. There is no apparent tendency for the trRosetta-based design procedure to “copy over” sequence fragments from the proteins in the training set into the hallucinated designs. f,g) Secondary structure content of the hallucinated designs and natural 100 aa-long proteins from the training set. Hallucinations are more ideal than natural proteins in having fewer loops but longer secondary structure elements.
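The gapless threading comparison in panel e can be illustrated with a short sketch. The legend does not state how matches were scored, so fraction of identical residues is an assumption here, and the function names are hypothetical.

```python
def fragments(seq, size, step=1):
    """All contiguous fragments of `seq` with the given size."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


def best_gapless_identity(fragment, target):
    """Slide `fragment` along `target` without gaps and return the highest
    fraction of identical positions over all full-overlap offsets."""
    n = len(fragment)
    if n == 0 or n > len(target):
        return 0.0
    best = 0
    for offset in range(len(target) - n + 1):
        window = target[offset:offset + n]
        matches = sum(a == b for a, b in zip(fragment, window))
        best = max(best, matches)
    return best / n


def max_identity_to_set(design, training_set, size):
    """Highest gapless identity of any `size`-residue fragment of `design`
    against any sequence in `training_set` (0.0 if nothing can be compared)."""
    scores = [
        best_gapless_identity(frag, target)
        for frag in fragments(design, size)
        for target in training_set
    ]
    return max(scores, default=0.0)
```

A fragment that was copied verbatim from a training sequence would score 1.0, so a distribution of scores well below 1.0 for long fragments is what "no copying" looks like under this measure.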

Extended Data Fig. 2 Additional data on the experimentally characterized all-α and mixed α–β network-hallucinated proteins.

a,e) Dendrograms showing representative hallucinated protein designs clustered by TM-score; thermostable designs with CD spectra consistent with the target structure are labelled by their IDs. b,f) Three-dimensional models of the hallucinated designs. c,g) Predicted distance maps at the end of the hallucination trajectory. d,h) Temperature dependence of the CD signal at 220 nm in the 25–95 °C temperature range.

Extended Data Fig. 3 Additional examples of thermostable hallucinations with CD spectra consistent with the target structure.

a,g) 3D structure models of the hallucinated designs. b,h) Predicted distance maps at the end of the hallucination trajectory. c,i) Ab initio folding funnels from Rosetta. d,j) Size-exclusion chromatography traces. e,k) Circular dichroism spectra at 25 °C (blue) and 95 °C (red). f,l) Temperature dependence of the circular dichroism signal at 220 nm in the 25–95 °C temperature range.

Extended Data Fig. 4 Comparison of 0515 NMR structure to hallucinated model.

a) Superposition of the hallucinated model (blue) and the NMR medoid structure (gray) of 0515 reveals a 1.82 Å backbone r.m.s.d. over 100 residues. b) Hallucinated model of 0515 colored by the Cα-Cα distance between the model and the NMR medoid structure after structural superposition, and the corresponding plot of the per-residue Cα-Cα distance between the model and the NMR medoid structure.
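The quantities in this legend, an r.m.s.d. after superposition and a per-residue Cα-Cα distance, can be computed along the following lines. This sketch uses a least-squares (Kabsch) superposition on Cα atoms only, which is an assumption made for illustration; the legend reports a backbone r.m.s.d. and does not specify the superposition procedure, so values from this code would not be expected to reproduce the published numbers exactly.

```python
import numpy as np


def kabsch_superpose(mobile, target):
    """Return `mobile` (N, 3) optimally rotated and translated onto `target` (N, 3)."""
    mob_c = mobile - mobile.mean(axis=0)
    tgt_c = target - target.mean(axis=0)
    h = mob_c.T @ tgt_c                        # 3 x 3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return mob_c @ rot.T + target.mean(axis=0)


def per_residue_ca_distance(model_ca, exp_ca):
    """Cα-Cα distance for each residue after superposing the model on the experiment."""
    fitted = kabsch_superpose(model_ca, exp_ca)
    return np.linalg.norm(fitted - exp_ca, axis=1)


def ca_rmsd(model_ca, exp_ca):
    """Root-mean-square of the per-residue Cα-Cα distances."""
    dist = per_residue_ca_distance(model_ca, exp_ca)
    return float(np.sqrt(np.mean(dist ** 2)))
```

Both inputs are expected to be arrays of matched Cα coordinates in the same residue order; extracting them from coordinate files and aligning residue numbering is left to the caller.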

Extended Data Fig. 5 Structural analysis of 0217 and comparison to hallucinated model.

a) Representative electron density (2Fo-Fc, 1σ) over the entire asymmetric unit (left) and core packing regions (right) of hallucination 0217. b) Both chains of the crystal structure colored by B-factor. c) Structural superposition of the chains observed in the asymmetric unit reveals a 2.8 Å backbone r.m.s.d. over 91 residues. d) Crystal lattice contacts for chain A (green) and chain B (yellow) may explain structural differences observed between chains. Circled regions highlight where chain A is an ordered helix-loop-helix and chain B is disordered. e) Hallucinated model of 0217 colored by the Cα-Cα distance between the model and the crystal structure after structural superposition, and the corresponding plot of the per-residue Cα-Cα distance between the model and the crystal structure. f) Structural superposition of the hallucinated model and chain B of the 0217 crystal structure (left), 0217 model colored by Cα-Cα distance between hallucination and crystal structure (middle), and per-residue Cα-Cα distance between hallucination and crystal structure (right).

Extended Data Fig. 6 Structural analysis, NMR characterization, and SEC analysis of hallucinated sequence 0417.

a) Hallucinated model with surface hydrophobics shown as sticks and b) [¹H-¹⁵N]-SOFAST-HMQC spectra of hallucinated sequence 0417 before (red) and after (blue) buffer optimization. The spectrum before optimization (red) was obtained using a protein concentration of ~0.3 mM at 298 K in 20 mM Tris-HCl, pH 7.2, 100 mM NaCl, and the spectrum after optimization (blue) was obtained using a protein concentration of ~0.3 mM at a temperature of 323 K in a buffer of 20 mM sodium phosphate at pH 6.5, 50 mM NaCl, and 20% glycerol. The NMR data are consistent with a folded structure containing a mix of alpha and beta secondary structure. Even under optimized conditions, there is still evidence of exchange broadening (e.g. Trp side-chain NεH resonances are weak), resonances that appear only at high temperature and high glycerol concentrations, and some resonances that are doubled; all are indications of transient self-association. c) Size-exclusion chromatography trace of 0417 displays a small additional peak corresponding to a larger oligomeric species, which corroborates the NMR analysis.

Extended Data Fig. 7 Structural analysis of 0738_mod and comparison to hallucinated model 0738.

a) Representative electron density (2Fo-Fc, 1σ) over the entire asymmetric unit (left) and core packing regions (right) of hallucination 0738_mod. b) Both chains of the crystal structure colored by B-factor. c) Structural superposition of the hallucinated model and chain A of the 0738_mod crystal structure (left), 0738_mod model colored by Cα-Cα distance between hallucination and crystal structure (middle), and per-residue Cα-Cα distance between hallucination and crystal structure (right). d) Hallucinated model of 0738_mod colored by the Cα-Cα distance between the model and the crystal structure after structural superposition, and the corresponding plot of the per-residue Cα-Cα distance between the model and the crystal structure.

Extended Data Fig. 8 NMR and biochemical analysis of hallucinated sequences 0515, 0738_mod, and 0217.

a) ¹H-¹⁵N heteronuclear NOE (hetNOE) histograms for 0515 (82 non-overlapped peaks), 0738_mod (144 peaks), and 0217 (47 peaks), together with their average values. ¹H-¹⁵N steady-state heteronuclear NOEs were obtained from the ratio of cross-peak intensities (I_saturated/I_equilibrium) measured with (I_saturated) and without (I_equilibrium) 3 s of proton saturation during the presat delay. The two spectra were recorded in an interleaved manner, split in TopSpin, processed identically using NMRPipe, and peak picked in SPARKY to obtain peak intensities. b) ¹H-¹⁵N HSQC spectra of the corresponding proteins collected at 800 MHz at 298 K in 25 mM HEPES, pH 7.4, 50 mM NaCl buffer and prepared in 5-mm Shigemi NMR tubes for data collection with addition of 5% D₂O (v/v). These ¹⁵N-enriched protein samples were prepared at concentrations of 0.4 mM, 0.15 mM, and 0.2 mM, respectively. c) SEC data demonstrating monodispersity of these proteins in solution, with predominantly monomer for 0515 and 0738_mod and predominantly dimer for 0217. SDS-PAGE data (not shown) show that each is >95% homogeneous, which together with MALDI-TOF mass spectrometry indicates that the spectral heterogeneity observed is not due to chemical heterogeneity. d) Ribbon diagrams of the corresponding monomeric or dimeric protein structures. These results show that the three designs have characteristic dynamics in solution. The average hetNOE for the homodimer 0217 is lower than for 0515 and 0738_mod, and it has fewer peaks than expected due to exchange broadening. Although 0738_mod has a hetNOE distribution similar to that of monomeric 0515, it has more than double the expected number of peaks, indicating at least two folded conformations (for all or parts of the protein) in solution that are in slow conformational exchange on the NMR time-scale. This was further validated by the appearance of new peaks in spectra at lower temperature (288 K), and different peaks at higher temperatures (308 and 318 K), and confirmed by detection of ¹⁵N ZZ-exchange cross peaks at 318 K with 600 and 750 ms mixing times (Bruker pulse sequence hsqcetexf3gp, data not shown)⁶⁰.
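The hetNOE values in panel a are simple intensity ratios, which can be sketched as below. The function name and the input checks are illustrative assumptions; the legend states only that the ratio I_saturated/I_equilibrium was taken per cross peak and that averages were reported.

```python
import numpy as np


def het_noe(i_saturated, i_equilibrium):
    """Per-peak steady-state heteronuclear NOE: I_saturated / I_equilibrium."""
    sat = np.asarray(i_saturated, dtype=float)
    ref = np.asarray(i_equilibrium, dtype=float)
    if sat.shape != ref.shape:
        raise ValueError("intensity lists must have the same length")
    if np.any(ref <= 0):
        raise ValueError("reference intensities must be positive")
    return sat / ref


# Hypothetical intensities for three peaks, for illustration only.
values = het_noe([8.0, 6.0, 7.5], [10.0, 10.0, 10.0])
average = float(values.mean())
```

A histogram of `values` with its mean corresponds to the kind of summary shown for each design in panel a.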

Extended Data Table 1 NMR refinement statistics and quality scores for 0515

Extended Data Table 2 Crystallographic data collection and refinement statistics

Supplementary information

Supplementary Information

This file contains a Supplementary Discussion, Supplementary Table 1 and Supplementary Figs 1–7.

Reporting Summary

Peer Review File

About this article

Cite this article

Anishchenko, I., Pellock, S.J., Chidyausiku, T.M. et al. De novo protein design by deep network hallucination. Nature 600, 547–552 (2021). https://doi.org/10.1038/s41586-021-04184-w

Received: 18 September 2020
Accepted: 21 October 2021
Published: 01 December 2021
Issue Date: 16 December 2021
DOI: https://doi.org/10.1038/s41586-021-04184-w
