Three billion years of evolution has produced a tremendous diversity of protein molecules1, but the full potential of proteins is likely to be much greater. Accessing this potential has been challenging for both computation and experiments because the space of possible protein molecules is much larger than the space of those likely to have functions. Here we introduce Chroma, a generative model for proteins and protein complexes that can directly sample novel protein structures and sequences, and that can be conditioned to steer the generative process towards desired properties and functions. To enable this, we introduce a diffusion process that respects the conformational statistics of polymer ensembles, an efficient neural architecture for molecular systems that enables long-range reasoning with sub-quadratic scaling, layers for efficiently synthesizing three-dimensional structures of proteins from predicted inter-residue geometries and a general low-temperature sampling algorithm for diffusion models. Chroma achieves protein design as Bayesian inference under external constraints, which can involve symmetries, substructure, shape, semantics and even natural-language prompts. The experimental characterization of 310 proteins shows that sampling from Chroma results in proteins that are highly expressed, fold and have favourable biophysical properties. The crystal structures of two designed proteins exhibit atomistic agreement with Chroma samples (a backbone root-mean-square deviation of around 1.0 Å). With this unified approach to protein design, we hope to accelerate the programming of protein matter to benefit human health, materials science and synthetic biology.
Protein molecules perform most of the biological functions necessary for life, but creating them is a complicated task that has taken billions of years of evolution. The field of computational protein design aims to shorten this process by automating the design of functional proteins in a programmable manner. Although there has been considerable progress towards this goal over the past three decades2,3, including the design of previously unknown topologies, assemblies, binders, catalysts and materials4,5,6,7, most de novo designs have yet to approach the complexity and variety of macromolecules that are found in nature. Reasons for this include the fact that modelling the relationship between sequence, structure and function is difficult, and most methods of computational design rely on iterative search and sampling processes that, just like evolution, must navigate a rugged fitness landscape incrementally8. Although many computational techniques have been developed to accelerate this search3 and to improve the prediction of natural protein structures9, the space of possible proteins remains combinatorially large and is only partly accessible to conventional computational methods. Determining how to efficiently explore the space of designable protein structures remains an open challenge.
An alternative and potentially appealing approach to protein design is to sample directly from the space of proteins that is compatible with a set of desired functions. Although this approach could address the fundamental limitation of iterative search methods, it would require a way to parameterize the a priori ‘plausible’ protein space, a way to draw samples from this space, and a way to bias this sampling towards desired properties and functions. Deep generative models have proven successful in solving these kinds of high-dimensional modelling and inference problems in other domains, for example in the text-conditioned generation of photorealistic images10,11,12. For this reason, there has been considerable work to develop generative models of protein space, applied to both protein sequences13,14,15,16,17,18,19 and structures20,21,22,23,24,25,26.
Despite recent advances in generative models for proteins, we argue that there are three properties that have yet to be realized simultaneously in one system. These are: modelling the joint, all-atom likelihood of sequences and three-dimensional structures of full protein complexes; achieving this with computation that scales sub-quadratically with the size of the protein system; and enabling conditional sampling under diverse design constraints without retraining. The first of these, generating full complexes, is important because proteins function by interacting with other molecules, including other proteins. The second, the sub-quadratic scaling of computation, is important because it has been an essential ingredient for managing complexity in other modelling disciplines, such as computer vision, in which convolutional neural networks scale linearly with the number of pixels in an image, and in computational physics, which uses fast N-body methods for the efficient simulation of everything from stellar systems to molecular ones27. Finally, the requirement to sample from a model without having to retrain it on new target functions is of considerable interest because protein design projects often involve many complex and composite requirements that may vary over time.
Here we introduce Chroma, a generative model for proteins that achieves all three of these requirements by modelling full complexes with quasi-linear computational scaling and by allowing arbitrary conditional sampling at generation time. It builds on the framework of diffusion models28,29, which model high-dimensional distributions by learning to gradually transform them into simple distributions in a reversible manner, and of graph neural networks30,31, which can efficiently process geometric information in complex molecular systems. We show that Chroma generates high-quality, diverse and innovative structures that refold both in silico and in crystallographic experiments, and that it enables the programmable generation of proteins conditioned on diverse properties such as symmetry, shape, protein class and even textual input. We anticipate that scalable generative models such as Chroma will enable a widespread and rapid increase in our ability to design and build protein systems that are fit for function.
A scalable generative model for protein systems
Chroma achieves high-fidelity, efficient generation of proteins by introducing a new diffusion process, neural-network architecture, and sampling algorithm based on principles from contemporary generative modelling and biophysical knowledge. Diffusion models generate data by learning to reverse a ‘noising’ process, which for previous image-modelling applications has typically been uncorrelated Gaussian noise. By contrast, our model learns to reverse a correlated noise process to match the distance statistics of natural proteins, which have scaling laws that are well understood from biophysics (Fig. 1a, Supplementary Appendix D). Previous generative models for protein structure have typically leveraged computation that scales quadratically, O(N²) (refs. 24,25), or cubically, O(N³) (refs. 9,23), in the number of residues N. This has either limited their application to small systems or required large amounts of computation for modestly sized systems. To overcome this problem, Chroma introduces a novel neural-network architecture (Fig. 1b, Supplementary Figs. 4–8, Supplementary Tables 2–3 and Supplementary Appendices E–G) for processing and updating molecular coordinates that uses random long-range graph connections with connectivity statistics inspired by fast N-body methods27 and that scales sub-quadratically (O(N) or O(N log N); Supplementary Fig. 4 and Supplementary Appendix E). We found that these modelling components improved performance, as measured by likelihood and in silico refolding across an ablation study of seven different model configurations (Supplementary Fig. 22 and Supplementary Appendix L). Finally, we introduce methods for low-temperature sampling with a modified diffusion process that allows us to trade an increased quality of sampled backbones (increasing likelihood) for reduced conformational diversity (reducing entropy; Supplementary Figs. 1–2, Supplementary Table 4 and Supplementary Appendix C).
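The difference between uncorrelated and chain-correlated noise can be illustrated with a toy calculation. The sketch below is not Chroma's diffusion process: it assumes an ideal Gaussian chain (a random walk) as a simple stand-in for polymer-like correlated noise, and all function names are hypothetical. Under independent noise the mean squared distance between two residues does not depend on their sequence separation (it is about 6 for unit-variance noise in three dimensions), whereas under the chain model it grows in proportion to the separation (about 3 per unit step).

```python
import random

def uncorrelated_noise(n, rng):
    """Independent 3D Gaussian position for each of n residues."""
    return [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n)]

def chain_correlated_noise(n, rng):
    """Ideal-chain noise: each residue is the previous one plus a Gaussian step.

    Assumption for illustration only; this is not the covariance model
    used by Chroma (see Supplementary Appendix D of the paper for that).
    """
    coords, pos = [], [0.0, 0.0, 0.0]
    for _ in range(n):
        pos = [p + rng.gauss(0.0, 1.0) for p in pos]
        coords.append(pos)
    return coords

def mean_sq_distance(sampler, n, i, j, trials, rng):
    """Monte Carlo estimate of the mean squared distance between residues i and j."""
    total = 0.0
    for _ in range(trials):
        x = sampler(n, rng)
        total += sum((a - b) ** 2 for a, b in zip(x[i], x[j]))
    return total / trials

rng = random.Random(0)
for sep in (1, 4, 16):
    iid = mean_sq_distance(uncorrelated_noise, 20, 0, sep, 2000, rng)
    chain = mean_sq_distance(chain_correlated_noise, 20, 0, sep, 2000, rng)
    print(f"separation {sep:2d}: iid {iid:6.2f}  chain {chain:6.2f}")
```

A noise process with this kind of separation-dependent statistics keeps partially noised structures chain-like, which is the motivation the text gives for a correlated rather than uncorrelated process.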
Given backbones from this diffusion process, the Chroma design network then generates sequence and side-chain conformations that are conditioned on the sampled backbone to yield a joint generative model for the sequences and structure of a protein complex. The design network is based on a similar graph neural-network architecture (Supplementary Figs. 7, 8 and 15), but with conditional sequence and side-chain decoding layers that build on previous studies15,16 that have seen further refinement and experimental validation32,33,34.
An important aspect of our diffusion-based framework is that it enables programmability of proteins through conditional sampling under combinations of user-specified constraints. This is made possible by a key property of diffusion models: they learn a process that transforms a simple distribution into a complex data distribution through a sequence of many infinitesimal steps. These ‘microscopic’ steps, therefore, can be biased or constrained by different user-specified requirements to produce a new conditional diffusion process at design time. We built on this with a diffusion-conditioner framework that allows us to automatically sample from arbitrary mixtures of hard constraints and soft penalties implemented as composable primitives (Fig. 1c and Supplementary Appendix M). We explored several conditioner primitives including geometrical constraints that can outfill proteins from fixed substructures (Supplementary Appendix N), enforce particular distances between atoms (Supplementary Appendix O), graft motifs into larger structures (Supplementary Appendix P), symmetrize complexes under arbitrary symmetry groups (Supplementary Appendix Q) and enforce shape adherence to arbitrary point clouds (Supplementary Appendix R). We also explored the possibilities of semantic prompting by training neural guidance networks that predict multi-scale protein classifications (Supplementary Appendix S) and natural language annotations (Supplementary Appendix T) from protein structures. We can invert these predictive models by sampling proteins that optimize classifier predictions. Any subset of conditioners may then be composed for bespoke, on-demand protein generation subject to problem-specific requirements.
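The idea of biasing many small sampling steps with composable constraints can be sketched in one dimension. The example below is a toy under stated assumptions, not Chroma's conditioner framework: it uses plain Langevin dynamics on a standard normal prior, represents soft penalties as extra gradient terms and hard constraints as projections applied after each step, and every name in it is hypothetical.

```python
import math
import random

def prior_score(x):
    """Score (gradient of the log-density) of a standard normal prior."""
    return -x

def soft_target(target, weight):
    """Soft penalty -weight/2 * (x - target)**2, returned as its gradient."""
    return lambda x: -weight * (x - target)

def hard_lower_bound(lo):
    """Hard constraint x >= lo, enforced by projection after each step."""
    return lambda x: max(x, lo)

def conditioned_langevin(soft, hard, steps, step_size, rng):
    """Run Langevin dynamics with the prior score plus all soft-penalty gradients."""
    x = rng.gauss(0.0, 1.0)
    for _ in range(steps):
        score = prior_score(x) + sum(g(x) for g in soft)
        x += step_size * score + math.sqrt(2.0 * step_size) * rng.gauss(0.0, 1.0)
        for project in hard:  # simple projection; not an exact constrained sampler
            x = project(x)
    return x

rng = random.Random(0)
soft = [soft_target(target=3.0, weight=1.0)]
samples = [conditioned_langevin(soft, [], 200, 0.05, rng) for _ in range(1000)]
print(sum(samples) / len(samples))
```

With the single soft penalty shown, the target density is the product of two unit-variance Gaussians centred at 0 and 3, so the sample mean should land close to 1.5. Additional conditioners are composed by appending to the `soft` or `hard` lists, without changing the sampler.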
Analysis of unconditional samples
We sought to characterize the space of possible proteins parameterized by Chroma by generating a large number of unconditional samples of proteins and protein complexes (100,000 single-chain proteins and 20,000 complexes across two versions of the models, v.0 and v.1; Supplementary Appendix G and Supplementary Table 2). As can be seen in Fig. 2a, unconditional samples display many properties shared by natural proteins, such as complex layering of bundled α-helices and β-sheets in cooperative, unknotted folds. In some cases, we observed recognizable protein-complex configurations, including what seems to be an antibody–antigen complex in Fig. 2a (centre-right); note that the closest Protein Data Bank (PDB) structural matches to the two ‘antigen’ chains of this complex are at template-modelling (TM) scores41 of 0.46 and 0.43, indicating that this sample is not a result of memorization. We provide grids of random samples in Supplementary Figs. 9 and 10 for single-chain and complex structures, respectively. To quantitatively characterize the agreement of Chroma samples with natural proteins, we computed distributions of several key structural properties, including secondary-structure utilization, contact order35, length-dependent radius of gyration36, length-dependent long-range contact frequency and density of inter-residue contacts (Supplementary Table 5 and Supplementary Appendix J). We observe a general agreement of these statistics with corresponding distributions from the PDB (Supplementary Fig. 11), although we do see an overrepresentation of α-helices in the later version of Chroma (v.1) that seems to be a consequence of low-temperature sampling, which accentuates the already increased frequency of helices relative to strands in natural proteins (Supplementary Fig. 11b). 
Because these protein properties focus on low-order structural statistics, we also sought to characterize the extent to which they reproduce higher-order atomic geometries of natural protein structures. Natural protein structures exhibit considerable degeneracy in their use of local tertiary backbone geometries, such that completely unrelated proteins tend to use very similar tertiary motifs37,38. Chroma-generated structures exhibit the same type of degeneracy, utilizing natural tertiary motifs in a way that closely resembles native proteins, including complex tertiary geometries with four or five disjoint backbone fragments (Supplementary Fig. 11c and Supplementary Appendix J).
Although reproducing native-like properties of backbone geometries is important in design, our top priority is the extent to which the proteins can be realized as sequences that fold and function as intended. The definitive answer to this question involves experimental characterization (see below), but in silico evidence can be gathered more systematically. We sought to evaluate the fidelity of sequence–structure pairs generated by Chroma by measuring their agreement with three state-of-the-art methods for structure prediction9,39,40. We sampled one sequence for each backbone with Chroma’s design network and assessed whether each structure-prediction method would predict these sequences to fold into the corresponding generated structures (Supplementary Fig. 14 and Supplementary Appendix J). We observed widespread refolding of Chroma samples, whether stratified by protein length (Fig. 2e) or helical content and novelty (Supplementary Fig. 14). It is not surprising that successful refolding is less frequent for longer proteins, but it is remarkable that high TM scores41 are routinely achieved even for proteins more than 800 residues in length. Interestingly, helix content does not seem to be as strong a predictor of refolding as the distance to the nearest neighbour in the PDB (Supplementary Fig. 14, middle and bottom rows, respectively). We note that this sequence–structure consistency test is not perfect because it rests on the assumption that structure-prediction models will generalize to new folds and topologies. However, the test does provide partial supporting evidence for the generation of realizable protein models in instances in which the predicted and generated structures have strong agreement.
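For readers unfamiliar with the TM score used in the refolding test, the sketch below evaluates the standard TM-score formula for a fixed alignment and superposition. It is a simplification: the full TM score is the maximum of this quantity over superpositions, handling of very short chains is omitted, and the function name and arguments are hypothetical rather than taken from any tool used in the study.

```python
def tm_score(distances, target_length=None):
    """TM-score for residue pairs that are already aligned and superimposed.

    distances: per-residue distances (in angstroms) between aligned pairs.
    target_length: normalizing length; defaults to the number of pairs.
    """
    length = target_length if target_length is not None else len(distances)
    if length < 19:
        # d0 is not positive for very short chains; that case is omitted here.
        raise ValueError("short-chain handling is not included in this sketch")
    d0 = 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / length

# A perfect match scores 1; residues displaced by exactly d0 contribute 0.5 each.
print(tm_score([0.0] * 100))
```

Because each residue contributes at most 1 and the distance scale d0 grows with chain length, the score is bounded by 1 and is less sensitive to length than a raw root-mean-square deviation.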
Quantification of the structural homology between Chroma-generated samples and proteins in the PDB indicates that the model generates previously unseen structures at a frequency that increases sharply with length (Fig. 2c and Supplementary Fig. 12a). However, this analysis suffers from the problem that coverage of longer structures is expected to be lower in any finite database. To get a better understanding of the novelty of Chroma samples at different lengths, we defined a novelty score as the number of CATH42 domains required to greedily cover 80% of the residues in a protein at a TM score above 0.5, normalized by protein length (Supplementary Appendix J). Note that most valid proteins will be covered by at least some finite number of CATH domains because we retain even very small domains (such as single secondary-structural elements) in the coverage test. As shown in Supplementary Fig. 12c,d, there is a clear gap between native and Chroma-generated proteins by this metric, with most native backbones requiring approximately 2–5 times fewer CATH domains to be covered per length than generated backbones.
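The novelty score defined above can be sketched as a greedy set cover. The code below assumes that structural matching has already been done elsewhere: each entry of `domain_hits` is the set of residue indices covered by one CATH domain that matched at a TM score above 0.5. The function name, the input representation and the choice to return `None` when the target coverage cannot be reached are assumptions made for illustration.

```python
def greedy_novelty_score(length, domain_hits, coverage=0.8):
    """Number of domains greedily needed to cover `coverage` of the residues,
    normalized by protein length. Returns None if coverage is unreachable."""
    needed = coverage * length
    covered = set()
    remaining = [set(h) for h in domain_hits]
    used = 0
    while len(covered) < needed:
        # Pick the hit that adds the most not-yet-covered residues.
        best = max(remaining, key=lambda h: len(h - covered), default=None)
        if best is None or not (best - covered):
            return None
        covered |= best
        remaining.remove(best)
        used += 1
    return used / length

hits = [range(0, 5), range(5, 8), range(3, 6), range(8, 10)]
print(greedy_novelty_score(10, hits))
```

In this toy case two hits (residues 0–4 and 5–7) reach 80% coverage of a ten-residue chain, giving a score of 0.2. Higher scores mean that more, smaller known fragments are needed to explain the structure.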
We also find that samples from Chroma are diverse and cover natural protein space. In Supplementary Fig. 13, we present samples from Chroma and a set of native structures with global topology descriptors derived from knot theory43,44 and embed them into two dimensions with UMAP45. The resulting embedding seems to be semantically meaningful because subsets of structures belonging to different categories by size and secondary structures cluster in this projection (sub-panels on the left in Supplementary Fig. 13a). False colour of the points in the embedding shows that novelty is spread broadly and is not biased to only certain types of structure space. This is especially clear when looking at a representative selection of samples shown in Supplementary Fig. 13b.
An important aspect of Chroma is its programmability, which means it is straightforward to specify high-level desired protein properties (such as symmetry groups) that are compiled into a set of sampling conditioners that bias the diffusion process towards these properties (Fig. 1c, Supplementary Fig. 23 and Supplementary Appendix M). To demonstrate the range of protein properties that can be programmed with conditional generation, we explored several composable conditioning primitives (Supplementary Table 6, Supplementary Figs. 23–33 and Supplementary Appendices N–T). Although we believe that each of these represents only a preliminary demonstration of possible conditioning modes, they provide a glimpse of the potential for programmable protein design.
We began by considering analytic conditioners that can control protein backbone geometry. We found that conditioning on the symmetry of protein complexes can readily generate samples under arbitrary symmetry groups (Fig. 3a, Supplementary Figs. 17, 27–29 and Supplementary Appendix Q). Figure 3a illustrates symmetry-conditioned generation across many groups, from simple four-subunit cyclic symmetries up to a capsid-sized icosahedral complex with 60,000 total residues and more than 240,000 atoms. This also demonstrates why favourable computational scaling properties, such as quasilinear computation time (Supplementary Appendix E), are important, as efficient computation enables scaling to larger systems. Symmetric assemblies are common in nature and there have been some successes with de novo symmetric designs46,47, but it has generally been difficult to simultaneously optimize for both the desired overall symmetry and the molecular interaction details between protomers. Symmetry conditioning within the generation process in Chroma should make it simpler to sample structures that simultaneously meet both requirements.
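As a minimal illustration of what a symmetry group does to an asymmetric unit, the sketch below builds a cyclic (C_n) assembly by rotating a set of coordinates about the z axis. It shows only the group action; how Chroma applies symmetry inside the diffusion process is described in Supplementary Appendix Q and is not reproduced here. The function names and example coordinates are hypothetical.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def cyclic_assembly(asymmetric_unit, n):
    """Return n copies of the unit related by rotations of 2*pi/n about z."""
    return [[rotate_z(p, 2.0 * math.pi * k / n) for p in asymmetric_unit]
            for k in range(n)]

# Three made-up atom positions standing in for one protomer.
unit = [(10.0, 0.0, 0.0), (11.0, 1.5, 0.5), (12.0, 0.5, 1.0)]
complex_c4 = cyclic_assembly(unit, 4)
print(len(complex_c4), "protomers,", sum(len(c) for c in complex_c4), "atoms")
```

Because only the asymmetric unit is free, the number of independent coordinates stays fixed while the assembly size grows with the order of the group.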
We next explored substructure conditioning (Fig. 3b, Supplementary Figs. 16, 24–26, Supplementary Appendices N–P), which is a central problem for protein design because it can enable the preservation of one part of the structure of a protein (such as an active site) while modifying another part of the structure (and potentially function). In the top row, we cut the structure of human dihydrofolate reductase (DHFR; PDB code 1DRF) into two halves with a plane, remove one of the halves and regenerate the missing half. The cut plane introduces multiple discontinuities in the chain simultaneously, and the generative process must sample a solution that simultaneously satisfies these boundary conditions while being biophysically plausible. Nevertheless, the samples achieve both goals and, interestingly, do so in a manner very different from each other and from natural DHFR. In the second row of Fig. 3b, we cut out the complementarity-determining regions of a VHH antibody and rebuilt them conditioned on the remaining framework structure. Finally, the bottom three rows of Fig. 3b condition on sub-structure in an unregistered manner, meaning that the exact alignment of the substructure (motif) within the chain is not specified a priori, as it was in the previous examples. We outfilled the protein structure around several structural and functional motifs, including an αββ packing motif, backbone fragments encoding the catalytic triad active site of chymotrypsin and the EF-hand Ca-binding motif. Again, these motifs are accommodated in a realistic manner using diverse and structured solutions.
In Fig. 3c we provide an early demonstration of a more exotic kind of conditioning in which we attempted to solve for backbone configurations subjected to arbitrary volumetric shape specifications. We accomplished this by adding heuristic classifier gradients based on optimal transport distances48 between atoms in the structures and user-provided point clouds (Supplementary Appendix R). As a stress test of this capability, we conditioned the generation of single protein chains on the shapes of the Latin alphabet and Arabic numerals (Supplementary Fig. 18 and Supplementary Appendix K.3). We see the model routinely implementing several core phenomena of protein backbones, such as high secondary-structure content, close packing with room for designed side chains, and volume-spanning α-helical bundle and β-sheet elements. Although these shapes represent purely a challenging set of test geometries, more generally shape is intimately related to function in biology, for example, with membrane transporters, receptors and structured assemblies that organize molecular events in space. Being able to control shape would be a useful subroutine for generalized programmable protein engineering.
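An optimal transport distance between atoms and a target point cloud can be approximated in several ways. The sketch below uses Sinkhorn iterations for entropy-regularized transport between two uniformly weighted point clouds. This choice of algorithm, the regularization strength and the function names are assumptions for illustration; the text states only that optimal transport distances were used, and the sketch computes the cost alone rather than the gradients used for guidance.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sinkhorn_cost(xs, ys, eps=1.0, iters=200):
    """Transport cost of the entropy-regularized plan between two point clouds.

    No log-domain stabilization is used, so this is suitable only for small
    clouds whose squared distances are not large relative to eps.
    """
    n, m = len(xs), len(ys)
    cost = [[sq_dist(x, y) for y in ys] for x in xs]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match uniform marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
shifted = [(x + 3.0, y, z) for x, y, z in cloud]
print(sinkhorn_cost(cloud, cloud), sinkhorn_cost(cloud, shifted))
```

The cost is small when the atoms already occupy the target shape and grows as they move away from it, which is the property a shape conditioner needs from its penalty term.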
Finally, we demonstrate in Fig. 4 that it is possible to condition on protein semantics, such as secondary structure, fold class (Fig. 4a, Supplementary Figs. 19, 30 and Supplementary Appendix S) and natural language (Fig. 4b, Supplementary Figs. 20, 31–33, and Supplementary Appendix T). Unlike geometric conditioning, in which the classifier is correct by construction (for example, the presence of a motif with less than a certain root-mean-square deviation is unambiguous), here the classifiers are neural networks trained on structure data, so there can be a discrepancy between the label assigned by the classifier and the ground truth class. Thus, for the fold-conditioned generation (Fig. 4a), we see that conditional samples always improve classifier probabilities over unconditioned samples taken from the same random seed, but the classification is not always perfect. For example, for the ‘Rossmann fold’ class, the generated samples reproduce the canonical mixed topology. However, in the ‘Ig fold’ and ‘β-barrel fold’ examples, the structures exhibit some of the features characteristic of the classes (for example, β-sheets packed against each other) but do not contain all such features (for example, the Ig topology does not appear canonical and the barrel does not form a closed cycle). In Fig. 4b we demonstrate two examples of semantic conditioning on natural language captions, where we again occasionally observe alignment between samples and intended prompts, especially for highly represented protein classes. It is exciting to imagine the potential of such a capability, that is, being able to request desired protein features and properties directly through natural language prompts. Generative models such as Chroma can reduce the challenge of function-conditioned generation to the problem of building accurate classifiers for functions given structures. Although there is clearly much more work to be done to make this useful in practice, high-throughput experiments and evolutionary data are likely to enable this in the near term.
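The reduction of conditional generation to classification follows from Bayes' rule applied at each noise level of the diffusion. In the standard form of classifier guidance, written here with assumed notation (x_t for the noised structure at time t, y for the desired label, p_t for densities at noise level t), the score of the conditional distribution splits into the unconditional score learned by the diffusion model and the gradient of the classifier's log-probability:

```latex
p_t(x_t \mid y) = \frac{p_t(y \mid x_t)\, p_t(x_t)}{p_t(y)}
\quad\Longrightarrow\quad
\nabla_{x_t} \log p_t(x_t \mid y)
  = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t)
```

The normalizer p_t(y) does not depend on x_t and so drops out of the gradient. This is why a classifier that is inaccurate on noised or unusual structures directly limits the quality of the conditioned samples.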
Supplementary Appendix K demonstrates extensive in silico refolding studies of samples generated with the conditioners described above. As shown in Supplementary Figs. 16–20, all of these conditional-generation processes can produce samples that refold accurately to their generated backbones. The rates at which this happens vary according to the specific condition and protein length (and are subject to the caveats of this test mentioned above), but even in the challenging cases of shape-, complex symmetry-, class- and language-conditioned designs, we observe widespread refolding across specific conditions and structure prediction methods.
To experimentally validate Chroma, we built a simple design protocol (based on Chroma v.0) that was intended to generate high-likelihood samples drawn from the model. Specifically, the protocol involved three steps: generate backbones by drawing independent samples from Chroma at low temperature; design sequences for each backbone using the Chroma design network; and automatically select a subset for experimental characterization to match the desired experimental scale, driven primarily by sequence and/or structure likelihood (as shown in Supplementary Table 7 and Supplementary Appendix U.1). Notably, we deliberately did not filter designs for refolding by a structure-prediction method or using any structure–energetic calculations. However, such filtering could potentially be used to improve the success rate of design.
We generated 310 proteins (unconditional or semantically conditioned on CATH class or topology) for attempted expression and structural characterization (Fig. 5a). We first addressed an initial set of 172 unconditional proteins, ranging between 100 and 450 amino acids in length (Supplementary Fig. 36). We used a pooled protein solubility assay that was based on the split-GFP reporter system49 to prioritize tractable proteins for subsequent characterization (Supplementary Fig. 38a). After FACS and Nanopore sequencing (Supplementary Fig. 38b), enrichment scores were assigned to categorize the soluble expression levels of each protein (Supplementary Fig. 38c). All 172 tested proteins were assigned higher enrichment scores than the negative control (human β3 adrenergic receptor, Supplementary Table 8), indicating that a wealth of Chroma-designed unconditional proteins can be solubly expressed in Escherichia coli (Fig. 5b). We confirmed stable fluorescence in sorted cell populations (Supplementary Fig. 38d) and corroborated our split-GFP screen results using western blotting, observing soluble expression of 19 of the 20 top-scoring proteins and 0 of the 20 lowest-scoring proteins (Supplementary Fig. 39). We created an additional set of 96 unconditional Chroma proteins encompassing a wider range of lengths (from 100 to 950 amino acids; Supplementary Fig. 40a), which performed similarly to the first unconditional protein set using the split-GFP reporter assay (Supplementary Fig. 40b,c). In this additional set, soluble expression of nine of the ten top-scoring proteins was confirmed by western blotting (Supplementary Fig. 40d).
Of the proteins identified in the top 10% of the split-GFP solubility screen, we purified seven for interrogation using circular dichroism (CD; Fig. 5e) and differential scanning calorimetry (Supplementary Fig. 41 and Extended Data Table 1). The results indicate that most of the isolated proteins were stably folded with appreciable secondary structure. From these proteins, we were able to obtain X-ray crystal structures (Extended Data Table 2) for UNC_079 (PDB 8TNM; Fig. 5c) and UNC_239 (PDB 8TNO; Fig. 5d). The observed structures matched the anticipated designs to a high degree (root-mean-square deviation = 1.1 Å and 1.0 Å, respectively), indicating that Chroma-generated structures are realizable. Importantly, these structures are unique with respect to the PDB, with the top PDB hit to UNC_079 (PDB entry 4NH2, chain E) having query and target TM scores of 0.7 and 0.3, respectively, and the top hit to UNC_239 (PDB entry 6AFV, chain A) having query and target TM scores of 0.5 and 0.23, respectively (Fig. 5c,d).
The results of the split-GFP assay show that it is more difficult to succeed with longer designs, because there is an inverse correlation between length and split-GFP score (Supplementary Fig. 34). Interestingly, although we might expect the extent of refolding by structure prediction to also correlate with experimental success, we saw no correlation when length is corrected for (Supplementary Fig. 34). Similarly, we saw no correlation between soluble expression and structural novelty. We did find model likelihoods to be weakly predictive of experimental success for the first unconditional set, but this did not hold true for the second set, in which lengths were extended up to 950 amino acids (Supplementary Fig. 35).
To test the ability of Chroma to propose well-behaved proteins in a conditioned setting, we next evaluated a set of 42 proteins conditioned by ProClass on CATH class (36 designs split among the classes mainly α, mainly β and mixed α/β) and on CATH topology (six designs conditioned on the β-barrel topology 2.40.155; Supplementary Fig. 37a). In the split-GFP solubility assay, 40 of these proteins (95%) scored above the negative control, indicating a high success rate of soluble protein expression (Supplementary Fig. 37b). We purified one representative protein from each secondary-structure category (two designs conditioned on mainly-α and mixed α/β classes, and one design conditioned on the β-barrel topology). Differential scanning calorimetry data for these proteins were consistent with relatively stable folding, with melting temperatures ranging from 64 °C to 78 °C (Supplementary Fig. 37c). On the basis of secondary-structure predictions from CD spectra50, we observed higher α-helical content in the mainly-α design, higher β-sheet content in the β-barrel design, and mixed secondary structure in the mixed-content protein (Fig. 5f). Indeed, across both conditional and unconditional designs, the inferred secondary-structure content from CD was closely correlated with the secondary-structure content calculated from Chroma-generated models, for both the fraction of α-helices (R² = 0.84; Fig. 5g) and β-sheets (R² = 0.51; Fig. 5h), indicating that proteins with various structural compositions can be designed by Chroma.
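The text does not state how the reported R² values were computed. One common convention, assumed here, is the squared Pearson correlation between the CD-inferred and model-calculated fractions. The sketch below implements that convention; the function name is hypothetical and the numbers in the usage line are made up for illustration, not data from the study.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

# Made-up helix fractions: model-calculated versus CD-inferred.
model_fraction = [0.10, 0.35, 0.50, 0.70, 0.90]
cd_fraction = [0.15, 0.30, 0.55, 0.60, 0.85]
print(round(r_squared(model_fraction, cd_fraction), 3))
```

Under this convention, a value of 0.84 would mean that about 84% of the variance in one quantity is explained by a linear fit to the other.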
In this work we present Chroma, a generative model that can generate new and diverse proteins across a broad array of structures and properties. Chroma is programmable in the sense that it can sample proteins with a wide array of user-specified properties, including inter-residue distance and contact, domain, sub-structure and semantic specification from classifiers. Chroma is able to generate proteins that have arbitrary and complex shapes, and it has even begun to demonstrate the ability to accept descriptions of desired properties as free text. Its efficient design, with an innovative diffusion process, quasilinear scaling neural architecture and low-temperature sampling method, means that Chroma can generate extremely large proteins and protein complexes (with more than 3,000 residues) on a commodity graphics processing unit (such as an NVIDIA V100) in a few minutes.
We reasoned that the best way to determine the plausibility of the protein space parameterized by Chroma was to draw independent samples from the model and test them experimentally. Note that this is a departure from the prototypical protein-design protocol, in which initial proposal designs are down-selected using a custom set of filters intended to avoid known or hypothesized model deficiencies and help focus on designs that are more likely to work experimentally. Although the latter practice, which is broadly adopted in the field, can be effective at increasing design success rates, it does require a custom set of filters for each design project and makes fully automated design difficult to achieve. Furthermore, such an approach would detract from our intention of characterizing the distribution learned by Chroma.
Our experimental validation shows that Chroma has learnt a sufficiently accurate distribution such that sampling from it results in proteins that express, fold, have favourable biophysical properties and conform to intended structures at non-trivial rates. Even under the highly conservative view that only the proteins we purified and characterized individually in solution constitute successful designs (as opposed to others that performed comparably by split-GFP, for example), Chroma would still have a 3% success rate. Moreover, the two designs with experimentally determined crystal structures demonstrate that a non-trivial fraction of this distribution should be expected to be atomistically accurate. Given the breadth and novelty of the structure space learned by Chroma (Fig. 2 and Supplementary Figs. 9, 10 and 13), even these conservative estimates of success rate would translate into immense swaths of unexplored actionable protein space that can now be accessible through commodity computing hardware.
The task of exploring protein structure space in a way that can produce physically reasonable and designable conformations has been a long-standing challenge in protein design. In a few protein systems, it has been possible to parameterize the backbone conformation space mathematically (most notably the α-helical coiled coil51 and a few other cases that have high symmetry52), and in these cases design efforts have benefited tremendously, creating possibilities that are not available in other systems52,53. For all other structure types, however, a great amount of computational time has been spent on the search for reasonable backbones, often leaving the actual functional specifications out of reach. Chroma has the potential to address this problem, enabling a shift from generating feasible structures towards the specific task at hand, namely what the protein is intended to do. By leveraging proteins sampled over more than 3 billion years of evolution, and by finding new ways to assemble stable protein matter, generative models such as Chroma are well poised to drive another expansion of biomolecular diversity with benefits for human health and bioengineering.
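As an illustration of what a mathematical parameterization of backbone space looks like for the α-helical coiled coil, the sketch below places Cα atoms using the classical Crick equations, in which a minor helix of radius R1 is wound around a superhelix of radius R0. It is a simplified example and not part of Chroma. The function name `crick_coiled_coil_ca` and all default parameter values (R0 = 5.0 Å, R1 = 2.26 Å, a rise of 1.51 Å per residue, a heptad-compatible minor-helix frequency of 4π/7 and a left-handed superhelical frequency of about −2.85° per residue) are assumptions chosen as typical textbook values.

```python
import math


def crick_coiled_coil_ca(n_residues, r0=5.0, r1=2.26,
                         omega0=math.radians(-2.85),
                         omega1=4.0 * math.pi / 7.0,
                         rise=1.51, phi0=0.0, phi1=0.0):
    """Return C-alpha coordinates (angstroms) for one helix of an ideal coiled coil.

    r0, omega0, phi0: radius, angular frequency per residue and phase of the superhelix.
    r1, omega1, phi1: the same quantities for the minor (alpha) helix.
    rise: distance travelled along the superhelical path per residue.
    All defaults are illustrative assumptions.
    """
    s = r0 * omega0 / rise
    if abs(s) > 1.0:
        raise ValueError("r0 * omega0 must not exceed the rise per residue")
    # Pitch angle of the superhelix, fixed by the constant rise per residue
    alpha = math.asin(s)
    coords = []
    for t in range(n_residues):
        a0 = omega0 * t + phi0  # superhelical phase of residue t
        a1 = omega1 * t + phi1  # minor-helix phase of residue t
        x = (r0 * math.cos(a0) + r1 * math.cos(a0) * math.cos(a1)
             - r1 * math.cos(alpha) * math.sin(a0) * math.sin(a1))
        y = (r0 * math.sin(a0) + r1 * math.sin(a0) * math.cos(a1)
             + r1 * math.cos(alpha) * math.cos(a0) * math.sin(a1))
        z = rise * math.cos(alpha) * t - r1 * math.sin(alpha) * math.sin(a1)
        coords.append((x, y, z))
    return coords


# Usage: four heptads of one helix, then check the spacing of consecutive C-alpha atoms
helix = crick_coiled_coil_ca(28)
steps = [math.dist(a, b) for a, b in zip(helix, helix[1:])]
print(f"{len(helix)} residues, CA-CA spacing {min(steps):.2f} to {max(steps):.2f} angstroms")
```

With only a handful of parameters the whole backbone trace is determined, which is what makes such systems convenient to search; most other folds have no comparably compact description, which is the gap that a learned generative model is meant to fill.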
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
All experimental and computational results are available in the Supplementary Information and Extended Data Tables 1 and 2. Experimental structures solved as part of this study were deposited under PDB accession codes 8TNM and 8TNO. Training datasets were constructed based on the PDB (https://www.rcsb.org/), as queried on 20 March 2022, UniProt 2022_01 (https://www.uniprot.org) and PFAM 35 (http://pfam.xfam.org/). PDB IDs comprising Chroma training, test and validation sets are available in the Zenodo dataset at https://doi.org/10.5281/zenodo.8285077.
Chroma code is available at https://github.com/generatebio/chroma under the Apache 2.0 open-source licence.
1. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49, D480–D489 (2021).
2. Kuhlman, B. & Bradley, P. Advances in protein structure prediction and design. Nat. Rev. Mol. Cell Biol. 20, 681–697 (2019).
3. Huang, P.-S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320–327 (2016).
4. Koga, N. et al. Principles for designing ideal protein structures. Nature 491, 222–227 (2012).
5. Cao, L. et al. Design of protein-binding proteins from the target structure alone. Nature 605, 551–560 (2022).
6. Kries, H., Blomberg, R. & Hilvert, D. De novo enzymes by computational design. Curr. Opin. Chem. Biol. 17, 221–228 (2013).
7. Joh, N. H. et al. De novo design of a transmembrane Zn2+-transporting four-helix bundle. Science 346, 1520–1524 (2014).
8. Smith, J. M. Natural selection and the concept of a protein space. Nature 225, 563–564 (1970).
9. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
10. Ramesh, A. et al. Zero-shot text-to-image generation. In Proc. 38th International Conference on Machine Learning (eds Meila, M. et al.) 8821–8831 (PMLR, 2021).
11. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical text-conditional image generation with CLIP latents. Preprint at https://arxiv.org/abs/2204.06125 (2022).
12. Saharia, C. et al. Photorealistic text-to-image diffusion models with deep language understanding. In Proc. Advances in Neural Information Processing Systems 35 (eds Koyejo, S. et al.) 36479–36494 (NeurIPS, 2022).
13. Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
14. Greener, J. G., Moffat, L. & Jones, D. T. Design of metalloproteins and novel protein folds using variational autoencoders. Sci. Rep. 8, 16189 (2018).
15. Ingraham, J., Garg, V., Barzilay, R. & Jaakkola, T. Generative models for graph-based protein design. In Proc. Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) (NeurIPS, 2019).
16. Anand, N. et al. Protein sequence design with a learned potential. Nat. Commun. 13, 746 (2022).
17. Madani, A. et al. ProGen: language modeling for protein generation. Preprint at http://arxiv.org/abs/2004.03497 (2020).
18. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
19. Notin, P. et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In Proc. 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 16990–17017 (PMLR, 2022).
20. Anand, N. & Huang, P.-S. Generative modeling for protein structures. In Proc. Advances in Neural Information Processing Systems 31 (eds Bengio, S. et al.) (NeurIPS, 2018).
21. Lin, Z., Sercu, T., LeCun, Y. & Rives, A. Deep generative models create new and diverse protein structures. In Machine Learning in Structural Biology Workshop at the 35th Conference on Neural Information Processing Systems (MLSB, 2021).
22. Eguchi, R. R., Choe, C. A. & Huang, P.-S. Ig-VAE: Generative modeling of protein structure by direct 3D coordinate generation. PLoS Comput. Biol. 18, e1010271 (2022).
23. Anand, N. & Achim, T. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models. Preprint at https://arxiv.org/abs/2205.15019 (2022).
24. Trippe, B. L. et al. Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. In Proc. 11th International Conference on Learning Representations (eds Kim, B. et al.) (OpenReview.net, 2023).
25. Wu, K. E. et al. Protein structure generation via folding diffusion. Preprint at https://arxiv.org/abs/2209.15611 (2022).
26. Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).
27. Barnes, J. & Hut, P. A hierarchical O(N log N) force-calculation algorithm. Nature 324, 446–449 (1986).
28. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proc. 32nd International Conference on Machine Learning Vol. 27 (eds Bach, F. et al.) 2256–2265 (PMLR, 2015).
29. Song, Y. et al. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (eds Hofmann, K. et al.) (OpenReview.net, 2021).
30. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. In Proc. 34th International Conference on Machine Learning (eds Precup, D. et al.) 1263–1272 (PMLR, 2017).
31. Battaglia, P. W. et al. Relational inductive biases, deep learning, and graph networks. Preprint at https://arxiv.org/abs/1806.01261 (2018).
32. Jing, B., Eismann, S., Suriana, P., Townshend, R. J. L. & Dror, R. Learning from protein structure with geometric vector perceptrons. In International Conference on Learning Representations (eds Hofmann, K. et al.) (OpenReview.net, 2021).
33. Hsu, C. et al. Learning inverse folding from millions of predicted structures. In Proc. 39th International Conference on Machine Learning Vol. 162 (eds Chaudhuri, K. et al.) 8946–8970 (PMLR, 2022).
34. Dauparas, J. et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).
35. Plaxco, K. W., Simons, K. T. & Baker, D. Contact order, transition state placement and the refolding rates of single domain proteins. J. Mol. Biol. 277, 985–994 (1998).
36. Tanner, J. J. Empirical power laws for the radii of gyration of protein oligomers. Acta Crystallogr. D 72, 1119–1129 (2016).
37. Mackenzie, C. O., Zhou, J. & Grigoryan, G. Tertiary alphabet for the observable protein structural universe. Proc. Natl Acad. Sci. USA 113, E7438–E7447 (2016).
38. Zhou, J., Panaitiu, A. E. & Grigoryan, G. A general-purpose protein design framework based on mining sequence–structure relationships in known protein structures. Proc. Natl Acad. Sci. USA 117, 1059–1068 (2020).
39. Wu, R. et al. High-resolution de novo structure prediction from primary sequence. Preprint at bioRxiv https://doi.org/10.1101/2022.07.21.500999 (2022).
40. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
41. Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
42. Sillitoe, I. et al. CATH: increased structural coverage of functional space. Nucleic Acids Res. 49, D266–D273 (2021).
43. Røgen, P. & Fain, B. Automatic classification of protein structure by using Gauss integrals. Proc. Natl Acad. Sci. USA 100, 119–124 (2003).
44. Harder, T., Borg, M., Boomsma, W., Røgen, P. & Hamelryck, T. Fast large-scale clustering of protein structures using Gauss integrals. Bioinformatics 28, 510–515 (2012).
45. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. 3, 861 (2018).
46. Wicky, B. I. M. et al. Hallucinating symmetric protein assemblies. Science 378, 56–61 (2022).
47. King, N. P. et al. Accurate design of co-assembling multi-component protein nanomaterials. Nature 510, 103–108 (2014).
48. Peyré, G. & Cuturi, M. Computational optimal transport: with applications to data science. Found. Trends Mach. Learn. 11, 355–607 (2019).
49. Cabantous, S., Terwilliger, T. C. & Waldo, G. S. Protein tagging and detection with engineered self-assembling fragments of green fluorescent protein. Nat. Biotechnol. 23, 102–107 (2005).
50. Micsonai, A. et al. BeStSel: webserver for secondary structure and fold prediction for protein CD spectroscopy. Nucleic Acids Res. 50, W90–W98 (2022).
51. Grigoryan, G. & DeGrado, W. F. Probing designability via a generalized model of helical bundle geometry. J. Mol. Biol. 405, 1079–1100 (2011).
52. Woolfson, D. N. et al. De novo protein design: how do we expand into the universe of possible protein structures? Curr. Opin. Struct. Biol. 33, 16–26 (2015).
53. Beesley, J. L. & Woolfson, D. N. The de novo design of α-helical peptides for supramolecular self-assembly. Curr. Opin. Biotechnol. 58, 175–182 (2019).
We thank W. F. DeGrado, R. Kormos and Generate employees A. Ramos, A. Delhagen, A. Jecrois, B. R. P. Saravanan, B. Hannigan, B. Patuto, B. Vogler, D. Moonan, D. Curran, D. Ferguson, E. Brignole, E. Palovcak, J. Lucas, J. McFarland, J. Huaman-Argandona, J. Garlick, K. Tamang, K. Hopson, M. Pattie, M. Jankowiak, M. Saputo, M. Nally, M. Mathur, M. Gibson, N. Shaban, N. Joh, R. Chaudhary, R. Federman, S. Clancy, S. DeCamp, T. Linsky, Y. Liu and Z. Harteveld for assistance with experimental and computational methods development, discussions and input on manuscript drafts; and B. Turner and staff at the MIT Biophysical Instrumentation Facility for providing training and access to the CD spectrometer. The study used the resources of the MIT Structural Biology Core Facility and the MIT Biophysical Instrumentation Facility.
All authors are employees and shareholders of Generate Biomedicines.
Peer review information
Nature thanks Arne Elofsson, Alex Pritzel and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Extended data figures and tables
Supplementary Information: extended descriptions of the Chroma model, sampling algorithms, conditioners, in silico evaluations, an ablation study, training pipelines and the experimental validation, including appendices, supplementary figures and supplementary tables.
Cite this article
Ingraham, J.B., Baranov, M., Costello, Z. et al. Illuminating protein space with a programmable generative model. Nature 623, 1070–1078 (2023). https://doi.org/10.1038/s41586-023-06728-8