Main

The structural diversity of proteins underlies their functional variety. The overall structure of a protein is determined by its fold, that is, the spatial arrangement of, and connections between, the secondary structure elements. Hundreds of thousands of naturally occurring protein structures have been solved and deposited in the Protein Data Bank (PDB), and the number continues to grow. However, in recent years, novel protein folds have rarely been discovered1,2,3, suggesting that nearly all folds existing in nature have been found. This does not necessarily indicate that all folds accessible to the polypeptide chain have been uncovered. Although debated4,5,6,7, it has been suggested that nature may have sampled only a small fraction of the possible fold space during evolution5,6,7. We investigated this hypothesis through de novo protein design for the folds that have not been sampled by natural evolution.

Recently developed principles for designing protein structures have made possible the design of a wide range of new proteins from scratch8,9,10,11,12, allowing exploration of the huge sequence space beyond that sampled by natural evolution. However, exploration of the fold space has so far been limited to naturally occurring protein folds8,9,10,11,12, except for one new fold of a protein called Top7 (ref. 13). To explore the fold space beyond that sampled by natural evolution, a ‘map’ to search for the folds that are possible, but not observed in nature (that is, novel folds), is indispensable. Therefore, we defined a set of rules for β-sheet topology to predict novel αβ-folds, and we carried out a systematic exploration of novel αβ-folds through de novo protein design, guided by these rules.

Results

αβ-Folds not observed in nature

The αβ-folds, most of which are involved in enzymatic functions14, account for more than half of the protein folds identified so far15. We first sought to identify unobserved αβ-folds with a three- to eight-stranded open β-sheet, that is, a β-sheet that does not form a barrel. We defined αβ-folds in a more abstract manner on the basis of their β-sheet topology, that is, the number, order and orientation of constituent β-strands in a β-sheet (Fig. 1a). Moreover, we considered only those folds with right-handed connections between parallel-aligned β-strands, as per Richardson’s rule16 (Fig. 1b). This gave rise to \(n!\times {2}^{n-2}\) patterns in total for αβ-folds with an n-stranded β-sheet (for example, 96 patterns for n = 4), including numerous αβ-folds not observed in nature (Fig. 1c). However, not all of the unobserved folds identified in this way are physically possible. For example, the fold shown in the lower panel in Fig. 1a is not possible because the two β-strand connections overlap. Therefore, we introduced a criterion that predicts possible αβ-folds among all patterns of β-sheet topologies on the basis of a set of rules for β-sheet topology.
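The pattern count can be checked by brute force. The following Python sketch is an illustration only, not the authors' code. It assumes that a topology is described by a left-to-right strand order plus one up/down orientation per sheet position, and that two descriptions denote the same topology when they differ only by reversing the left-to-right order, by inverting every strand direction, or by both. This symmetry is an assumption chosen because it reproduces the stated formula; the function names are hypothetical.

```python
from itertools import permutations, product
from math import factorial


def formula_count(n: int) -> int:
    """Number of patterns given in the text: n! * 2**(n - 2)."""
    return factorial(n) * 2 ** (n - 2)


def enumerate_topologies(n: int) -> set:
    """Enumerate (order, orientation) pairs up to the assumed symmetries."""
    unique = set()
    for order in permutations(range(1, n + 1)):
        for orient in product((1, -1), repeat=n):
            inverted = tuple(-o for o in orient)
            # The four equivalent descriptions of one sheet (assumption).
            images = [
                (order, orient),
                (order, inverted),
                (order[::-1], orient[::-1]),
                (order[::-1], inverted[::-1]),
            ]
            # Keep one canonical representative per equivalence class.
            unique.add(min(images))
    return unique


if __name__ == "__main__":
    for n in range(3, 9):
        print(n, formula_count(n))
    print(len(enumerate_topologies(4)))
```

For n = 4, both the formula and the enumeration give 96, which matches the 96 four-stranded topologies discussed later in the text.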

Fig. 1: Observed and unobserved β-sheet topologies in nature.

a, αβ-Folds defined on the basis of the β-sheet topology and Richardson’s right-handed strand connections16 shown in b. The upper panel shows a β-sheet topology frequently observed in nature and its corresponding ferredoxin-like fold, the lower panel shows a β-sheet topology unobserved in nature and its corresponding fold. Each β-strand is numbered according to its order along the linear chain. Gray-colored β-strand connections are on the front side of the β-sheet, black-colored ones are on the back side. b, Richardson’s rule on the connection handedness of para-β-X-β motifs16. The right-handed strand connection (dark gray bar) rather than the left-handed one (light gray bar) is predominantly observed in naturally occurring proteins. c, Numbers of observed, unobserved and theoretically possible β-sheet topologies for each number of constituent β-strands in a β-sheet (see Fig. 3 and Methods for the definition of observed and unobserved topologies).

Rules for β-sheet topology

We derived a set of rules from the conformational preferences of β-X-β motifs in naturally occurring protein structures, where X represents any backbone conformation (Methods): the connection jump-distance rule for single β-X-β motifs, and the connection overlap and connection ending rules for pairs of β-X-β motifs.

Connection jump-distance rule

‘The large number of intervening β-strands between the two β-strands is disfavored: the number of the intervening β-strands (that is, jump distance) for parallel (para) β-X-β motifs is less than four and that for antiparallel (anti) β-X-β motifs is less than two (Fig. 2a)’. An exception is the anti-β-X-β motif with two intervening β-strands included in the Greek key β-sheet topology and its circular permutations (the dotted bar in Fig. 2a and the topologies with asterisks in Fig. 3d).
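As an illustration of the connection jump-distance rule, the sketch below checks a topology written in the notation of Fig. 3d. It is a minimal sketch under stated assumptions, not the authors' implementation: the strand order and the arrows are both read left to right across the sheet, each β-X-β motif joins two strands that are consecutive along the chain, and the Greek key exception is not modeled. The function name is hypothetical.

```python
def jump_distance_violations(order: str, orientation: str) -> list:
    """Return the beta-X-beta motifs that break the jump-distance rule.

    order: strand numbers listed left to right across the sheet, e.g. "1342".
    orientation: one arrow per sheet position, left to right, e.g. "↑↓↓↓".
    Each violation is (first strand, second strand, motif type, jump distance).
    """
    position = {int(strand): i for i, strand in enumerate(order)}
    violations = []
    for first in range(1, len(order)):
        second = first + 1
        p1, p2 = position[first], position[second]
        jump = abs(p1 - p2) - 1  # number of intervening strands
        parallel = orientation[p1] == orientation[p2]
        # Rule from the text: para < 4 and anti < 2 intervening strands.
        limit = 3 if parallel else 1
        if jump > limit:
            violations.append((first, second, "para" if parallel else "anti", jump))
    return violations


if __name__ == "__main__":
    print(jump_distance_violations("1234", "↑↓↑↓"))  # meander
    print(jump_distance_violations("1342", "↑↓↓↓"))
```

Under these assumptions the meander topology passes, whereas the 1342 ↑↓↓↓ topology is flagged because strands 1 and 2 form an antiparallel motif with two intervening strands.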

Fig. 2: Rules for β-sheet topology.

a, Connection jump-distance rule. The jump distance is the number of intervening β-strands between the two β-strands of β-X-β motifs. Para-β-X-β motifs with jump distances of three or less and anti-β-X-β motifs with jump distances of one or less are frequently observed compared to β-X-β motifs with larger jump distances. The same preferences have been previously reported49. We revisited them using the current PDB data. b, Connection overlap rule. D-type β-sheet topologies (loops are located on different sides) are more frequently observed than S-type topologies (loops are located on the same side). Blue- and red-colored motifs indicate two different β-X-β motifs. Similar rules have been reported for para-para-β-X-β motifs21,50. For anti-anti-β-X-β motifs, a rule termed ‘pretzels’ has been reported20,21, but this rule prohibits both S- and D-types. c, Connection ending rule. S- and D- types of β-sheet topologies for pairs of para-β-X-β motifs, in which the second strands of the two motifs are adjacent and parallel-aligned, are shown. S-type β-sheet topologies are more frequently observed than D-type topologies.

Connection overlap rule

‘Geometrical overlap between the connections of two β-X-β motifs is less favorable: the β-sheet topologies with the two connections on the same side (S-type) are less favorable than those with the two connections on different sides (D-type) (Fig. 2b)’. Para-β-X-β motifs have a right-handed connection preference according to Richardson’s rule (Fig. 1b). Analysis of anti-β-X-β motifs in naturally occurring protein structures revealed that the connections in anti-β-X-β motifs with a jump-distance number of one preferentially have a right-handed bending orientation (Extended Data Fig. 1). These right-handed connection preferences led to the connection overlap rule (Fig. 2b).

Connection ending rule

‘When the second strands in two para-β-X-β motifs are adjacent to each other and aligned in parallel, the β-sheet topologies with the two connections ending on the same β-sheet side (S-type) are preferred over those with the connections ending on different β-sheet sides (D-type) (Fig. 2c)’. Analysis of para-β-X-β motifs revealed that register shifts between the second strand in a para-β-X-β motif and the adjacent parallel-aligned β-strands are nearly always zero or positive17 (Extended Data Fig. 2). In addition, we previously described the αβ rule: the vector from the Cα to Cβ atoms of the first strand residue following the loop connecting the helix to the strand points away from the helix8. These two preferences led to the connection ending rule (Fig. 2c and Extended Data Fig. 3).

Prediction of nonfrustrated novel four-stranded αβ-folds

Using the set of rules for β-sheet topology, we classified all open β-sheet topologies with three to eight strands into frustration-free ones, without violations of the rules, and frustrated ones, with violations. We regarded frustration-free topologies as possible topologies. Many of the observed αβ-folds were identified as frustration-free, whereas most of the unobserved and scarcely observed αβ-folds were identified as frustrated (Fig. 3a,b; Methods). Moreover, the frustration-free β-sheet topologies were observed in more homologous groups (that is, evolutionarily independent groups, which are referred to as superfamilies in SCOP2 and CATH3) than the frustrated ones (Fig. 3c; Methods). These results suggest that the set of rules can distinguish possible β-sheet topologies from all β-sheet topologies.

Fig. 3: Distributions of frustration-free and frustrated β-sheet topologies in nature.

a, Numbers of frustration-free and frustrated β-sheet topologies in each observed or unobserved topology in nature for each number of constituent β-strands in a β-sheet. The number within each bracket indicates the percentage of unobserved topologies in frustration-free topologies. b, Observation frequencies of all possible 96 topologies for four-stranded β-sheets sorted by frequency. The observation frequency of a topology in nature is represented by the number of homologous groups (superfamily) having the topology (see Methods for details). We regarded topologies with an observation frequency of less than 1/4, at which the slope changes substantially, as unobserved. c, Ratios of frustration-free and frustrated β-sheet topologies depending on the observation frequency for each number of constituent β-strands in a β-sheet. The number in each band indicates the number of each topology. The observation frequency is presented as the logarithm to base 4. d, Distributions of frustration-free and frustrated topologies in nature for all possible 96 topologies of four-stranded β-sheets. β-Strand order indicates in which order the β-strands, numbered along the sequence, are aligned in a β-sheet from left to right; β-strand orientation indicates orientations of the β-strands. In each grid cell, a β-sheet topology is illustrated with its observation frequency in nature indicated by the number below the topology and the background color gradient from white (low frequency) to yellow (high frequency). Frustration-free and frustrated topologies are represented in dark gray and light gray, respectively. β-Sheet topologies corresponding to the Greek key and its circular permutations are marked with an asterisk. Red-colored loops represent topologies including at least one frustration. Topologies enclosed in a bold black square and numbered from one to eight are unobserved frustration-free β-sheet topologies.

The 96 patterns of the frustrated and frustration-free β-sheet topologies of four-stranded αβ-proteins are shown in Fig. 3d. About half of the topologies (53 patterns) were frustrated, 37 of which are either unobserved or very rare in nature. For example, the frustrated topology in column 1342 (strand order), row ↑↓↓↓ (strand orientation), which violates the connection jump distance and connection overlap rules (violations are indicated in red), has not been observed in nature. The other half (43 patterns) were frustration-free β-sheet topologies, 35 of which have been observed in nature. For example, the frustration-free β-sheet topology in column 1234, row ↑↓↑↓, termed ‘meander’, is the most frequently observed one. We identified eight frustration-free β-sheet topologies that have not been observed (numbered 5 to 8) or have rarely been observed (numbered 1 to 4) in nature. The latter are barely conserved in the Homology groups to which they belong. This evolutionary instability indicates the possibility that proteins with the topologies may not be robustly foldable. We regarded the αβ-folds with these eight β-sheet topologies as possible and unobserved folds (that is, novel folds) and attempted to carry out de novo design for all the predicted folds. Note that the β-sheet topology that consists of parallel-aligned β-strands with a 3142 strand order (numbered 8 in the figure) forms a knot; this topology has not been observed in nature and has long been considered to be impossible to exist18,19,20,21. However, we selected this topology for de novo protein design.

De novo design of all predicted novel four-stranded αβ-folds

To evaluate whether the predicted novel αβ-folds can be created, we carried out de novo design of αβ-fold proteins with the eight predicted novel four-stranded β-sheet topologies (Fig. 4a,b). The αβ-folds were named NF1 to NF8 according to the order of the observation frequencies of their β-sheet topologies; NF1 to NF4 have been scarcely observed, and NF5 to NF8 have never been observed in nature (NF6–NF8 have been reported as unobserved folds20). We sought to design the novel αβ-folds with ideal and simple structures, in which the secondary structures do not have β-bulges or α-helix kinks and the X region in para-β-X-β motifs is an α-helix. For each αβ-fold, we built a backbone blueprint, in which secondary structure lengths and loop ABEGO torsion patterns (‘A’ indicates the alpha region of the Ramachandran plot, ‘B’, the beta region, ‘G’ and ‘E’, the positive phi region and ‘O’, the cis-peptide conformation9,22) were specified using backbone design rules8,9 so that the target fold was favored (Fig. 4b). For NF1, 3, 4, 5 and 7, α-helices were appended at the termini to make the hydrophobic cores sufficiently large. For the same reason, the X region in the anti-β-X-β motifs of NF5, 6 and 7 was built with an α-turn motif22, not just a single loop. In particular, for NF7, ‘AAAB’ loops for βα connections with the right twist angle (Extended Data Fig. 4) and ‘BA’ loops for αβ connections (Extended Data Fig. 5) were adopted to ensure that the two α-turns were packed together. For NF8, the knot-forming fold, two backbone blueprints were built using different torsion types for the loop immediately before the last strand (Extended Data Fig. 6).
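To make the ABEGO notation concrete, the following sketch assigns a letter to a residue from its backbone torsion angles. The text names only the regions; the numerical boundaries used here are an assumption based on a commonly used convention and are not taken from this article. The function names are hypothetical.

```python
def abego(phi: float, psi: float, omega: float = 180.0) -> str:
    """Assign an ABEGO letter from backbone torsion angles in degrees.

    The boundary values below are assumptions, not values from the text.
    """
    if abs(omega) < 90.0:
        return "O"  # cis-peptide conformation
    if phi >= 0.0:
        # Positive phi region, split into G and E by psi.
        return "G" if -100.0 <= psi < 100.0 else "E"
    # Negative phi region: alpha (A) or beta (B).
    return "A" if -75.0 <= psi < 50.0 else "B"


def abego_string(torsions) -> str:
    """Concatenate ABEGO letters for a list of (phi, psi, omega) tuples."""
    return "".join(abego(phi, psi, omega) for phi, psi, omega in torsions)


if __name__ == "__main__":
    # Hypothetical two-residue loop with torsions chosen for illustration.
    print(abego_string([(60.0, 30.0, 180.0), (-120.0, 130.0, 180.0)]))
```

A loop pattern such as ‘GB’ in a blueprint is then simply the string of letters assigned to consecutive loop residues.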

Fig. 4: Characterization of the designs for all eight novel αβ-folds.

a, Identified novel β-sheet topologies. b, Backbone blueprints used for de novo design of the novel αβ-fold structures. Strand lengths are represented by filled and empty boxes that represent pleats coming out and going into the page, respectively. Letter strings next to the loops indicate their ABEGO torsion patterns9. c, Backbone structures generated from the blueprints. Each residue color represents its ABEGO torsion angle (red, A; blue, B; green, G). d, Energy landscapes obtained from Rosetta ab initio structure prediction simulations30. Each dot represents the lowest energy structure obtained in an independent trajectory starting from an extended chain (black) or the design model (red) for each sequence; the x axis shows the Cα r.m.s.d. from the design model and the y axis shows the Rosetta all-atom energy. e, Far-ultraviolet CD spectra at various temperatures (30–170 °C). f, Thermal denaturation monitored at 222 nm. g, Two-dimensional 1H-15N HSQC spectra at 25 °C and 600 MHz.

Next, for each blueprint, we built a backbone structure by averaging over 30–150 backbone structures23 generated by Rosetta fragment assembly simulations24 (Fig. 4c; see Methods for details). As described in the previous section, the β-sheet topologies of NF1 to NF4 have rarely been observed and those of NF5 to NF8 have not been observed in nature. To investigate whether similar naturally occurring protein structures exist at the level of the entire backbone, we performed database analysis using MICAN25,26 and TM-align27, with visual inspection using the TOPS diagram28; no similar naturally occurring protein structures were found, except for the NF2 and NF4 designs (Extended Data Fig. 7). Subsequently, we built side chains on each of the generated backbone structures using the Rosetta design algorithm13,29 (see Methods for details). Designs with low energy, tight core packing and high local sequence–structure compatibility8 were selected, and their energy landscapes were explored by Rosetta ab initio structure prediction simulations30. Designs with amino acid sequences exhibiting funnel-shaped energy landscapes toward the designed structure were experimentally characterized (Fig. 4d).

Experimental characterization of designed proteins

We obtained synthetic genes encoding 16 designs for NF1, four for each of NF2 and NF3, six for each of NF4–7 and twelve for NF8 (six for each of the two blueprints). All sequences are described in Supplementary Tables 1–8. For all sequences, no clear homology to any known protein was found (all designs have BLAST E values >\(10^{-3}\) against the NCBI nr database of nonredundant protein sequences). The proteins were expressed in Escherichia coli with C-terminal 6xHis-tags and purified using a Ni-NTA affinity column. In total, 56 out of the 60 designed proteins were well expressed and soluble. These were then characterized by circular dichroism (CD) spectroscopy, size-exclusion chromatography combined with multi-angle light scattering (SEC-MALS) and 1H-15N heteronuclear single quantum coherence (HSQC) NMR spectroscopy. The experimental results for all designs for all target folds are summarized in Extended Data Table 1. The success rate of the designs including the knotted fold was as high as those in previous de novo designs with the folds existing widely in nature (28 out of 60 designs were characterized as foldable proteins)8,9,10,11,12. For each target fold, one monomeric design with a CD spectrum characteristic of αβ-proteins and the expected number of well-dispersed sharp NMR peaks was selected for NMR structure determination (Fig. 4e–g). All the designs exhibited high thermal stability. Interestingly, the thermal denaturation curves for the designed proteins of NF3–6 were not cooperative (Fig. 4f), which could be one of the reasons why these folds have not been observed in nature. The NMR structures solved using MagRO-NMRViewJ31,32 (Supplementary Fig. 1) were in close agreement with the computational design models for both the backbone (Fig. 5; the root mean square deviation (r.m.s.d.) values for backbone heavy atoms ranged from 1.4 to 2.0 Å) and the core side chains (Extended Data Fig. 8 and Supplementary Table 9), with the correct β-sheet topologies (see Table 1 for NMR constraints and structure statistics). Remarkably, we succeeded in designing the smallest knotted NF8 structure consisting of only four strands (Extended Data Fig. 9). The successful de novo design of all eight αβ-folds demonstrates that the set of rules allows prediction of the novel αβ-folds.
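The design counts quoted in this section can be tallied as a quick consistency check. The short sketch below uses only the numbers stated in the text; the variable names are hypothetical.

```python
# Number of designs ordered per target fold, as stated in the text.
designs_per_fold = {
    "NF1": 16,
    "NF2": 4,
    "NF3": 4,
    "NF4": 6,
    "NF5": 6,
    "NF6": 6,
    "NF7": 6,
    "NF8": 12,  # six for each of the two blueprints
}

total_designs = sum(designs_per_fold.values())
soluble_designs = 56   # expressed well and soluble
foldable_designs = 28  # characterized as foldable proteins

soluble_percent = 100.0 * soluble_designs / total_designs
foldable_percent = 100.0 * foldable_designs / total_designs

if __name__ == "__main__":
    print(total_designs)
    print(f"soluble: {soluble_percent:.0f}%, foldable: {foldable_percent:.0f}%")
```

The per-fold numbers sum to the 60 designs mentioned in the text, of which roughly 93% were soluble and roughly 47% were characterized as foldable.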

Fig. 5: Comparison of computational models with experimentally determined structures.

Top, the top two rows show designed novel αβ-folds from NF1 to NF8. The tertiary arrangement of α-helices (circles) and β-strands (triangles) and their connections are shown at the top, the β-sheet topologies below. Middle, computational design models. Bottom, the NMR structures. The r.m.s.d. between the design model and NMR structure for backbone heavy atoms is indicated. The design models are available in Supplementary Data 1, the NMR structures are available in the PDB: NF1-14 (PDB 7BPL), NF2-02 (7BPM), NF3-03 (7BQE), NF4-04 (7BQC), NF5-03 (7BPP), NF6-02 (7BQB), NF7-04 (7BPN) and NF8-01 (7BQD).

Table 1 NMR constraints and structure statistics of the eight designed structures

Prediction of novel five- to eight-stranded αβ-folds

Next, we revisited the number of frustration-free unobserved β-sheet topologies with five- to eight-stranded β-sheets, shown in Fig. 3a (for three-stranded αβ-proteins, all ten frustration-free β-sheet topologies have been observed in nature). As the number of constituent β-strands in a β-sheet increases, the number of frustration-free unobserved topologies increases exponentially and the ratio of unobserved topologies in frustration-free topologies also increases. The prediction indicates that 12,348 frustration-free (that is, possible) αβ-folds have been left as unobserved in nature; this number far exceeds that of the αβ-folds observed in nature (that is, 400 folds). Note that, since we only investigated novel folds identified by the set of rules introduced here, the predicted number corresponds to a lower limit of that of novel folds. There must be more novel folds that are not identified by applying the rules, but accessible to polypeptide chains.

Discussion

The extent of the protein fold space that is accessible to the polypeptide chain has long been unclear. We systematically investigated the unexplored fold space by introducing a set of rules to predict novel αβ-folds and by carrying out de novo design of all the predicted novel αβ-folds with a four-stranded β-sheet. We found that all the predicted novel αβ-folds, including a knotted fold, can be created. Remarkably, the design success rate was comparable to that of previous de novo designs with naturally occurring folds, and the thermal stability of the designs was as high as that of previous designs8,9,10,11,12. Our study indicates that there are more than 10,000 novel αβ-folds with five- to eight-stranded β-sheets.

There are several possible reasons why these novel folds have not been observed in nature: (1) all life on Earth descended from a common ancestor: naturally occurring folds have been repetitively reused and adapted for different functions, and, therefore, life on Earth is biased by this ancestral relationship and the novel folds have, by chance, not emerged; living things could have evolved using a set of protein folds different from that currently observed in nature; (2) the timeline of biological evolution so far is too short for all possible folds to be explored; and (3) the novel folds are incapable of carrying out functions required for life and have therefore become extinct due to evolutionary bias. To address these possibilities, the relationship between novel fold structures and their functions needs to be studied.

We tested whether AlphaFold2 (ref. 33) can predict the designed structures with novel αβ-folds from their amino acid sequences, using the template structure database as it existed before 14 May 2009 when many de novo-designed proteins started to be deposited in the PDB; the designed structures for NF2, 4, 5 and 6 were predicted for all five prediction models, but those for NF1, 3, 7 and 8 were not predicted for all the models. The prediction method relies on the information obtained from the evolutionary history of naturally occurring proteins; predictions for amino acid sequences far from the ones in nature could be difficult.

The number of predicted novel αβ-folds, which is at the lower limit of that of novel αβ-folds, far exceeds that of the folds observed in nature. Moreover, the novel αβ-folds include the knot-forming ones. Recently, functional proteins have been designed de novo34,35,36,37,38,39,40,41,42,43,44,45. The novel αβ-folds predicted in this study should provide a vast scaffold set for designing protein structures with desired functions.

Methods

Structure dataset of naturally occurring proteins

For the derivation of a set of rules for β-sheet topology (Fig. 2), a dataset comprising 12,595 chains obtained from the cullpdb database (accessed 13 December 2018)51 with more than 40 residues, sequence identity <25%, resolution <2.5 Å and R-factor <1.0 was used. For the analysis of β-sheet topologies of naturally occurring protein structures (Fig. 3), a dataset comprising 65,371 domains obtained from the semimanually curated domain database ECOD52, which provides a hierarchical grouping of evolutionarily related domains, with more than 40 residues and sequence identity <99%, was used. For all obtained structures, structure refinements were performed using ModRefiner53, and the secondary structures were assigned using STRIDE54; when the r.m.s.d. of a refined structure against the original structure for Cα atoms was >1.0 Å, the refined structure was discarded and the original one was used.

Analysis of β-sheet topologies in naturally occurring proteins

β-Sheet topologies were defined for open β-sheets included in the protein domains obtained from ECOD on the basis of the following criteria: (1) the lengths of constituent β-strands are more than two residues; (2) the number of β-strands is at least three; (3) two neighboring β-strands have at least two main chain hydrogen bonds between the β-strands; and (4) no insertion along a sequence by any β-strands belonging to another β-sheet consisting of more than two β-strands (Supplementary Fig. 2). Branched β-sheets with β-strands having more than two neighboring β-strands were discarded.

The observation frequencies in nature of all β-sheet topologies were studied using the ECOD database, in which protein domains are classified according to their evolutionary relationships. In the database, two categories, Family and Homology, are defined. Family represents a group of evolutionarily related protein domains identified on the basis of substantial sequence similarity, and Homology represents a group comprising multiple Family groups, evolutionary relationships of which are inferred on the basis of functional and structural similarities (Homology is equivalent to the superfamily in the SCOP2 or CATH3 structure databases). To study the observation frequency for each β-sheet topology, we counted the Homology groups having the topology, with the following consideration. We first examined the occupation ratio (OR) of the topology in the ith Homology group:

$${\mathrm{OR}}\left(i\right)=\frac{\mathop{\sum }\nolimits_{j}^{{N}_{{\mathrm{Family}}}}{R}_{{\mathrm{Family}}}\left(\,j\right)}{{N}_{{\mathrm{Family}}}},$$

where \({N}_{{\mathrm{Family}}}\) is the total number of Family groups belonging to the Homology group and \({R}_{{\mathrm{Family}}}(j)\) is the ratio of protein domains having the β-sheet topology in the jth Family group. Thus, when all domains in the Homology group contain the β-sheet topology, the occupation ratio of the Homology group is one; otherwise, it is less than one. Finally, the observation frequency for each topology is calculated as the sum of the occupation ratios across Homology groups:

$$\mathop{\sum }\limits_{i}^{{N}_{{\mathrm{Homology}}}}{\mathrm{OR}}(i).$$

For four-stranded β-sheet proteins, we manually checked all structures having topologies with observation frequencies <1.0 and then changed the β-sheet assignments for some of the structures: the β-sheets included in e3hy2X1, e1xw3A1, e2hwjA2, e4rsfA1 and e1tocR2 were identified as β-barrels, and those in e4rgzA1, e2bjjX3, e1iejA2, e3s9lC3, e1blfA4 and e2d3iA3 were identified as six-stranded β-sheets. The defined observation frequency was used to distinguish observed and unobserved topologies in this study: topologies with the observation frequency of 0 were considered unobserved, and evolutionarily unstable topologies with the observation frequency of less than 1/4 were also considered unobserved (Fig. 3b).
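The observation frequency defined above can be sketched in a few lines. This is a minimal illustration of the two formulas and the 1/4 cutoff, not the authors' analysis code; the input layout (one list of per-Family ratios for each Homology group) and the function names are assumptions, and the example values are made up.

```python
def occupation_ratio(family_ratios) -> float:
    """OR(i): mean over Family groups of the fraction of domains with the topology."""
    return sum(family_ratios) / len(family_ratios)


def observation_frequency(homology_groups) -> float:
    """Sum of OR(i) over Homology groups.

    homology_groups: list of lists; each inner list holds R_Family(j) for one
    Homology group (assumed input layout).
    """
    return sum(occupation_ratio(ratios) for ratios in homology_groups if ratios)


def is_unobserved(frequency: float, cutoff: float = 0.25) -> bool:
    """Topologies with a frequency below 1/4 (including 0) count as unobserved."""
    return frequency < cutoff


if __name__ == "__main__":
    # Made-up example: the topology fills one Homology group completely and
    # appears in half of the domains of one of four Families in another group.
    groups = [[1.0, 1.0], [0.5, 0.0, 0.0, 0.0]]
    freq = observation_frequency(groups)
    print(freq, is_unobserved(freq))
```

A topology that appears only sporadically within a single Homology group therefore contributes much less than one to the frequency, which is how rarely conserved topologies fall below the 1/4 cutoff.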

Backbone construction

We built a backbone blueprint for each novel αβ-fold. For the X region in para-β-X-β motifs, a helix was built. The lengths of secondary structures and ABEGO torsion patterns for the connecting loops were obtained from previously reported design rules9. For NF1, 3, 4, 5 and 7, α-helices were appended to their termini to ensure a sufficiently large hydrophobic core between the β-sheet and the α-helices. For the same purpose, an α-turn structure consisting of a helix–loop–helix unit was built in the X region in antiparallel β-X-β motifs in NF5, 6 and 7. β-Strand lengths were selected from 4 to 7, and α-helix lengths varied from 11 to 17 residues. The torsion ABEGO patterns of loops were as follows: GB, GBA or BAAB for the connecting loops of para-type αβ units, AB for para-type βα units, BAB or GBB for anti-type βα units, BAAGB for R chiral ββ units and GG for L chiral ββ units. We newly introduced GABA for para-type αβ units and GBB for anti-type αβ units (Extended Data Fig. 5). For α-turn structures, we used the GBB loop22 to connect the two helices. In the NF7 blueprint, AAAB for anti-type βα units (Extended Data Fig. 4) and BA for para-type αβ units (Extended Data Fig. 5) were used to arrange the two α-turns packed with each other.

In total, 1,000–40,000 backbone structures for each blueprint (the sufficient number depends on the fold type) were generated by Rosetta sequence-independent Monte Carlo fragment assembly simulations using coarse-grained model backbone structures, in which each residue is represented by main chain atoms (N, H, CA, C and O) and a side chain pseudo atom30. The Rosetta potential function used in the simulations considers steric repulsion (vdw = 1.0), overall compaction (rg = 1.0), secondary structure pairings (ss_pair = 1.0, rsigma = 1.0 and hs_pair = 1.0) and main chain hydrogen bonds (hbond_sr_bb = 1.0 and hbond_lr_bb = 1.0), with no sequence-dependent score terms. The steric radius of Val was used for that of the side chain pseudo atom. The ss_pair and rsigma score terms were modified so that only the strand residue pairs specified in the blueprint were favored in the simulations. To enhance the sampling efficiency for obtaining target topology backbone structures, we built backbone structures part by part. For instance, for the NF2 fold, the N-terminal half (β1–β2–α1–β3), which forms a locally globular substructure, was built first, and, subsequently, the C-terminal half (α2–β4) was built by extending the N-terminal half.

The generated backbone structures were further refined as follows. (1) β-Sheet refinement. The entire structure was minimized with constraints that keep the Cα atoms of neighboring strand residues in the blueprint <5.5 Å apart, using the Rosetta full-atom FastRelax protocol55 with upweighted hydrogen bonding and backbone torsion angle terms (hbond_sr_bb = 5.0, hbond_lr_bb = 3.0 and omega = 3.0). Val was used for the full-atom side chains for all residues, except for those in the G region in the ABEGO Ramachandran map22 (for which Gly was used). This step was repeated up to ten times until the secondary structures and ABEGO torsion patterns became identical to those designated in the blueprint. (2) α-Helix refinement. The loop–helix–loop structures were rebuilt using the cyclic coordinate descent loop closure method56 implemented in the BlueprintBDR mover. This step was repeated up to ten times for each loop–helix–loop region until the α-helix was built without kinks and the loop torsion patterns were identical to those designated in the blueprint.

Next, we selected 100–500 backbone structures in which the terminal α-helices are packed against the β-sheet, with the criterion that at least one residue in every continuous five-residue segment of the terminal α-helices is buried (accessible surface area <40 Å²) through contact with the central two β-strands in the β-sheet. Some of the generated backbone structures showed structural diversity. In such cases, we clustered the backbone structures on the basis of structural similarity, using a hierarchical clustering approach (average linkage). The structural similarity was evaluated with Cα r.m.s.d., using a cutoff for clustering of 1.0–2.0 Å, according to the structural diversity of the generated structures. From the three largest clusters, we selected the cluster consisting of structures with tightly packed secondary structures. Finally, we averaged the xyz coordinates of the main chain atoms of 30–150 backbone structures in the cluster, followed by the Rosetta idealization protocol with upweighted score terms (hbond_sr_bb = 10.0, hbond_lr_bb = 10.0 and omega = 10.0), resulting in a backbone structure to be used for the subsequent side chain design.
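The burial criterion used to select backbones with packed terminal helices can be illustrated with a sliding-window check. This sketch is a simplification under stated assumptions: it takes precomputed per-residue accessible surface areas for one terminal helix as input and does not verify that burial arises from contact with the central two β-strands, as the text additionally requires. The function name and the example values are hypothetical.

```python
def helix_is_packed(asa_per_residue, window: int = 5, cutoff: float = 40.0) -> bool:
    """Check that every run of `window` consecutive residues has a buried residue.

    asa_per_residue: accessible surface area (in square angstroms) for each
    residue of a terminal helix, in sequence order.
    A residue counts as buried when its value is below `cutoff`.
    """
    if len(asa_per_residue) < window:
        # Helix shorter than one window: require at least one buried residue.
        return any(asa < cutoff for asa in asa_per_residue)
    return all(
        any(asa < cutoff for asa in asa_per_residue[start:start + window])
        for start in range(len(asa_per_residue) - window + 1)
    )


if __name__ == "__main__":
    # Made-up values: buried residues recur often enough in the first helix only.
    print(helix_is_packed([80, 80, 10, 80, 80, 80, 10, 80, 80]))
    print(helix_is_packed([10, 80, 80, 80, 80, 80, 10]))
```

In the first example every five-residue window contains a buried residue, so the helix passes; in the second, the window covering the five exposed residues in the middle fails the criterion.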

Sequence design

We performed RosettaDesign calculations29 using the full-atom Talaris2014 (ref. 57) scoring function to design side chains (amino acid sequences) that stabilize each generated backbone structure. The design calculation consists of the following three steps: (1) several cycles of amino acid sequence optimization with a fixed backbone and subsequent backbone relaxation; (2) mutations of buried polar residues to hydrophobic ones, followed by optimization of the entire structure; and (3) mutations of solvent-exposed hydrophobic residues to polar residues, followed by optimization of the entire structure. Amino acid types to be used for the design of each residue position, except for those in loop regions, were restricted on the basis of the secondary structure of the position and the buriedness calculated using virtual amino acids. For the design of each loop region (the residues in the loop and the preceding and following three residues), amino acid types were restricted on the basis of the consensus amino acids obtained from the sequence profile for naturally occurring protein structure fragments, which were collected based on the following criteria: (1) secondary structure and ABEGO torsion pattern identical to those of the loop region and (2) r.m.s.d. against the loop structure <2.0 Å. Through the RosettaDesign calculations, up to 40,000 designs were generated for each design target structure.

The designed sequences were then filtered on the basis of the Rosetta total energy, RosettaHoles score58 <2.0 and packstat score of >0.55 for NF2 and >0.6 for the others. Furthermore, we filtered the designs on the basis of the local sequence–structure compatibility8. We collected 200 fragments for each nine-residue frame in each designed sequence from a nonredundant set of experimental structures, on the basis of the sequence similarity and secondary structure prediction. Subsequently, for each frame, we calculated the Cα r.m.s.d. of the local structure against each of the 200 fragments. The designs were ranked according to the sum, over all nine-residue frames, of the log ratio of the fragments for which the r.m.s.d. was <1.5 Å, and those with high values were selected.
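
The ranking score for local sequence–structure compatibility can be illustrated with a short Python sketch. It assumes that the per-frame Cα r.m.s.d. values against the collected fragments are already available as a list of lists (a hypothetical input format). The pseudocount that avoids taking the logarithm of zero is our assumption; how frames with no fragment below the cutoff were handled is not stated here.

```python
import math


def local_compatibility_score(frame_rmsds, rmsd_cutoff=1.5, pseudocount=0.5):
    """Sum over frames of log(fraction of fragments with r.m.s.d. < cutoff).

    frame_rmsds: one list per nine-residue frame, holding the C-alpha
                 r.m.s.d. (angstroms) to each collected fragment.
    pseudocount: assumption, added so that a frame with no close
                 fragment gives a large negative term instead of log(0).
    """
    score = 0.0
    for rmsds in frame_rmsds:
        if not rmsds:
            raise ValueError("each frame needs at least one fragment r.m.s.d.")
        n_close = sum(1 for r in rmsds if r < rmsd_cutoff)
        ratio = (n_close + pseudocount) / (len(rmsds) + pseudocount)
        score += math.log(ratio)
    return score
```

Higher (less negative) scores indicate that, frame by frame, more of the fragments picked from the sequence resemble the designed local structure.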

Protein expression and purification

A spacer was added at the C terminus of each designed sequence (‘GSWS’ for the sequences that have neither a Trp residue nor more than two Tyr residues and ‘GS’ for others) to separate the designed region from the C-terminal 6xHis-tag. Genes encoding the designed sequences were synthesized and cloned into pET21b expression vectors at Eurofins Genomics. The designed proteins were expressed in E. coli BL21 Star (DE3) cells (Invitrogen) as uniformly (U)15N-labeled proteins using MJ9 minimal medium59 containing [15N]ammonium sulfate as the sole nitrogen source and [12C]glucose as the sole carbon source. The expressed proteins with a C-terminal 6xHis-tag were purified using an Ni-NTA affinity column. The purified proteins were dialyzed against PBS buffer (137 mM NaCl, 2.7 mM KCl, 10 mM Na2HPO4 and 1.8 mM KH2PO4, pH 7.4; this buffer was used for all experiments except NMR structure determination). The expression, solubility and purity of the designed proteins were assessed by SDS–PAGE and mass spectrometry (Thermo Scientific Orbitrap Elite). The protein concentrations were determined from the absorbance at 280 nm (ref. 60) measured using a UV spectrophotometer (NanoDrop, Thermo Scientific).
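
The spacer rule described above can be written as a small Python function. This is an illustrative sketch only; the function name is hypothetical, and the 6xHis-tag is not appended here because it is assumed to be supplied by the expression vector.

```python
def add_c_terminal_spacer(sequence):
    """Append the C-terminal spacer to a designed sequence.

    'GSWS' is used when the sequence has no Trp and at most two Tyr
    residues; otherwise 'GS' is used.
    """
    seq = sequence.upper()
    lacks_trp = "W" not in seq
    at_most_two_tyr = seq.count("Y") <= 2
    spacer = "GSWS" if (lacks_trp and at_most_two_tyr) else "GS"
    return sequence + spacer
```

For example, a sequence without Trp and with a single Tyr receives 'GSWS', which presumably introduces a Trp residue to aid the absorbance measurement at 280 nm.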

Circular dichroism spectroscopy

CD data were collected on a JASCO J-1500 CD spectrometer using JASCO SpectraManager software v.2. For all designs, far-UV CD spectra were measured from 260 to 200 nm using ~20-μM protein samples in PBS buffer (pH 7.4) with a 1-mm path length cuvette. For the eight representative designs (NF1-14, NF2-02, NF3-03, NF4-04, NF5-03, NF6-02, NF7-04 and NF8-01), thermal denaturation measurements were performed once from 30 to 170 °C under 1 MPa pressure with an increase of 1 °C min−1. During the denaturation, the ellipticity at 222 nm was monitored, and far-UV CD spectra were measured from 260 to 200 nm at the various temperatures shown in Fig. 4e.

Size-exclusion chromatography combined with multi-angle light scattering

SEC-MALS experiments were performed using a miniDAWN TREOS static light scattering detector (Wyatt Technology) combined with a high-performance liquid chromatography (HPLC) system (1260 Infinity LC, Agilent Technologies). One hundred microliters of 200–500 μM Ni-purified protein samples in PBS buffer (pH 7.4) was injected into a Superdex 75 Increase 10/300 GL (GE Healthcare) or Shodex KW-802.5 (Showa Denko K.K.) column equilibrated with PBS buffer at a flow rate of 0.5 ml min−1. The protein concentrations were calculated from the absorbance at 280 nm detected by the HPLC system. Static light scattering data were collected at three different angles of 43.6°, 90.0° and 136.4° at 659 nm. The data were analyzed using ASTRA software (v.6.1.2, Wyatt Technology) with a refractive index increment (dn/dc, the change in the refractive index with concentration) of 0.185 ml g−1.

Two-dimensional 1H-15N heteronuclear single quantum coherence measurement by nuclear magnetic resonance

Two-dimensional (2D) 1H-15N HSQC NMR experiments were performed to verify whether the designed proteins fold into well-packed structures. The HSQC spectra were collected for protein samples of 0.5–1.0 mM in 90% 1H2O/10% 2H2O PBS buffer (pH 7.4) at 25 °C on a JEOL JNM-ECA 600 MHz spectrometer using Delta v.5.0.4 NMR software. For each fold, the stable monomeric design showing the expected number of well-dispersed sharp NMR peaks (NF1-14, NF2-02, NF3-03, NF4-04, NF5-03, NF6-02, NF7-04 and NF8-01) was selected for NMR structure determination.

Solution structure determination by NMR

Sample preparation

For NMR structure determination of the eight selected designs, uniformly isotope-labeled [U-15N, U-13C] proteins were expressed using the method described above, except that [13C]glucose was used as the sole carbon source. The [U-15N, U-13C]-enriched proteins were purified through a Ni-NTA affinity column followed by gel filtration chromatography on an ÄKTA Pure 25 FPLC (GE Healthcare) using a Superdex 75 Increase 10/300 GL column (GE Healthcare). The purified proteins were dissolved in 95% 1H2O/5% 2H2O PBS buffer at various pH (50 mM NaCl, 1.1 mM Na2HPO4 and 7.4 mM KH2PO4 at pH 6.0 for NF2-02, NF3-03 and NF6-02; 50 mM NaCl, 4.3 mM Na2HPO4 and 5.7 mM KH2PO4 at pH 6.8 for NF5-03, NF7-04 and NF8-01; 50 mM NaCl, 5.6 mM Na2HPO4 and 1.1 mM KH2PO4 at pH 7.4 for NF1-14; and 137 mM NaCl, 1.1 mM Na2HPO4 and 7.4 mM KH2PO4 at pH 7.4 for NF4-04). Shigemi micro-NMR tubes were used for all NMR measurements except RDC (protein concentration ~900 μM for all designed proteins except NF4-04 (~400 μM) and NF6-02 (~700 μM)), and normal NMR tubes were used for RDC experiments (protein concentration ~200 μM).

NMR measurements

NMR measurements were performed on Bruker AVANCE III NMR spectrometers equipped with a QCI cryo-Probe (1H/13C/15N/31P) at 303 K. Spectrometers with 600-, 700- and 800-MHz magnets were used for signal assignments and NOE-related measurements, whereas those with 900- and 950-MHz magnets were used for RDC experiments. For signal assignments, 2D 1H-15N HSQC (echo/anti-echo) and 1H-13C constant-time HSQC spectra for aliphatic and aromatic signals, together with three-dimensional (3D) HNCO, HN(CO)CACB and HNCACB spectra for backbone signal assignments, were measured; the BEST pulse sequence61 was used for the triple-resonance experiments of NF2, NF3, NF5, NF6, NF7 and NF8. For structure determination, 3D 15N-edited NOESY and 3D 13C-edited NOESY for aliphatic and aromatic signals (mixing time = 100 ms) were performed. For RDC experiments, 2D IPAP 1H-15N HSQC spectra using WATERGATE pulses for water suppression were measured with or without 6–10 mg ml−1 of Pf1 phage (ASLA Biotech). To confirm the positions of the 1H-15N signals in the 2D IPAP 1H-15N HSQC, 3D HNCO spectra were measured under the identical buffer condition containing Pf1 phage. The α and β states of 15N signals split by 1H-15N 1J coupling were separately identified for the protein in the isotropic and weakly aligned states, to obtain one-bond RDC \({}^{1}D_{{}^{1}{\rm{H}}/{}^{15}{\rm{N}}}\) values. These values were estimated by simple subtraction of the shifted values between the isotropic and weakly aligned states, then divided by the static magnetic field to obtain the RDC value in Hz.

NMR signal assignments

All NMR signals were identified using MagRO-NMRViewJ (upgraded version of Kujira31) in a fully automated manner, then noise peaks were filtered by deep-learning methods using Filt_Robot32. The FLYA module was used for fully automated signal assignments and structure calculation62 to obtain roughly assigned chemical shifts (ACS), and the reliable ones were collected into a MagRO ACS table. After confirmation and correction of the ACS by visual inspection on MagRO, TALOS+63 calculations were performed to predict phi/psi dihedral angles, which were then converted to angle constraints in the CYANA format. The signal assignments in 2D 1H-15N HSQC spectra for all folds are shown in Supplementary Figs. 3–10.

Before measuring a series of 3D spectra for the side chain chemical shift assignments of aliphatic and aromatic 1H/13C signals, we inspected the 2D 1H/15N and 1H/13C HSQC spectra to evaluate how many signals were crowded, overlapped or missing, and then decided on the following set of 3D spectra: 3D HCCH-TOCSY and 13C-edited NOESY. The side chain amide signals were assigned using 3D H(CCO)NH, (H)C(CO)NH and 15N-edited NOESY. The details for 3D spectra are described as follows. 3D HCCH-TOCSY: for aliphatic, offset placed on 13C-aliphatic center, DIPSI-3 mixing for 13C spin-lock, States-TPPI for 1H, 13C indirect; for aromatic, offset placed on 13C-aromatic center, DIPSI-3 mixing for 13C spin-lock, States-TPPI for 1H, 13C indirect. 3D 15N-edited NOESY: HSQC Echo/Anti-echo TPPI gradient section, with Sensitivity Enhancement, without water suppression pulse in D1 (initial delay time). D8 (NOE mixing time) was set around 100–150 ms. [3D H(CCO)NH only for NF4] WATERGATE pulse scheme was used for water suppression on inverse correlation, DIPSI-2 mixing for 13C spin-lock, States-TPPI for 1H, 15N indirect. [3D (H)C(CO)NH only for NF4] WATERGATE pulse scheme was used for water suppression on inverse correlation, DIPSI-2 mixing for 13C spin-lock, States-TPPI for 13C, 15N indirect. [3D (H)C(C)H-TOCSY only for NF4] for aliphatic, offset placed on 13C-aliphatic center, DIPSI-3 mixing for 13C spin-lock, States-TPPI for 13C, 13C indirect; for aromatic, offset placed on 13C-aromatic center, DIPSI-3 mixing for 13C spin-lock, States-TPPI for 13C, 13C indirect.

Structure calculation

Several CYANA64 calculations were performed using the ACS table, NOE peak table and dihedral angle constraints. After the CYANA calculations, several dihedral angle constraints derived from TALOS+ revealing large violations for nearly all models in the structure ensemble were eliminated. After the averaged target function of the ensemble reached <2.0 Å2, refinement calculations using Amber12 were carried out for the 20 models with the lowest target functions. TALOS+ order parameter and the number of NOE distance constraints for each residue are shown in Supplementary Fig. 11.

NMR structure validation

R.m.s.d. values were calculated for the 20 structures overlaid on the mean coordinates for the ordered regions, which were automatically identified by Filt_Robot using multidimensional nonlinear scaling47. RDC back calculation was performed with PALES48 using the experimentally determined RDC values. The averaged correlation between the simulated and experimental values was obtained using all signals except those of residues in overlapping regions of the 1H-15N HSQC spectra and residues predicted by TALOS+ to have an order parameter of less than 0.8. Detailed methods and results are described in Table 1 and the Supplementary Text.
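
The residue filtering applied before comparing simulated and experimental RDC values can be sketched in Python as follows. This is an illustration under assumptions, not the analysis code used here: the function name and the dictionary inputs keyed by residue number are hypothetical, and the Pearson coefficient is used as one plausible choice of correlation measure, because the exact statistic is not specified in this section.

```python
import math


def rdc_correlation(experimental, simulated, order_parameters,
                    overlapped=(), s2_cutoff=0.8):
    """Pearson correlation between experimental and back-calculated RDCs.

    experimental, simulated: dicts mapping residue number -> RDC (Hz).
    order_parameters: dict mapping residue number -> TALOS+ order parameter.
    overlapped: residues with overlapping 1H-15N HSQC signals (excluded).
    Residues with an order parameter below `s2_cutoff` are also excluded.
    """
    pairs = [
        (experimental[res], simulated[res])
        for res in experimental
        if res in simulated
        and res not in overlapped
        and order_parameters.get(res, 0.0) >= s2_cutoff
    ]
    if len(pairs) < 2:
        raise ValueError("need at least two residues after filtering")
    xs, ys = zip(*pairs)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / math.sqrt(var_x * var_y)
```

Averaging this value over the 20 models of an ensemble would give one number per design; whether the averaging was done in exactly this way is an assumption.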

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.