Main

Many naturally occurring protein structures are complicated, lacking distinguishable symmetry and regularity. Prominent examples of such complicated proteins are globin-fold structures with eight irregularly packed α-helices; Kendrew referred to the tertiary arrangement of the secondary structures as being difficult to describe in simple terms1 (Fig. 1a). In most parts of globin-fold structures, two helices adjacent in the sequence are connected crosswise rather than hairpin-like, and the helix–helix packings deviate from the canonical patterns2,3; this fold does not include internal structural repeats such as those of α-solenoids4,5. These asymmetric, irregular and nonrepetitive secondary structure arrangements make globin structures, like many other naturally occurring proteins, difficult to describe simply.

Fig. 1: Comparison of the structural complexities of naturally occurring and de novo designed proteins.

a,b, Structures of myoglobin (a) and representative de novo designed all-α proteins (b) (the N- and C-termini are colored in blue and red, respectively, and the characters represent PDB IDs). The α-helices in the globin structure are irregularly aligned, whereas those of the de novo designs are aligned almost in parallel. c, The order parameter capturing the complexities of α-helical proteins, HO. HO is defined as the average of the squared inner products between helix orientation vectors, ui, for all pairs of N α-helices55. Higher values indicate more ordered structures, and lower values more complicated ones. d, HO distributions for naturally occurring and de novo designed proteins with three to eight α-helices. Whereas naturally occurring all-α proteins show broad distributions irrespective of the number of constituent α-helices, previous de novo designed all-α proteins indicated by yellow-ocher bars show relatively higher values in the distributions (for details of the previous designs, see Extended Data Fig. 1). Notably, globin structures indicated by blue bars have quite low values. The all-α proteins created in this study, indicated by red bars, have lower values than the previous designs.

A wide range of all-α protein structures have been designed, but the designs have been limited to simple and ordered structures consisting of α-helices in almost parallel alignment, such as coiled-coil, bundle and barrel structures (Fig. 1b–d and Extended Data Fig. 1)5–27. Jacobs et al. attempted to design α-helical proteins with more variety15, but their designs were still bundle-like (the two designs with five α-helices in Fig. 1b). However, the distribution of complexity for naturally occurring all-α protein structures is biased toward complicated structures (Fig. 1d). The observed bias is probably due to the fact that all-α proteins with complicated spatial arrangements of α-helices can provide diverse and heterogeneous molecular surfaces, enabling specific interactions with binding partners. Moreover, such complicated all-α proteins should make it possible to incorporate a functional site enclosed on nearly all sides by multiple structural elements in three dimensions, as in globins. Therefore, the ability to create protein structures with irregularly packed α-helices would contribute to the design of various functional proteins.

In this article, we sought to develop a computational method to design complicated all-α structures by employing helix–loop–helix (HLH) motifs typically observed in naturally occurring proteins. The developed method enabled us to generate a wide range of α-helical protein structure topologies from bundle-like to complicated by combining the typical HLH motifs and canonical α-helices. Finally, we demonstrated the ability to create complicated all-α proteins by de novo design of five distinct topologies.

Strategy for all-α topology building

Although it has been suggested that the overall tertiary arrangements of helices of naturally occurring α-helical proteins can be approximated by a quasi-spherical polyhedral model28, the major obstacle in designing complicated all-α topologies with irregularly aligned α-helices is the difficulty of determining, a priori, feasible topologies together with their backbone blueprints, which specify the lengths of secondary structures and loops. This is different from the design of αβ-proteins: there, the topologies are selected in advance by β-strand arrangements (that is, the order and orientations of β-strands in a β-sheet), and the backbone blueprints are derived from a set of rules relating local backbone structures of a few successive secondary structure elements to the preferred tertiary motifs29. Therefore, we attempted to explore all-α topologies, not by preparing them a priori but by generating backbone structure topologies through the combinatorial enumeration of tertiary building blocks (Fig. 2). Moreover, the tertiary building blocks were selected from those typically observed in nature, so that the generated backbone structures are likely to be feasible. The question is then whether complicated all-α topologies can be generated from typical building blocks.

Fig. 2: Strategy for building α-helical backbone structure topologies.

Top: building blocks for generating backbone structures. Canonical α-helices and three types of HLH tertiary motifs typically observed in nature, hairpin (h), v-shaped (v) and corner (c), are used. Helices range from 5 to 35 residues, and each motif type comprises six patterns (Fig. 3a). The motif types were classified on the basis of the bending angle between the constituent helices in HLH motifs. Middle: secondary-structure element ordering to build α-helical proteins with five helices. According to the ordering, globular backbone structures without steric clashes are exhaustively explored by combining the building blocks, with the constraint of total residue length. Bottom: examples of generated α-helical backbone structure topologies. Poorly packed structures (lower) are discarded, whereas globularly folded structures (upper) are collected.

A typical set of HLH motifs as building blocks

We first attempted to collect a set of HLH tertiary motifs that are typically observed in nature as building blocks. The HLH units consisting of two α-helices and the connecting loop of one to five residues in length were extracted from naturally occurring proteins, then clustered into 18 subgroups based on the five-dimensional feature vectors representing the HLH tertiary geometries30 (Extended Data Figs. 2 and 3 and Methods). The representative 18 HLH motifs corresponding to each cluster density peak exhibited a broad range of bending angles between two helices, such as left- or right-handed helix–turn–helix, helix–corner–helix and kinked helices (Fig. 3a and Extended Data Figs. 3 and 4); the amino acid preference for each motif is shown in Extended Data Fig. 5: Gly at the residues with positive phi backbone torsion angle, helix capping residues immediately before helices such as Asp, Asn, Thr or Ser (refs. 31,32), and hydrophobic periodicity of helix residues specific to each motif are observed. The 18 HLH motifs are classified into three classes according to the magnitude of the bending angle: hairpin (h), v-shaped (v) and corner (c). The 18 representative HLH motifs were used as building blocks (Fig. 2, top) for generating α-helical backbone structure topologies.

Fig. 3: 18 HLH tertiary motifs and generated α-helical backbone structures.

a, Identified 18 HLH tertiary motifs typically observed in nature. The motifs are classified based on the bending angle between the two helices in the motifs: hairpin (h), v-shaped (v) and corner (c), which are presented in order of the magnitude of the bending angle, with the ABEGO backbone torsion pattern for the connecting loop. The residues with backbone torsion types A, B, E and G in the ABEGO torsion representation (‘A’ corresponds to the right-handed α-helix region in the Ramachandran map, ‘B’ to the β-strand region, ‘E’ to the extended region with a positive phi angle, and ‘G’ to the left-handed α-helix region) are shown in red, blue, yellow and green, respectively. b, HO distributions for generated backbone structures with six helices. The black, gray and white bars respectively represent the distributions for the ensemble generated using only hairpin motifs (h), hairpin and v-shaped (h + v) motifs, and all three motifs (h + v + c). Incorporation of v-shaped and corner loops yields lower HO structures. c, Examples of the generated backbone structures. The motif type strings used and the HO values are indicated for each structure. The N- and C-termini are colored in blue and red, respectively.

Generation of all-α topologies by combinatorial enumeration

Next, we investigated whether complicated topologies are produced using these typical tertiary motifs. Helical backbone structures composed of five and six helices were built with total lengths of 90 and 110 residues, respectively, by combining the set of 18 HLH motifs and canonical α-helices ranging from 5 to 35 residues. The backbone structures were generated by enumerating all the combinations and selecting compact and steric-clash-free structures (Methods): 1,159,937,910 five-helix and 20,878,882,380 six-helix structures were enumerated, and 1,899,355 and 380,869 structures, respectively, were then selected. The resulting topologies exhibited a broad spectrum ranging from helical bundle-like to complicated globular structures, demonstrating that complicated α-helical topologies are created from the typical tertiary motifs and canonical α-helices (Fig. 3b, white bar; Fig. 3c and Extended Data Fig. 6); the helix lengths were also widely distributed in the generated structures (Extended Data Fig. 7). Moreover, we found that the complexities of the generated topologies increase as tertiary motifs with larger bending angles are included (black, gray and white bars in Fig. 3b). These results highlight the importance of corner-type motifs33 in building complicated α-helical topologies.
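As a consistency check on these counts, the enumerated totals follow from the number of steric-clash-free four-helix structures reported in Methods if each extension step is assumed to contribute one structure per HLH motif (18 in total), with the appended helix length fixed by the total-length constraint. The short Python sketch below reproduces the arithmetic; the variable names and the `selected_percent` helper are illustrative and are not part of the original pipeline.

```python
# Counts quoted in the text and in Methods.
N_MOTIFS = 18                        # representative HLH motifs
FOUR_HELIX_CLASH_FREE = 64_440_995   # clash-free four-helix structures (Methods)
FIVE_HELIX_SELECTED = 1_899_355      # compact, clash-free five-helix structures
SIX_HELIX_SELECTED = 380_869         # compact, clash-free six-helix structures

# Assumption: every extension step adds one helix through each of the 18
# loop types, so the number of candidates grows by a factor of 18 per step.
five_helix_enumerated = FOUR_HELIX_CLASH_FREE * N_MOTIFS
six_helix_enumerated = five_helix_enumerated * N_MOTIFS


def selected_percent(selected: int, enumerated: int) -> float:
    """Percentage of enumerated structures that passed the filters."""
    return 100.0 * selected / enumerated


print(f"{five_helix_enumerated:,}")  # 1,159,937,910
print(f"{six_helix_enumerated:,}")   # 20,878,882,380
print(f"{selected_percent(FIVE_HELIX_SELECTED, five_helix_enumerated):.4f}% of five-helix candidates kept")
print(f"{selected_percent(SIX_HELIX_SELECTED, six_helix_enumerated):.5f}% of six-helix candidates kept")
```

Under this assumption the products match the enumerated totals quoted above, and only a small fraction of the candidates survive the compactness and clash filters.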

Design of complicated α-helical topologies

From the generated myriad backbone structure topologies, we selected five for de novo design, H5_fold-0, H6_fold-C, H6_fold-Z, H6_fold-U and H7_fold-K (the Arabic numeral after ‘H’ indicates the number of helices) (Fig. 4 and Supplementary Fig. 1), in the following way. We first selected three topologies exhibiting extremely low helix order (HO) values (for the definition, see Fig. 1c and Methods): H5_fold-0, H6_fold-C and H6_fold-Z (Fig. 1d). Next, to test whether all identified HLH motifs could be used for de novo design, we selected H6_fold-U and H7_fold-K, which include all of the HLH motifs not used in the first three and still exhibit lower HO values (Fig. 1d). For all target folds except H5_fold-0, the lengths of the terminal helices were manually elongated to ensure sufficient packing interactions. None of these backbone structures is similar to any known protein structure; H5_fold-0, H6_fold-C, H6_fold-Z and H6_fold-U show a TM-score <0.6, using TM-align34 against the ECOD database35, and H7_fold-K shows a score of 0.610 with the structure e2bnlA1 (Extended Data Fig. 8). The details of the selected topologies are described in Supplementary Text. For each backbone structure, amino acid sequences were designed through iterations of fixed-backbone sequence optimization and fixed-sequence structure optimization using Rosetta design calculations36,37. Designs with low energy, tight core packing38 and high compatibility between local sequences and structures29 were selected, and their energy landscapes were explored by 10,000 independent Rosetta ab initio structure prediction simulations starting from an extended conformation39. Ninety-one percent (75 of 82 designs) for H5_fold-0, 45% (18 of 40 designs) for H6_fold-C, 68% (27 of 40 designs) for H6_fold-Z, 67% (60 of 90 designs) for H6_fold-U, and 40% (36 of 90 designs) for H7_fold-K showed funnel-shaped energy landscapes.
Among the designs having funnel-shaped energy landscapes, we selected approximately ten designs for each topology (for the details, see Methods).

Fig. 4: Backbone structures for the five design target topologies.

The design target backbone structures. H1–7 represents the first to seventh helices. The letter string next to a loop indicates the ABEGO torsion pattern and the character within a bracket indicates the motif type. The loop residues are colored according to the ABEGO torsion representation, as in Fig. 3a. The HO value and radius of gyration (Rg) are indicated for each structure.

Experimental characterization of designed proteins

We obtained synthetic genes encoding ten designs for H5_fold-0, seven for H6_fold-C, seven for H6_fold-Z, eight for H6_fold-U and eight for H7_fold-K. Some designs (H6_fold-Z, 2; H6_fold-U, 1; H7_fold-K, 2) have weak sequence similarity to known proteins with blast E-value <0.005, but the structures are unknown (Supplementary Table 1). The proteins were expressed in Escherichia coli and purified using a Ni2+-NTA affinity column. The purified proteins were then characterized by circular dichroism (CD) spectroscopy and size-exclusion chromatography combined with multi-angle light scattering (SEC–MALS). Across all design target topologies, 34 of 40 designed proteins were found to be well expressed and highly soluble, and showed CD spectra typical of α-helical proteins; 27 of the 34 designs were found to be monomeric by SEC–MALS (Supplementary Tables 2–6). Furthermore, the monomeric designs were characterized by 1H-15N heteronuclear single quantum coherence (HSQC) nuclear magnetic resonance (NMR) spectroscopy, and 23 designs showed well-dispersed sharp peaks (Supplementary Tables 2–6 and Supplementary Fig. 2). The experimental results for all the designs are summarized in Extended Data Table 1. For each topology, we selected one monomeric design with well-dispersed sharp NMR peaks for NMR structure determination (Fig. 5 and Supplementary Fig. 3). All the designs were found to be highly stable against thermal denaturation up to 170 °C by CD (Fig. 5b,c). The NMR structures were solved at high quality using MagRO-NMRViewJ40,41 (Table 1, Supplementary Text, Supplementary Figs. 4 and 5 and Supplementary Table 7), and the solved structures were consistent with the design models (Fig. 6 and Supplementary Table 8). For H5_fold-0, one of the designs was solved by X-ray crystallography and was nearly identical to the design model except for the domain swapping in the crystallized condition (Fig. 6, Table 2 and Supplementary Fig. 5).
Despite the inclusion of noncanonical helix–helix packing arrangements in each design, the sidechains from distant α-helices were found to be coherently packed to constitute a single hydrophobic core similar to the design model. Notably, the bulky hydrophobic sidechains from the loops and neighboring α-helices also contributed largely to the core: they spiked the core and pinned the loops to the target conformations (Extended Data Fig. 9; for the importance of hydrophobic residues in the HLH motifs on energy landscapes of the designs, see Supplementary Fig. 6b,e). Interestingly, the N- and C-terminal helices of H6_fold-U_Nomur were found to fluctuate despite the helix formation (Supplementary Figs. 7–9). Furthermore, in the thermal denaturation, the helical content of H6_fold-U_Nomur gradually decreased before the transition (the second from the bottom in Fig. 5c), and in the chemical denaturation, the m-value, which represents the cooperativity, was lower than those of the other designs (Fig. 5d; note that m-values also depend on protein size, with larger proteins having larger m-values42; therefore, H5_fold-0_Elsa and H5_fold-0_Chantal, which are smaller in size than the other designs, show lower m-values). These results are probably attributable to the low hydrophobicity of the core-forming residues of the C-terminus: almost all of the residues are Ala (Supplementary Fig. 8). We also compared the loop geometries of all HLH motifs at the ABEGO level in the design models and experimental structures (Supplementary Fig. 10 and Supplementary Table 9) (for the importance of helix capping residues in the HLH motifs on energy landscapes of the designs, see also Supplementary Fig. 6c,f). Except for the loop immediately before the C-terminal helix of H6_fold-U_Nomur, all loop geometries of the experimental structures agreed with those of the design models. These results indicate that the difficult-to-describe α-helical proteins are designable with typical building blocks.

Fig. 5: Characterization of designed proteins.

a, Energy landscapes from Rosetta ab initio structure prediction simulations. The y axis represents Rosetta all-atom energy and the x axis represents the Cα RMSD from the design model. Black points represent the lowest energy structures obtained in independent Monte Carlo structure prediction trajectories starting from an extended chain for each sequence; red points represent the lowest energy structures obtained in trajectories starting from the design model. b, Far-ultraviolet CD spectra at 30 °C, the temperatures close to the melting temperature Tm, and 170 °C. The CD spectra were recorded under the pressure of 10 bar. c, Thermal denaturation measured at 222 nm under the pressure of 10 bar. For each design, the data were fitted to a two-state model (black solid line) to obtain the Tm. d, Chemical denaturation with GuHCl (square brackets denote concentration) measured at 222 nm and 25 °C. For each design, the data were fitted to a two-state model (black solid line) to obtain the free energy of unfolding ΔG and its dependency on the denaturant, m-value. e, Two-dimensional 1H-15N HSQC spectra at 25 °C and 600 MHz.

Table 1 NMR constraints and structure statistics of the five designed structures
Fig. 6: Comparison of computational models with experimentally determined structures.

Design structures (left) and NMR structures (a crystal structure for H5_fold-0_Elsa) (right); the N- and C-termini are colored in blue and red, respectively. The Cα RMSD between them is indicated (for H5_fold-0_Elsa, whose crystal structure is a domain-swapped dimer, the Cα RMSD was calculated using MICAN56).

Table 2 X-ray crystallography data collection and refinement statistics

Discussion

De novo designs of α-helical proteins have focused on structures consisting of α-helices aligned in parallel (Fig. 1), many of which are based on helical structure models such as the helical wheel43 and Crick’s parameterization44. We sought to develop a computational method for designing difficult-to-describe α-helical protein structures. We first identified the 18 HLH motifs typically observed in naturally occurring proteins. We then demonstrated that a wide range of globular all-α backbone structure topologies from bundle-like to complicated are generated by combining the 18 typical HLH motifs and canonical α-helices. The key to building complicated α-helical topologies is to include HLH motifs with larger bending angles such as corner-type motifs. This approach can be regarded as the reverse of blueprint-based design: here, design target topologies are searched by the combinations of HLH motifs, whereas in blueprint-based design, target topologies are predetermined and local backbone structures favoring the topologies are then selected.

We succeeded in designing complicated α-helical protein structures with five distinct topologies, three of which, H5_fold-0, H6_fold-C and H6_fold-Z, exhibited structural complexities comparable to the globin fold. The design success rate was as high as that of previous de novo designs, and the designs exhibited high solubility and thermal stability, similarly to previous designs29,45,46,47,48,49. Moreover, the loop geometries of almost all HLH motifs were formed as designed, which must have enabled the designed proteins to fold into the target topologies. These de novo design results indicate that the compact and steric-clash-free backbone structures generated by using the typical HLH motifs are probably designable. In this regard, however, one open question is whether all, or what fraction, of the generated backbone structures can have tight core packing of sidechains. We have demonstrated that the selected five backbone structures are packable through de novo design, but the packability of the other backbone structures has not been clarified, which should be addressed in future work.

The computationally generated myriad of complicated all-α structures should provide diverse and heterogeneous molecular surfaces for engineering functions such as binding, enzymatic activity and self-assembly into symmetric oligomers. The myriad of generated structures, which are presumably highly soluble and stable, coupled with the recently developed massive gene synthesis50,51 and parallel high-throughput screening17,18,26,52, should make it possible to create proteins with optimal structures for specific functions17,26.

Methods

Definition of HO

HO is the order parameter that captures the complexities of α-helical proteins. HO is defined as the average of the squared inner products between helix orientation vectors, ui, for all pairs of N α-helices55:

$$\mathrm{HO}=\frac{2}{N(N-1)}\sum_{1\le i<j\le N}{\left({\bf{u}}_{i}\cdot {\bf{u}}_{j}\right)}^{2}.$$

Higher values indicate more ordered structures, and lower values more complicated ones.
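A minimal Python sketch of this definition is given below. It assumes that a direction vector is already available for each helix (how the orientation vectors are derived from atomic coordinates is not described in this section; see ref. 55) and normalizes each vector to unit length before taking the squared inner products. The function names `unit` and `helix_order` are illustrative.

```python
import math
from itertools import combinations


def unit(vector):
    """Return the vector scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("helix orientation vector must be non-zero")
    return tuple(x / norm for x in vector)


def helix_order(orientation_vectors):
    """Average squared inner product over all pairs of helix orientations.

    Dividing the pair sum by the number of pairs, N(N-1)/2, is the same
    as multiplying it by the prefactor 2 / (N(N-1)) in the definition.
    """
    units = [unit(v) for v in orientation_vectors]
    if len(units) < 2:
        raise ValueError("at least two helices are required")
    pairs = list(combinations(units, 2))
    total = sum(sum(a * b for a, b in zip(u, v)) ** 2 for u, v in pairs)
    return total / len(pairs)


# Parallel and antiparallel helices count equally because the product is squared.
bundle = [(0, 0, 1), (0, 0, -1), (0, 0, 2)]
# Three mutually perpendicular helices give the lowest possible value.
perpendicular = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(helix_order(bundle), helix_order(perpendicular))  # 1.0 0.0
```

Because the inner product is squared, an up-down helical bundle scores as highly ordered regardless of helix direction, which is consistent with bundle-like designs showing high HO values.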

Analysis of all-α protein structures for de novo designed and naturally occurring proteins

Twenty-two de novo designed all-α protein structures were collected from the Protein Data Bank (PDB). To this end, de novo designs were searched by the keyword ‘de novo’ or ‘de-novo’ in the PDB as of November 2020, and then all-α structures containing no β-strands were extracted on the basis of the secondary structure assignments by the DSSP algorithm57 (for the PDB structures including multiple chains or NMR models, the first chain or model was used). The following four classes of de novo designed proteins were excluded from the dataset: (1) designed proteins created on the basis of backbone structures of naturally occurring proteins, and those with sequence similarity higher than 0.90 (as an exception, the three-helix bundle structure designs (PDB codes: 6DS9 and 2A3D) were both included because of their structural dissimilarity); (2) assemblies composed of one or two α-helices (for example, 3R3K and 1U7J); (3) repetitive structures such as α-solenoids (for example, 1MJ0 and 5K7V); (4) membrane proteins.

For naturally occurring all-α protein structures, 7,352 representative structures found in the mainly-α class in the CATH database58 with sequence identity less than 40% were used.

For calculating the HO values of the collected structures, secondary structure elements and loops were assigned by DSSP57 (α-helices are defined as residue regions composed of at least five successive residues assigned as ‘H’ by the DSSP calculation). Note that the secondary structure assignments by DSSP are not always consistent with those originally defined by the authors. For example, the structures with PDB codes 4TQL and 1P68, which were designed with three and four α-helices, respectively, were assigned four and five α-helices because of partially distorted α-helices.
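The helix criterion above can be sketched in a few lines of Python. The sketch assumes that the DSSP assignment has already been reduced to a one-letter-per-residue string; the function name `helix_segments` and the toy input are illustrative, and parsing of actual DSSP output files is not shown.

```python
import re

MIN_HELIX_LENGTH = 5  # at least five successive 'H' residues, as stated above


def helix_segments(dssp_string, min_length=MIN_HELIX_LENGTH):
    """Return (start, end) index pairs (0-based, inclusive) of α-helices.

    A helix is a maximal run of 'H' characters of at least `min_length`
    residues; shorter runs are treated as non-helical.
    """
    segments = []
    for match in re.finditer(r"H+", dssp_string):
        if match.end() - match.start() >= min_length:
            segments.append((match.start(), match.end() - 1))
    return segments


# Toy assignment string: the three-residue run of 'H' is too short to count.
print(helix_segments("-HHHHHHH--HHH-HHHHH-"))  # [(1, 7), (14, 18)]
```

Under this rule, a helix interrupted by even a single non-‘H’ residue is split into two segments (or dropped if the pieces are too short), which is one way a partially distorted helix can change the helix count relative to the designers' own annotation.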

Clustering of HLH units using the five features representing an HLH geometry

A total of 13,667 HLH structures were extracted from 7,280 X-ray structures (secondary structures were assigned by DSSP57), obtained from the PISCES server59, with resolution ≤2.5 Å, R-factor ≤0.3, sequence length ≥40 residues and ≤25% sequence identity. We then classified the HLH structures by their loop lengths and extracted 13,510 HLH structures in total with loops of one to five residues. The extracted HLH structures were clustered for each loop length from one to five using the density clustering algorithm30 (Extended Data Fig. 3), with the five features representing an HLH geometry (Extended Data Fig. 2).

Building backbone structures

α-Helical backbone structures were built using Rosetta by exhaustively sampling conformations that are steric-clash free (Rosetta vdw score <4.0 using the weight value, 0.1) and have a small radius of gyration (<14 Å; the threshold value corresponds to the peak of the distribution of the radius of gyration for naturally occurring proteins; Supplementary Fig. 11). The conformations were assembled by combining canonical α-helices ranging from 5 to 35 residues (backbone torsion angles, phi, psi and omega, were set to −60.0°, −45.0° and 180.0°, respectively) and the identified 18 HLH motifs (Main and Fig. 3a), with length constraints of 90 and 110 residues for the five- and six-helix proteins, respectively. For generating five-helix structures, 64,440,995 steric-clash-free four-helix structures with 70 residues were first generated, and then an α-helix with 18 types of connecting loops was appended to the C-terminus of the generated four-helix structures so that the total length becomes 90 residues. For generating six-helix structures, an α-helix with 18 types of connecting loops was appended to the N-terminus of the generated five-helix structures so that the total length becomes 110 residues. From these structures, the globular five- and six-helix structures were collected on the basis of the radius of gyration.
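The compactness filter can be illustrated with the following Python sketch. It assumes an unweighted radius of gyration computed over one representative atom per residue (for example, Cα); the radius of gyration reported by Rosetta may be computed differently, and the names `radius_of_gyration` and `is_compact` are illustrative. The steric-clash filter is not reproduced here.

```python
import math

RG_MAX_ANGSTROM = 14.0  # radius-of-gyration cutoff quoted in the text


def radius_of_gyration(coords):
    """Unweighted radius of gyration, in the same length unit as the input."""
    n = len(coords)
    if n == 0:
        raise ValueError("at least one coordinate is required")
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    mean_square = sum(
        sum((p[k] - center[k]) ** 2 for k in range(3)) for p in coords
    ) / n
    return math.sqrt(mean_square)


def is_compact(coords, rg_max=RG_MAX_ANGSTROM):
    """True if the structure passes the radius-of-gyration cutoff."""
    return radius_of_gyration(coords) < rg_max


# Toy example: 90 residues placed along one straight helix axis,
# using the ideal α-helical rise of 1.5 Å per residue.
rod = [(0.0, 0.0, 1.5 * i) for i in range(90)]
print(round(radius_of_gyration(rod), 1), is_compact(rod))
```

A single fully extended 90-residue helix lies far above the 14 Å cutoff, so only arrangements in which the helices fold back on one another can pass this filter.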

Selection of designs for experimental characterization based on the shapes of energy landscapes

We set three criteria for the selection by the shapes of energy landscapes. First, the overall shape of the landscape should be funnel-like with an apparent and sharp ‘neck’ reaching the low root-mean-square deviation (RMSD) and low-energy region, which is the hallmark of foldability specifically into the target conformation. This is the most important criterion in the selection of energy landscape plots: for ill-designed sequences, all conformations remain in the high-RMSD and high-energy regions and do not have such a ‘neck’. Second, the funnel should not have subminima, which would indicate that the protein has alternative folded states. This criterion excludes the possibility of misfolding and avoids a rugged energy landscape. Third, the ensemble of lowest-RMSD and lowest-energy conformations at the bottom of the funnel should not be far from, and ideally should overlap with, the conformational ensemble in the simulations starting from the target structure. This criterion is not mandatory, but consistency between fragment assembly simulations that offer global sampling and near-native relax simulations helps us to rank designs of similar quality in terms of the first and second criteria.

Expression and purification of designed proteins

The genes encoding the designed sequences were synthesized and inserted into pET21b vectors. The whole plasmid constructs were purchased from FASMAC or Eurofins Genomics. The target proteins were overexpressed by IPTG induction in E. coli BL21 Star (DE3) cells cultured in MJ9 minimal media including 15N ammonium sulfate as the sole nitrogen source and 12C glucose as the sole carbon source60. The expressed uniformly (U-)15N-labeled proteins with a 6xHis tag at the C-terminus were purified by Ni2+-affinity columns. The purified proteins were then dialyzed against phosphate-buffered saline (PBS) buffer (137 mM NaCl, 2.7 mM KCl, 10 mM Na2HPO4 and 1.8 mM KH2PO4, at pH 7.4); this buffer was used for all the experiments except NMR structure determination. The expression level, solubility and purity of each designed protein were evaluated by sodium dodecyl sulfate–polyacrylamide gel electrophoresis. To further confirm them, the samples were analyzed by mass spectrometry (Bruker Daltonics REFLEX III and Thermo Scientific Orbitrap Elite).

Experiments to identify designed proteins exhibiting folding ability

The following three experiments were conducted to evaluate the folding ability of designed sequences: CD spectroscopy, size-exclusion chromatography with multi-angle light scattering (SEC–MALS) and 1H-15N HSQC NMR spectroscopy. Supplementary Tables 2–6 present the results of the evaluations for each designed sequence for each fold.

CD spectroscopy under 1-bar pressure

Far-UV CD spectra were measured to examine whether the designs show the characteristic spectra of α-helical proteins, by scanning from 260 to 200 nm at 20 °C for 15 μM protein samples in PBS buffer on a JASCO J-1500 CD spectrometer. The measurements were performed four times and then averaged.

SEC–MALS

Oligomeric states of the designs in solution were studied by SEC–MALS with a miniDAWN TREOS static light scattering detector (Wyatt Technology Corp.) combined with a high-performance liquid chromatography system (1260 Infinity LC, Agilent Technologies) with a Shodex KW-802.5 column (Showa Denko K.K.) for H5_fold-0_Chantal and H6_fold-C_Rei or a Superdex 75 increase 10/300 GL column (GE Healthcare) for H5_fold-0_Elsa, H6_fold-Z_Gogy and H7_fold-K_Mussoc. After the equilibration of the column with PBS buffer, 100 µl of the samples after purification by Ni2+-affinity columns were injected. The absorbance at 280 nm was measured by the high-performance liquid chromatography system to give the protein concentrations, and the intensity of light scattering at 659 nm was measured at angles of 43.6°, 90.0° and 136.4°. These data were analyzed by the ASTRA software (version 6.1.2, Wyatt Technology) using a refractive index increment, dn/dc, of 0.185 ml g−1 to estimate the molecular weight of dominant peaks.

1H-15N HSQC NMR spectroscopy

Whether the designs fold into well-packed structures was evaluated by 1H-15N HSQC 2D-NMR spectroscopy. The purified protein samples were concentrated to 0.2–1.0 mM and mixed with D2O at 10% of their volume. The experiments were performed at 25 °C on a JEOL JNM-ECA 600 MHz spectrometer, and data were analyzed by JEOL Delta (version 5.3.1).

High-pressure CD spectroscopy for melting temperature (Tm) estimation

For the designs that were evaluated to have the folding ability in the above experiments (one design for each target topology was selected), thermal denaturation was studied by using high-pressure CD spectroscopy. A JASCO J-1500 CD spectrometer was equipped with additional pressure instruments so that the temperature of the solution samples could be scanned from 30 °C to 170 °C under 10 bar. The temperature was increased at 1 °C per minute for 15 μM protein samples. Fixed-wavelength measurements at 222 nm were performed at every 1 °C, and wavelength scanning measurements (260 to 200 nm) were performed at 30, 40, 60, 80, 90, 100, 110, 120, 130, 140, 150, 160 and 170 °C. Thermal denaturation was measured once. Tm was estimated by nonlinear fitting to the thermal denaturation CD curve at 222 nm. The nonlinear least-squares analysis was performed with the nls function in R, assuming a two-state unfolding and linear extrapolation model. From this fit, we obtained Tm as the temperature at which the estimated populations of folded and unfolded states become equal.
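To make the two-state picture concrete, the following Python sketch evaluates one common parameterization of such a model. It is not the fitting code used in this work (the fit itself was done with nls in R): the free energy form ΔG(T) = ΔHm(1 − T/Tm), which neglects any heat capacity change, the linear baselines and the default baseline values are all assumptions made for illustration, as are the function names.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def fraction_unfolded(temp_k, tm_k, dh_m_kj):
    """Unfolded population for a two-state transition.

    Assumes dG(T) = dH_m * (1 - T / Tm), that is, no heat capacity change.
    """
    dg = dh_m_kj * (1.0 - temp_k / tm_k)
    ln_k = -dg / (R_KJ * temp_k)  # K = [unfolded] / [folded]
    if ln_k >= 0:
        return 1.0 / (1.0 + math.exp(-ln_k))
    k = math.exp(ln_k)
    return k / (1.0 + k)


def cd_signal(temp_k, tm_k, dh_m_kj, folded=(-20.0, 0.0), unfolded=(-4.0, 0.0)):
    """Population-weighted signal with linear baselines (intercept, slope).

    The default baseline values are hypothetical placeholders.
    """
    f_u = fraction_unfolded(temp_k, tm_k, dh_m_kj)
    y_folded = folded[0] + folded[1] * temp_k
    y_unfolded = unfolded[0] + unfolded[1] * temp_k
    return (1.0 - f_u) * y_folded + f_u * y_unfolded


# At T = Tm the folded and unfolded populations are equal by construction.
print(fraction_unfolded(380.0, 380.0, 300.0))  # 0.5
```

In a fit, Tm, ΔHm and the four baseline parameters would be adjusted to minimize the squared deviation between `cd_signal` and the measured ellipticity at 222 nm; Tm is then the temperature at which `fraction_unfolded` equals 0.5.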

CD spectroscopy for chemical denaturation

Chemical denaturation with GuHCl was monitored at 222 nm for 2–3 μM protein samples in PBS buffer (pH 7.4) at 25 °C in a 1-cm path length cuvette. The GuHCl concentration was automatically controlled by a JASCO ATS-530 titrator. Chemical denaturation was measured once. The chemical denaturation curves were fit by nonlinear least-squares analysis using a two-state unfolding and linear extrapolation model61. The free energy change, ΔG, for the unfolding transition and its dependency on the denaturant, m-value, were obtained from the fitting.
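For reference, a commonly used form of the two-state linear extrapolation model is written out below. This is a sketch under stated assumptions rather than the exact fitting function used here (see ref. 61 for the model): the symbols \(\Delta G^{\mathrm{H_2O}}\) (free energy of unfolding extrapolated to zero denaturant), \(\theta_{\mathrm{F}}\) and \(\theta_{\mathrm{U}}\) (folded and unfolded baselines, typically taken to be linear in denaturant concentration) and \(C_{\mathrm{m}}\) (transition midpoint) are notation introduced for illustration.

$$\Delta G([\mathrm{GuHCl}])=\Delta G^{\mathrm{H_2O}}-m\,[\mathrm{GuHCl}],\qquad K([\mathrm{GuHCl}])=\exp \left(-\frac{\Delta G([\mathrm{GuHCl}])}{RT}\right),$$

$$\theta_{222}([\mathrm{GuHCl}])=\frac{\theta_{\mathrm{F}}+\theta_{\mathrm{U}}\,K([\mathrm{GuHCl}])}{1+K([\mathrm{GuHCl}])},\qquad C_{\mathrm{m}}=\frac{\Delta G^{\mathrm{H_2O}}}{m}.$$

In this form, the m-value is the slope of the unfolding free energy with respect to denaturant concentration, and the folded and unfolded populations are equal at \([\mathrm{GuHCl}]=C_{\mathrm{m}}\), where \(\Delta G=0\) and \(K=1\).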

Sample preparation for NMR structure determination

The most promising design for each target topology was overexpressed by IPTG induction in E. coli BL21 Star (DE3) cells cultured in MJ9 minimal media containing 15N ammonium sulfate as the sole nitrogen source and 13C glucose as the sole carbon source60. The expressed U-15N,U-13C-enriched proteins were purified by Ni2+-affinity columns, and dialyzed against PBS buffer. The protein samples were further purified by gel filtration chromatography on an ÄKTA Pure 25 FPLC (GE Healthcare) using a Superdex75 or Superdex75 increase 10/300 GL column (GE Healthcare), which also replaced the PBS buffer at pH 7.4 with the customized buffer for NMR spectroscopy. The following 95% H2O/5% D2O buffer conditions for each sample were used: 100 mM NaCl, 5.6 mM Na2HPO4, 1.1 mM KH2PO4, at pH 7.4 for H5_fold-0_Chantal; 50 mM NaCl, 5.5 mM Na2HPO4, 4.5 mM KH2PO4, at pH 6.9 for H6_fold-C_Rei; 50 mM NaCl, 3.2 mM Na2HPO4, 4.5 mM KH2PO4, at pH 6.5 for H6_fold-Z_Gogy; 155 mM NaCl, 3.0 mM Na2HPO4, 1.1 mM KH2PO4, 10 μM ethylenediaminetetraacetic acid, 0.02% NaN3, cOmplete protease inhibitor cocktail (Roche), at pH 7.4 for H6_fold-U_Nomur; and 155 mM NaCl, 3.0 mM Na2HPO4, 1.1 mM KH2PO4, at pH 7.4 for H7_fold-K_Mussoc.

Solution structure determination by NMR

NMR measurements

NMR measurements were performed on Bruker AVANCE III NMR spectrometers equipped with QCI cryo-Probes at 303 K. Spectrometers with 600, 700 and 800 MHz magnets were used for the signal assignments and nuclear Overhauser effect (NOE)-related measurements, whereas those with 700, 900 and 950 MHz magnets were used for residual dipolar coupling (RDC) experiments. For the signal assignments, 2D 1H-15N HSQC (echo/anti-echo), 1H-13C Constant-Time HSQC for aliphatic and aromatic signals, and 3D HNCO, HN(CO)CACB and HNCACB for backbone signal assignments were measured; the BEST pulse sequence was applied to the triple resonance measurements for H6_fold-C_Rei. For structure determination, 3D 15N-edited NOESY and 3D 13C-edited NOESY for aliphatic and aromatic signals (mixing time 100 ms) were measured. For H6_fold-U_Nomur, additional 3D HN(CA)CO, HN(CO)CA, HNCA, HBHA(CO)NH, HBHANH, H(CCCO)NH, CC(CO)NH, 3D 13C-HSQC (13C-t1) NOESY 13C-HSQC, 3D 13C-HSQC (13C-t1) NOESY 15N-HSQC and 4D 13C-HSQC NOESY 13C-HSQC were measured. Except for the 3D-edited NOESY, all spectra for H6_fold-U_Nomur and H7_fold-K_Mussoc were acquired using non-uniform sampling (NUS). For NUS, the sampling ratio was set at 25% for 3D and 6% for 4D spectra with a fixed random seed. The NUS spectra were reconstructed by iteratively re-weighted least squares for 3D spectra and by iterative soft thresholding for 4D spectra with the virtual-echo technique, using the qMDD tool62.
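To make the idea of a fixed-seed NUS schedule concrete, the sketch below selects a reproducible subset of indirect-dimension grid points in Python. It is a simplified assumption: uniform random selection is used here, whereas the actual schedule generator and any sampling weighting used in this study are not specified in the text, and the grid dimensions and seed are hypothetical.

```python
import random


def nus_schedule(n_t1, n_t2, ratio, seed):
    """Pick a reproducible random subset of (t1, t2) increments.

    n_t1, n_t2: number of increments in the two indirect dimensions.
    ratio: fraction of grid points to acquire (for example 0.25).
    seed: fixed random seed, so the same schedule is regenerated each time.
    """
    grid = [(i, j) for i in range(n_t1) for j in range(n_t2)]
    n_keep = round(len(grid) * ratio)
    rng = random.Random(seed)
    return sorted(rng.sample(grid, n_keep))


# Hypothetical 64 x 32 grid sampled at 25%
schedule = nus_schedule(64, 32, 0.25, seed=7)
```

Because the seed is fixed, calling nus_schedule again with the same arguments returns the identical list of increments, which is what makes the sampling pattern reproducible between experiments.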

For the RDC experiments, 2D in-phase and anti-phase (IPAP) 1H-15N HSQC spectra using water-gate pulses for water suppression were measured with and without 6–10 mg ml−1 of Pf1 phage (ASLA Biotech). To confirm the positions of the 1H-15N signals in the 2D IPAP 1H-15N HSQC, 3D HNCO was measured under the identical buffer condition containing Pf1 phage. The α- and β-states of 15N signals split by 1H-15N 1J-coupling were separately identified for the protein in the isotropic and weakly aligned states, to obtain one-bond RDC \({}^{1}D_{{}^{1}{\rm{H}}/{}^{15}{\rm{N}}}\) values. For the sample H6_fold-U_Nomur, 3D J-HNCO (without 1H decoupling during 15N evolution) was measured at 25% NUS and used to confirm the α- and β-state positions of 15N signals that overlapped in the 2D IPAP spectra. A 3D J-HN(CO)CA spectrum was also measured for H6_fold-U_Nomur to obtain \({}^{1}D_{{}^{1}{\rm{H}}\alpha /{}^{13}{\rm{C}}\alpha }\), providing additional alignment data at the identical magnetic field and alignment tensor.
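The arithmetic behind an RDC value can be sketched as follows. This Python example assumes that the splitting in the isotropic state equals J and that in the weakly aligned state equals J + D, with both splittings taken with the same sign convention; the function name and the numerical values are hypothetical.

```python
def rdc_from_ipap(iso_alpha, iso_beta, aligned_alpha, aligned_beta):
    """One-bond RDC (Hz) from the alpha/beta component positions (Hz).

    Assumption: isotropic splitting = J, aligned splitting = J + D,
    so D is the difference between the two splittings.
    """
    iso_split = iso_alpha - iso_beta
    aligned_split = aligned_alpha - aligned_beta
    return aligned_split - iso_split


# Hypothetical 15N positions (Hz) of the two doublet components
d_nh = rdc_from_ipap(7153.5, 7246.5, 7157.0, 7242.5)
```

In this made-up example the isotropic splitting is −93.0 Hz and the aligned splitting is −85.5 Hz, which gives D = 7.5 Hz. The sign of the reported value depends on the convention adopted for J.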

NMR signal assignments

All NMR signals were identified in a fully automated manner using MagRO-NMRViewJ (an upgraded version of Kujira40), in which noise peaks were filtered by deep-learning methods using Filt_Robot41. The FLYA module was used for fully automated signal assignments and structure calculation63 to obtain roughly assigned chemical shifts (Acs), and the trustworthy ones were then selected into the MagRO Acs table. After confirmation and correction of the Acs by visual inspection using MagRO, TALOS+64 calculations were performed to predict phi/psi dihedral angles, which were then converted to angle constraints in the CYANA format.

Structure calculation

Several CYANA65 calculations were performed using the Acs table, NOE peak table and dihedral angle constraints. The Acs table was exported by the MagRO CYANA module, and the aliased chemical shifts were then automatically calculated according to the spectral width of the corresponding NOESY spectra. The phi and psi dihedral angle constraints, with their deviations, were derived from TALOS+ predictions using the chemical shifts of 15N, 13C′, 13Cα and 13Cβ, restricted to predictions with the high score noted as ‘Good’. The minimal angle deviation was set at 20°. After several iterations of CYANA calculations, dihedral angle constraints derived from TALOS+64 that showed large violations in nearly all models of the structure ensemble were eliminated.

After the averaged target function of the ensemble fell below 2.0 Å2, refinement calculations with Amber12 were carried out for the 20 models with the lowest target functions. The coordinates of final.pdb calculated by CYANA, the distance constraints (final.upl) and the dihedral angle constraints derived from TALOS+ prediction were converted into Amber format and a topology file using Sander Tools. First, 500 steps of minimization (250 steps of steepest descent and 250 steps of conjugate gradient) were carried out without electrostatic potential or NMR constraints. Second, molecular dynamics simulations with the ff99SB force field using an implicit water system (ionic strength of 0.1 M, cutoff of 18.0 Å) were performed, in which the temperature was gradually increased from 0.0 K to 300.0 K over 1,500 steps, followed by a simulation of 28,500 steps at 300.0 K (1.0 fs time step, 30 ps in total). Finally, 2,000 steps of minimization (1,000 steps of steepest descent and 1,000 steps of conjugate gradient) with distance and dihedral angle constraints were applied under the same conditions as used in the molecular dynamics simulations.
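The step counts of the molecular dynamics stage can be checked with a short Python sketch. The step numbers and time step are taken from the protocol above, whereas the linear shape of the heating ramp is an assumption (the text states only that the temperature was gradually increased), and the function names are hypothetical.

```python
HEAT_STEPS = 1_500    # heating phase, 0 K -> 300 K
HOLD_STEPS = 28_500   # constant-temperature phase at 300 K
TIME_STEP_FS = 1.0    # integration time step in femtoseconds


def target_temperature(step, t_start=0.0, t_end=300.0):
    """Target temperature (K) at a given MD step, assuming a linear ramp."""
    if step >= HEAT_STEPS:
        return t_end
    return t_start + (t_end - t_start) * step / HEAT_STEPS


def total_time_ps():
    """Total simulated time in picoseconds (1 ps = 1,000 fs)."""
    return (HEAT_STEPS + HOLD_STEPS) * TIME_STEP_FS / 1000.0
```

With these numbers, the heating and constant-temperature phases together comprise 30,000 steps, which corresponds to the 30 ps stated in the protocol.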

NMR structure validation

The RMSD values were calculated for the 20 structures overlaid to the mean coordinates for the ordered regions, automatically identified by Filt_Robot using multi-dimensional nonlinear scaling54.
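A minimal Python sketch of an RMSD-to-mean calculation is given below. It assumes that the models have already been superimposed and restricted to the ordered-region atoms; the superposition step and the automatic identification of ordered regions by Filt_Robot are not reproduced, and the function names are hypothetical.

```python
import math


def mean_coordinates(models):
    """Average position of each atom over all models.

    models: list of models, each a list of (x, y, z) tuples in the same
    atom order. Assumption: the models are already superimposed.
    """
    n_models = len(models)
    n_atoms = len(models[0])
    return [
        tuple(sum(m[a][k] for m in models) / n_models for k in range(3))
        for a in range(n_atoms)
    ]


def rmsd_to_mean(models):
    """RMSD of every model from the mean coordinates."""
    mean = mean_coordinates(models)
    n_atoms = len(mean)
    values = []
    for m in models:
        sq = sum(
            (m[a][k] - mean[a][k]) ** 2
            for a in range(n_atoms)
            for k in range(3)
        )
        values.append(math.sqrt(sq / n_atoms))
    return values
```

The ensemble value would then typically be reported as the average of the per-model RMSDs over the 20 structures.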

The RDC back-calculation was performed by PALES66 using the experimentally determined RDC values. The averaged correlation between the simulated and experimental values was obtained using all signals except those of residues in overlapped regions of the 1H-15N HSQC and those of residues with order parameters below 0.8 as predicted by TALOS+. For the validation of H6_fold-U_Nomur, many signals overlapped in the 2D IPAP-HSQC spectra. To overcome this problem, \({}^{1}J_{{\mathrm{HN}}-{}^{15}{\rm{N}}}\) split 3D HNCO (without a 1H-decoupling scheme in the 15N evolution period) spectra in isotropic and anisotropic states were measured by NUS (25% data point reduction) to obtain the signal positions of the α- and β-states of 15N spins at a resolution of 0.3 Hz. \({}^{1}J_{{\mathrm{H}}{\rm{\alpha }}/{}^{13}{\mathrm{C}}{\rm{\alpha }}}\) split 3D HN(CO)CA spectra were also measured under the same conditions to obtain \(\,{}^{1}D_{{}^{1}{\rm{H}}{\rm{\alpha }}/{}^{13}{\rm{C}}{\rm{\alpha }}}\) at a resolution of 0.2 Hz. Initially, the RDC reproducibility of H6_fold-U_Nomur was examined separately with the \({}^{1}D_{{\mathrm{HN}}-{}^{15}{\rm{N}}}\) and \({}^{1}D_{{\mathrm{H}}{\rm{\alpha }}-{}^{13}{\mathrm{C}}{\rm{\alpha }}}\) tables using PALES for all models to confirm that the averaged correlation coefficients were greater than 0.9, and the final correlation coefficients were then calculated with the two tables merged.
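The agreement between back-calculated and experimental RDCs can be summarized by a correlation coefficient, as in the following Python sketch. The use of the Pearson coefficient is an assumption about the definition of the reported correlation; the sketch does not call PALES, and the function names are hypothetical.

```python
import math


def pearson_r(experimental, simulated):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(experimental)
    mean_e = sum(experimental) / n
    mean_s = sum(simulated) / n
    cov = sum((e - mean_e) * (s - mean_s)
              for e, s in zip(experimental, simulated))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    return cov / math.sqrt(var_e * var_s)


def averaged_correlation(experimental, simulated_per_model):
    """Mean correlation over all models of the ensemble."""
    rs = [pearson_r(experimental, sim) for sim in simulated_per_model]
    return sum(rs) / len(rs)
```

Here experimental would hold the measured RDCs of the residues retained after filtering, and simulated_per_model one list of back-calculated values per structure in the ensemble.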

Solution structural dynamics of H6_fold-U_Nomur measured by NMR

15N R1, R2 and 15N-{1H} NOE experiments

The 15N R1, R2 and 15N-{1H} NOE measurements were performed on a uniformly 15N-labeled H6_fold-U_Nomur protein sample at a concentration of 0.78 mM, under the same solution conditions as used for the structure determination. The measurements were conducted at 303 K on a Bruker 700 MHz Avance-III NMR spectrometer equipped with a cryogenic probe, using a 4-mm-diameter Shigemi NMR tube. The 15N R1 and R2 were obtained by measuring 2D 1H-15N HSQC with the inversion-recovery technique and with the temperature-compensated CPMG method, respectively67. Steady-state 15N-{1H} NOE was obtained by measuring 2D 1H-15N HSQC spectra with and without a saturation pulse at each retardation time, acquired by the interleaved method. The 2D 1H-15N peaks were automatically identified and assigned using the MagRO software40. Some assignments were corrected by visual inspection. The 15N-{1H} NOE values were estimated as the peak intensity ratio I/I0 derived from the 2D HSQC spectra with (I) and without (I0) the saturation pulse. The I/I0 data were fitted to an exponential equation, I/I0 = exp(−R × t), with delay time t (s) to obtain the 15N relaxation rate constant R (s−1).
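The exponential relation I/I0 = exp(−R × t) can be fitted in closed form after taking logarithms, as in the Python sketch below. This log-linear least-squares fit through the origin is a simplification chosen for illustration and is not necessarily the fitting procedure used here; the function name and the synthetic data are hypothetical.

```python
import math


def fit_relaxation_rate(delays_s, ratios):
    """Estimate R (s^-1) from I/I0 = exp(-R * t).

    Uses ln(I/I0) = -R * t and a least-squares line through the origin,
    an assumption made to keep the example dependency-free.
    """
    num = sum(t * math.log(r) for t, r in zip(delays_s, ratios))
    den = sum(t * t for t in delays_s)
    return -num / den


# Synthetic, noise-free example with R = 1.5 s^-1
delays = [0.01, 0.05, 0.1, 0.2, 0.4]
ratios = [math.exp(-1.5 * t) for t in delays]
rate = fit_relaxation_rate(delays, ratios)
```

With noisy data, a nonlinear fit of the untransformed intensities weights the points differently and is usually preferred; the log-linear form is shown only because it makes the relation between R and the measured ratios explicit.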

2D 1H-15N CLEANEX-PM FHSQC experiments

The uniformly 15N-labeled protein sample of H6_fold-U_Nomur was lyophilized, and 2D 1H-15N HSQC data were then collected immediately after dissolving the lyophilized sample in 100% D2O. However, the amide protons of most residues were replaced by deuterium within 10 min after dissolution, probably owing to the high pH of the sample solvent (pH 7.4). This prevented us from obtaining practical H–D exchange rates. Therefore, the exchange rates between the water and amide protons were obtained using the 2D 1H-15N CLEANEX-PM FHSQC68,69 scheme. In this method, the exchange ratio depends only on kopen in the protein folding/unfolding. Because of the relatively high pH of 7.4, the amide groups would be in the EX1 limit, namely kclose « k, where kclose is the global and/or local folding rate of the protein and k is the exchange rate of the amide group in the unfolded state; the observable solvent exchange rate kex would therefore be obtained as the global and/or local unfolding rate of the protein, kopen. The 2D 1H-15N FHSQC data without the spin-lock pulse were also measured under the same conditions to obtain the reference, I0. For the 2D 1H-15N CLEANEX-PM FHSQC spectra with different spin-lock times tm and the reference spectrum, the observed peaks were automatically identified and assigned by MagRO40 with manual correction to obtain a normalized list of signal intensities for each residue. The following equation was used to obtain kex for each residue:

$$\frac{I}{{I}_{0}}=\frac{{k}_{{{\mathrm{ex}}}}}{{k}_{{{\mathrm{ex}}}}+{R}_{1{\mathrm{A}}}-{R}_{1{\mathrm{B}}}}\times \left\{\exp \left(-{R}_{1{\mathrm{B}}}\times {t}_{m}\right)-\exp \left[-\left({R}_{1{\mathrm{A}}}+{k}_{{{\mathrm{ex}}}}\right)\times {t}_{m}\right]\right\},$$

where R1B is the apparent longitudinal relaxation rate of water molecules, and R1A is a mixture of the apparent longitudinal and transverse relaxation rates in the rotating frame for the residue of interest. The values of R1A and kex for each residue, with error values, were obtained by curve fitting to this equation under the assumption R1B = 0.6 s−1.
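The build-up curve defined by the equation above can be evaluated with the following Python sketch, which may help in checking a fit or generating synthetic data. The function name and the example parameter values are hypothetical; only the functional form and the fixed value R1B = 0.6 s−1 come from the text.

```python
import math


def cleanex_ratio(t_m, k_ex, r1a, r1b=0.6):
    """I/I0 as a function of spin-lock (mixing) time t_m in seconds.

    k_ex: solvent exchange rate (s^-1)
    r1a:  apparent relaxation rate of the amide proton (s^-1)
    r1b:  apparent longitudinal relaxation rate of water (s^-1)
    """
    prefactor = k_ex / (k_ex + r1a - r1b)
    decay_water = math.exp(-r1b * t_m)
    decay_amide = math.exp(-(r1a + k_ex) * t_m)
    return prefactor * (decay_water - decay_amide)


# Hypothetical residue with k_ex = 5 s^-1 and R1A = 20 s^-1
curve = [cleanex_ratio(t, 5.0, 20.0) for t in (0.0, 0.005, 0.01, 0.02, 0.05)]
```

The ratio is zero at t_m = 0 and initially rises with a slope equal to kex, which is why short spin-lock times are the most informative for determining the exchange rate.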

X-ray structure determination of H5_fold-0_Elsa

Sample preparation for X-ray structure determination

The gene encoding the designed sequence of H5_fold-0_Elsa in the pET21b vector was digested at the NdeI and XhoI restriction sites and cloned into a pET15b-TEV vector, in which the cleavage site between the designed sequence and the N-terminal 6xHis tag is recognized by TEV protease instead of thrombin (original). The designed protein was expressed in E. coli BL21 Star (DE3) cells and purified with a Ni2+-affinity column. The N-terminal His tag was then cleaved by TEV protease and removed by passage through a Ni2+-affinity column. The protein samples without the His tag were purified by anion-exchange chromatography (HiTrapQ HP 1-ml column, GE Healthcare) followed by gel filtration chromatography (Superdex 75 10/300 GL column) on an ÄKTA Pure 25 FPLC. Mass spectrometry was performed to confirm that the His tag was successfully cleaved.

To assess the effect of the tag cleavage on the oligomeric state and stability, we performed SEC–MALS and thermal denaturation CD experiments under high pressure for the original and tag-cleaved samples of H5_fold-0_Elsa. The solvent was exchanged to PBS at pH 7.4 before these experiments. The results showed that the tag-cleaved protein was also monomeric and had a denaturation temperature (the second row in Fig. 5c, 106 °C) nearly identical to that of the original sample with the C-terminal His tag (Supplementary Fig. 12, 105 °C), which indicates that removal of the tag and slight differences in the flanking amino-acid sequences do not substantially change the stability or oligomeric state of the designed protein in solution.

Crystallization and X-ray structure determination

The protein sample of H5_fold-0_Elsa at a concentration of 12 mg ml−1 (1.07 mM) was crystallized in a solution of 0.4 M MgCl2, 0.1 M Tris–HCl (pH 7.5) and 30% PEG 3350, using the sitting-drop vapor diffusion method at 296 K. The obtained crystals were soaked in a solution of 0.4 M MgCl2, 0.1 M Tris–HCl (pH 7.5), 30% PEG 3350 and 10% glycerol, mounted on cryo-loops (Hampton Research), flash-cooled and stored in liquid nitrogen.

X-ray diffraction data of the crystal were collected at the BL-1A beamline (λ = 1.1000 Å) of the Photon Factory and processed to 2.3 Å by XDS70. After phase determination by molecular replacement with Molrep71 in the CCP4 suite, using the design model as the search model, the molecular model was constructed and refined using Coot72 and Phenix Refine73. Translation/Libration/Screw refinement was performed in the late stages of refinement. The refined structures were validated with RAMPAGE74. Ramachandran plot statistics showed that 98.8% and 0.00% of residues were in favored and outlier regions, respectively. The crystallographic data collection is summarized in Table 2.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.