Exploration of novel αβ-protein folds through de novo design

A fundamental question in protein evolution is whether nature has exhaustively sampled nearly all possible protein folds throughout evolution, or whether a large fraction of the possible folds remains unexplored. To address this question, we defined a set of rules for β-sheet topology to predict novel αβ-folds and carried out a systematic de novo protein design exploration of the novel αβ-folds predicted by the rules. The designs for all eight of the predicted novel αβ-folds with a four-stranded β-sheet, including a knot-forming one, folded into structures close to the design models. Further, the rules predicted more than 10,000 novel αβ-folds with five- to eight-stranded β-sheets; this number far exceeds the number of αβ-folds observed in nature so far. This result suggests that a vast number of αβ-folds are possible, but have not emerged or have become extinct due to evolutionary bias.

The structural diversity of proteins underlies their functional variety. The overall protein structure is determined by its "fold", the spatial arrangement of and connections between secondary structure elements. The number of naturally occurring protein structures solved and deposited in the Protein Data Bank (PDB) now runs to hundreds of thousands and continues to grow. On the other hand, the discovery of novel protein folds has recently become a rare event [1][2][3], suggesting that almost all folds existing in nature have already been found. However, this does not necessarily indicate that we have uncovered all folds accessible to the polypeptide chain. Although debated 4-7, natural evolution may have sampled only a small fraction of the possible fold space: there possibly exists a vast fold space not explored by natural evolution [5][6][7].
We investigate this possibility through de novo protein design of folds that have not been sampled by natural evolution. Recently developed principles for designing protein structures have made possible the design of a wide range of new proteins from scratch [9][10][11][12][13][14], allowing us to explore the huge sequence space beyond evolution. However, in terms of the fold space, the exploration has been limited to naturally occurring protein folds 9-14, except for one new fold of a protein called Top7 8. To explore the fold space beyond evolution, a "map" to search for the folds that are possible but not observed in nature (i.e., novel folds) is indispensable; we derive a set of rules for β-sheet topology to predict novel folds. Here, we carry out systematic exploration of novel αβ-folds through de novo protein design, guided by the rules.

Predicted novel αβ-folds with a four-stranded β-sheet
Using the set of rules for the β-sheet topology, we classified all of the open β-sheet topologies with three to eight strands into frustration-free ones without violations of the rules (these are regarded as possible topologies) and frustrated ones containing violations. We found that many of the observed αβ-folds were identified as frustration-free, while most of the unobserved αβ-folds, including scarcely observed ones, were identified as frustrated (Fig. 3a and b, see Methods). Moreover, the frustration-free β-sheet topologies were observed in a larger number of homologous groups (that is, evolutionarily independent groups, which are referred to as superfamilies in SCOP 2 and CATH 3) than the frustrated ones (Fig. 3c, see Methods). These results suggest that the set of rules can distinguish possible β-sheet topologies among all patterns of β-sheet topologies.
We illustrate the 96 patterns of frustrated and frustration-free β-sheet topologies for the 4-stranded αβ-proteins in Fig. 3d. Light-grey and dark-grey topologies in cells represent frustrated and frustration-free ones, respectively. The number immediately below the topology illustration in each cell shows the observation frequency in nature, and the cell background colour also represents the observation frequency with a colour gradation from white (none) to yellow (abundant). About half of the topologies (53 patterns) are frustrated, and 37 of them are either unobserved or very rare in nature. For example, the frustrated topology located in the cell 1342 (strand order) -↑↓↓↓ (strand direction), which violates the connection jump-distance and connection overlap rules (the violations are indicated by red loops), is not observed in nature at all. In contrast, the other half (43 patterns) are frustration-free β-sheet topologies, and 35 of them are observed in nature. For example, the frustration-free β-sheet topology located in the cell 1234 -↑↓↑↓, called 'meander,' is the most frequently observed one. Here, we identified the eight frustration-free β-sheet topologies that are unobserved or very rare in nature. We regarded the αβ-folds with these β-sheet topologies as possible, and attempted de novo design for all of them. Note that the β-sheet topology "8", which consists of parallel β-strands with the 3142 strand order, forms a knot; it has been known as an unobserved β-sheet topology in nature and has long been considered impossible 20,21, but we nevertheless selected it.
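The count of 96 four-stranded patterns can be reproduced with a short enumeration. The sketch below is only an illustration under two assumed conventions that are not spelled out in the text: a strand order and its mirror image are treated as the same topology (the reading with strand 1 further left is kept), and the first direction symbol is fixed as "up". The function name `enumerate_topologies` is hypothetical.

```python
from itertools import permutations, product


def enumerate_topologies(n_strands):
    """List (strand_order, directions) labels for an open sheet of n_strands."""
    topologies = []
    for order in permutations(range(1, n_strands + 1)):
        mirrored = order[::-1]
        # Assumed convention: reading the sheet left-to-right or right-to-left
        # gives the same topology, so keep only the reading with strand 1
        # further left (ties, possible for odd n, are broken lexicographically).
        if (order.index(1), order) > (mirrored.index(1), mirrored):
            continue
        # Assumed convention: the first direction is fixed as "up"; every
        # remaining strand may point up or down.
        for rest in product("↑↓", repeat=n_strands - 1):
            label = "".join(str(s) for s in order)
            topologies.append((label, "↑" + "".join(rest)))
    return topologies


four_stranded = enumerate_topologies(4)
print(len(four_stranded))  # 12 strand orders x 8 direction patterns
```

Under these conventions the labels quoted in the text, such as 1342 -↑↓↓↓ and 1234 -↑↓↑↓, appear in the list; classifying each pattern as frustrated or frustration-free would additionally require the rules of Fig. 2, which are not encoded here.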

De novo design of all the predicted novel 4-stranded αβ-folds
To critically test whether the novel αβ-folds we predicted can be created, we performed de novo design of αβ-fold proteins with all eight predicted novel 4-stranded β-sheet topologies (Fig. 4a, b). The folds were named NF1 to NF8 in order of observation frequency. The folds from NF1 to NF4 are scarcely observed, and the remaining ones from NF5 to NF8 have never been observed in nature (NF6-8 have been reported as unobserved ones 21). We sought to design these novel αβ-folds with ideal and simple structures, in which the secondary structures do not have β-bulges or α-helix kinks, and the X region in para-β-X-β motifs is an α-helix. For each novel αβ-fold, we built a backbone blueprint, in which secondary structure lengths and loop ABEGO torsion patterns are specified using the backbone design rules 9,10 so that the target fold is favored (Fig. 4b). For NF1, 3, 4, 5, and 7, α-helices were appended at the termini to make sufficiently large hydrophobic cores. For NF5, 6, and 7, the X region in the anti-β-X-β motifs was built with an α-turn motif 22, not just with a single loop, for the same reason; especially for NF7, 'AAAB' loops for βα-connections with the right twist angle (Extended Data Fig. 9) and 'BA' loops for αβ-connections (Extended Data Fig. 8) were adopted to pack the two α-turns together.
For NF8, the knot-forming fold, two backbone blueprints were built using different torsion types for the loop immediately before the last strand (Extended Data Fig. 4). Next, for each blueprint, we built a backbone structure, which was obtained by averaging over several hundred backbone structures generated by Rosetta fragment assembly simulations 23,24 (Fig. 4c, see Methods for details). The backbone structures were confirmed as novel by database analysis using TM-align 25 and MICAN 26,27, and by visual inspection with the TOPS diagram 28 (Extended Data Fig. 5). Subsequently, we carried out Rosetta design to build sidechains on each of the generated backbone structures 8,29 (see Methods for details). Designs with low energy, tight core packing, and high compatibility between local sequence and structure 9 were selected, and their energy landscapes were explored by Rosetta ab initio structure prediction simulations 23. Finally, designs with amino acid sequences exhibiting funnel-shaped energy landscapes toward the designed structure were characterized experimentally (Fig. 4d).

Experimental characterization
We obtained synthetic genes encoding 16 designs for NF1, 4 each for NF2 and NF3, 6 each for NF4-7, and 12 for NF8 (6 for each of the two different blueprints) (all these sequences are described in Supplementary Tables 1-8). None of the sequences showed clear homology to any known protein (all designs have BLAST E-values > 10⁻³ against the NCBI nr database of non-redundant protein sequences). The proteins were expressed in E. coli and purified by a Ni-NTA affinity column. Across all target folds, 56 of 60 designed proteins were well expressed and soluble. These were then characterized by circular dichroism (CD) spectroscopy, size-exclusion chromatography combined with multi-angle light scattering (SEC-MALS), and ¹H-¹⁵N heteronuclear single quantum coherence (HSQC) NMR spectroscopy. The experimental results for all designs for all target folds are summarized in Extended Data Table 1. The success rate of the designs, including the knotted fold, was as high as those in previous de novo designs of folds widely existing in nature (28 of 60 designs were characterized as foldable proteins) [9][10][11][12][13]30. For each target fold, one monomeric design with a CD spectrum characteristic of αβ-proteins and the expected number of well-dispersed sharp NMR peaks was selected for NMR structure determination (Fig. 4e-g). The NMR structures solved by using MagRO-NMRViewJ 31,32 were in close agreement with the computational design models, with the correct β-sheet topologies (Fig. 5; the root mean square deviation (RMSD) values for backbone heavy atoms ranged from 1.4 to 2.0 Å; see Extended Data Table 2 for NMR constraints and structure statistics). These results demonstrated that all the novel αβ-folds predicted by our rules can indeed be created. Remarkably, we succeeded in designing the smallest knotted structure, NF8, consisting of only 4 strands (Extended Data Fig. 6).

Prediction of novel αβ-folds with five- to eight-stranded β-sheets
The success in de novo design of all eight predicted novel αβ-folds demonstrated the ability of the set of rules to predict novel αβ-folds. We then revisited the number of frustration-free unobserved β-sheet topologies with five- to eight-stranded β-sheets, shown in Fig. 3a (for 3-stranded αβ-proteins, all ten frustration-free β-sheet topologies have been observed in nature). As the number of constituent β-strands in a β-sheet increases, the number of frustration-free unobserved topologies increases exponentially, and the ratio of unobserved topologies among the frustration-free ones also increases. The prediction indicates that an enormous number of frustration-free (i.e., possible) αβ-folds have been left unobserved in nature; the number (~12,000) far exceeds that of the αβ-folds observed in nature (400). Note that since we only investigated novel folds that are identified by our set of rules, the predicted number corresponds to a lower limit on the number of novel folds. There must be more novel folds that are not identified by the rules but are accessible to polypeptide chains.
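To give a feel for how quickly the topology space grows with strand number, the total number of open β-sheet topologies can be counted in closed form. This is a sketch under an assumed counting convention (n!/2 strand orders after merging mirror-image readings, times 2^(n-1) strand-direction patterns); it reproduces the 96 patterns quoted for four strands but counts all topologies, frustrated or not, so the values are upper bounds on the frustration-free numbers in Fig. 3a rather than a reproduction of them. The function name `count_topologies` is hypothetical.

```python
from math import factorial


def count_topologies(n_strands):
    """Total open-sheet topologies under the assumed counting convention."""
    # Assumption (not stated in the text): mirror-image strand orders are
    # merged (n!/2), and one strand direction is fixed (2**(n-1) patterns).
    strand_orders = factorial(n_strands) // 2
    direction_patterns = 2 ** (n_strands - 1)
    return strand_orders * direction_patterns


for n in range(3, 9):
    print(n, count_topologies(n))
```

The factorial term dominates, so each added strand multiplies the space by 2n; the frustration-free subset identified by the rules is necessarily smaller than these totals.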

Discussion
How large the protein fold space accessible to the polypeptide chain is has been unknown. We systematically investigated the unexplored fold space by introducing a set of rules to predict novel αβ-folds, and by carrying out de novo design of all the predicted novel αβ-folds with a four-stranded β-sheet. As a result, we found that all the predicted novel αβ-folds, including a knotted fold, can be created. Remarkably, the design success rate was comparable to that of previous de novo designs with naturally occurring folds, and the thermal stability of the designs was as high as previous The number of predicted novel αβ-folds, which is at the lower limit of that of novel αβ-folds, far exceeds that of the folds observed in nature. Moreover, the novel αβ-folds include knot-forming ones. Recently, functional proteins have been designed de novo 14,[34][35][36][37][38][39][40][41][42][43][44]. Our predicted novel αβ-folds provide an enormous scaffold set for designing protein structures with desired functions. We are at a great starting point for exploring the universe of protein structures beyond natural evolution.

Structure dataset of naturally occurring proteins
For the derivation of a set of rules for the β-sheet topology (Fig. 2), a dataset containing 12,595 chains obtained from the cullpdb database (date: 12/13/2018) 45 with more than 40 residues, sequence identity < 25%, resolution < 2.5 Å, and R-factor < 1.0 was used. For the analysis of β-sheet topologies of naturally occurring protein structures (Fig. 3), a dataset containing 65,371 domains obtained from a semi-manually curated domain database, ECOD 46 (this database provides a hierarchical grouping of evolutionarily related domains), with more than 40 residues and sequence identity < 99% was used. Structure refinements were carried out by ModRefiner 47 for all obtained structures, and then the secondary structures were assigned by STRIDE 48; when the Cα RMSD of a refined structure against the original structure was greater than 1.0 Å, the refined structure was discarded and the original one was used.

Analysis of β-sheet topologies in naturally occurring proteins
β-sheet topologies were defined for open β-sheets included in the protein domains obtained from ECOD, with the following criteria: 1) the lengths of constituent β-strands are more than two residues, 2) the number of β-strands is at least three, 3) two neighboring β-strands have at least two main-chain hydrogen bonds between the β-strands, and 4) no insertion along a sequence by any β-strands belonging to another β-sheet consisting of more than two β-strands (Extended Data Fig. 7).
Frequencies of all β-sheet topologies in nature were studied using the ECOD database, in which protein domains are classified according to their evolutionary relationships. In the database, the two categories, Family and Homology, are defined. Family represents a group of evolutionarily related protein domains, identified by substantial sequence similarity, and Homology represents a group comprising multiple Family groups, of which evolutionary relationships are inferred on the basis of functional and structural similarities (Homology is equivalent to the superfamily in the SCOP 2 or CATH 3 structure databases). To study the observation frequency for each β-sheet topology, we counted the number of Homology groups having the topology, with the following consideration.
We first examined the occupation ratio of the topology in the i-th Homology group:

$$R_{\mathrm{Homology}}(i) = \frac{1}{N_{\mathrm{Family}}} \sum_{j=1}^{N_{\mathrm{Family}}} R_{\mathrm{Family}}(j),$$

where $N_{\mathrm{Family}}$ is the total number of Family groups belonging to the Homology group, and $R_{\mathrm{Family}}(j)$ is the ratio of protein domains having the β-sheet topology in the j-th Family group. Thus, when all domains in the Homology group contain the β-sheet topology, the occupation ratio of the Homology group is one; otherwise it is less than one. Finally, the observation frequency for each topology is calculated as the sum of the occupation ratios across Homology groups,

$$F = \sum_{i} R_{\mathrm{Homology}}(i).$$

For four-stranded β-sheet proteins, we manually checked all structures having topologies with observation frequencies less than 1.0, and then changed the β-sheet assignments for some of the structures: the β-sheets included in e3hy2X1, e1xw3A1, e2hwjA2, e4rsfA1, and e1tocR2 were identified as β-barrels, and those of e4rgzA1, e2bjjX3, e1iejA2, e3s9lC3, e1blfA4, and e2d3iA3 were identified as 6-stranded β-sheets.
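The observation-frequency calculation described above can be sketched in a few lines. This is a minimal illustration, assuming the occupation ratio of a Homology group is the mean of the per-Family ratios (the reading suggested by the definitions of NFamily and RFamily(j)); the function names and the dictionary layout are hypothetical, and the example values are made up.

```python
def occupation_ratio(family_ratios):
    """Mean over Family groups of the fraction of domains with the topology."""
    if not family_ratios:
        raise ValueError("a Homology group needs at least one Family group")
    return sum(family_ratios) / len(family_ratios)


def observation_frequency(homology_groups):
    """Sum of occupation ratios across Homology groups.

    homology_groups: hypothetical mapping {homology_id: [R_Family(j), ...]}.
    """
    return sum(occupation_ratio(ratios) for ratios in homology_groups.values())


# Made-up example: H1 is fully occupied, H2 partially, H3 not at all.
example = {"H1": [1.0, 1.0], "H2": [0.5, 0.0], "H3": [0.0]}
print(observation_frequency(example))
```

Averaging within a Homology group before summing means that a topology found in a single large, well-sampled group contributes at most one to the frequency, so the measure reflects how many evolutionarily independent groups adopt it.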

Backbone construction
We built a backbone blueprint for each novel αβ-fold. For the X region in para-β-X-β motifs, a helix was built. The lengths of secondary structures and the ABEGO torsion patterns for the connecting loops were obtained from the design rules described in previous papers 10. For NF1, 3, 4, 5, and 7, α-helices were appended to their termini to make a sufficiently large hydrophobic core between the β-sheet and the α-helices. For the same purpose, for the X region in antiparallel β-X-β motifs in NF5, 6, and 7, an α-turn structure consisting of a helix-loop-helix unit was built. β-strand lengths were chosen from four to seven residues, and α-helix lengths from eleven to seventeen residues. The torsion ABEGO patterns of loops were determined by referring to the previous paper 10

Sequence design
We performed RosettaDesign calculations 51 with the full-atom Talaris2014 49 scoring function to design side-chains (amino acid sequences) that stabilize each generated backbone structure. The design calculation consists of the following three steps. 1) Several cycles of amino-acid sequence optimization with a fixed backbone and the following backbone relaxation. 2) Mutations of buried polar residues to hydrophobic ones, followed by the entire structure optimization. 3) Mutations of solvent-exposed hydrophobic residues to polar residues, followed by the entire structure optimization. Amino-acid residue types to be used for the design of each residue position except for that of loop regions were restricted based on the secondary structure of the position and the buriedness calculated using virtual amino acids. For the design of each loop region (the residues in the loop and the preceding and following three residues), amino acid types were restricted on the basis of the consensus amino acids obtained from the sequence profile for naturally occurring protein structure fragments, which were collected using the following criteria, 1) identical secondary structure and ABEGO torsion pattern to the loop region, and 2) RMSD against the loop structure is lower than 2.0 Å. Through the RosettaDesign calculations, up to 40,000 designs were generated for each design target structure.
The designed sequences were then filtered based on the Rosetta total energy, the RosettaHoles score 52 < 2.0, and the packstat score 52 > 0.55 for the target NF2 and > 0.6 for others. Furthermore, we filtered the designs on the basis of the local sequence-structure compatibility 53. We collected 200 fragments for each nine-residue frame in each designed sequence from a non-redundant set of experimental structures, based on the sequence similarity and secondary structure prediction.
Subsequently, for each frame, we calculated the Cα RMSD of the local structure against each of the 200 fragments. Designs were ranked according to the summation of the log-ratio of the fragments for which the RMSD was less than 1.5 Å, across all nine-residue frames, and those with high values were selected.
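The filtering and ranking steps above can be sketched as follows. This is an illustration only, not the authors' script: the packing thresholds are those quoted in the text, the total-energy filter is omitted because no cutoff is given, and the compatibility score is one plausible reading of "log-ratio", namely the log of the fraction of fragments within 1.5 Å summed over frames. The pseudocount and all function names are assumptions.

```python
import math


def passes_packing_filters(holes_score, packstat_score, target):
    """Apply the quoted RosettaHoles and packstat thresholds."""
    # packstat must exceed 0.55 for NF2 and 0.6 for every other target.
    packstat_cutoff = 0.55 if target == "NF2" else 0.6
    return holes_score < 2.0 and packstat_score > packstat_cutoff


def local_compatibility(frame_rmsds, cutoff=1.5, pseudocount=1.0):
    """Sum over frames of log(fraction of fragments with RMSD < cutoff).

    frame_rmsds: one list of fragment Cα RMSDs (in Å) per nine-residue frame.
    """
    score = 0.0
    for rmsds in frame_rmsds:
        n_close = sum(1 for r in rmsds if r < cutoff)
        # Assumption: a pseudocount keeps log() finite when no fragment is close.
        score += math.log((n_close + pseudocount) / (len(rmsds) + pseudocount))
    return score


# Made-up example: two frames with 200 fragments each.
frames = [[1.0] * 200, [1.0] * 100 + [3.0] * 100]
print(passes_packing_filters(1.5, 0.58, "NF2"), local_compatibility(frames))
```

Higher (less negative) scores indicate that more of the fragments picked for a local sequence resemble the designed local structure, which is the property the ranking is meant to reward.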

Protein expression and purification
A spacer was added at the C-terminus of each designed sequence ('GSWS' for the sequences that have neither a TRP residue nor more than two TYR residues and 'GS' for others) to separate the designed region from the C-terminal 6xHis-tag. The genes encoding the designed sequences, which were cloned into pET21b expression vectors, were synthesized by Eurofins Genomics (Tokyo, Japan). The designed proteins were expressed in E. coli BL21 Star (DE3) cells (Invitrogen) as uniformly (U-) ¹⁵N-labeled proteins using MJ9 minimal media 54, which contain ¹⁵N ammonium sulfate as the sole nitrogen source and ¹²C glucose as the sole carbon source. The expressed proteins with a C-terminal 6xHis-tag were purified through a Ni-NTA affinity column.
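The spacer rule stated above can be written as a small helper. This is a sketch of one reading of the rule ('GSWS' when the sequence contains no Trp and at most two Tyr, 'GS' otherwise); the function names are hypothetical, the example sequences are made up, and the His-tag itself is not appended because its exact construct context is not given in the text.

```python
def choose_spacer(sequence):
    """Return the C-terminal spacer for a designed one-letter sequence."""
    seq = sequence.upper()
    # 'GSWS' for sequences with neither a Trp (W) nor more than two Tyr (Y).
    no_trp_and_few_tyr = seq.count("W") == 0 and seq.count("Y") <= 2
    return "GSWS" if no_trp_and_few_tyr else "GS"


def with_spacer(sequence):
    """Designed sequence followed by its spacer (tag not included)."""
    return sequence + choose_spacer(sequence)


print(with_spacer("MKYYAL"))  # no W, two Y
print(with_spacer("MKWAL"))   # contains W
```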
The purified proteins were then dialyzed against typical PBS buffer, 137 mM NaCl, 2.7 mM KCl,

Sample preparation
For the NMR structure determination of the eight selected designs, the uniformly isotope-labeled [U-¹⁵N, U-¹³C] proteins were expressed using the same method as described above except 13

NMR signal assignments
All NMR signals were identified by using MagRO-NMRViewJ (upgraded version of Kujira 31 ) in a fully automated manner, then noise peaks were filtered by deep-learning methods using Fit_Robot 32 . FLYA module was used for fully automated signal assignments and structure calculation 57 to obtain roughly assigned chemical shifts (Acs) then the trustful ones were selected into MagRO Acs

Structure calculation
Several CYANA 59 calculations were performed using the Acs table, NOE peak table, and dihedral angle constraints. After the CYANA calculations, several dihedral angle constraints derived from TALOS+ revealing large violations for nearly all models in the structure ensemble were eliminated.
After the averaged target function of the ensemble reached less than 2.0 Å², refinement calculations by Amber12 were carried out for 20 models with the lowest target functions.

NMR structure validation
The RMSD values were calculated for the 20 structures overlaid to the mean coordinates for the ordered regions, automatically identified by Fit_Robot using multi-dimensional non-linear scaling 60. The RDC back-calculation was performed by PALES 61 using experimentally determined values of RDC. The averaged correlation between the simulated and experimental values was obtained using the signals except for the residues on overlapped regions in ¹H-¹⁵N HSQC, and for the residues predicted to have low order parameters (S² < 0.8) by TALOS+. The detailed methods and results are described in Extended Data Table 2 and Supplementary Document.

Fig. 2 | Rules for β-sheet topology. a, Connection jump-distance rule. Para-β-X-β and anti-β-X-β motifs are illustrated. The number of intervening strands between the two β-strands in para-β-X-β motifs is mostly less than four, and that for anti-β-X-β motifs is less than two; an exception is the anti-β-X-β motifs with the number of two, included in the Greek-key β-sheet topology and its circular permutations (the dotted box in the right histogram) (see topologies with an asterisk in Fig. 3d). The same preferences are reported in the previous study 16. We revisited them using the current PDB. b, Connection overlap rule. The preferred and not-preferred β-sheet topologies for three types of pairs of β-X-β motifs (para-para-, anti-anti-, and para-anti-β-X-β motifs) are illustrated. The D-type β-sheet topologies (loops are located on different sides of each other) are more frequently observed than the S-type ones (loops on the same side). Similar rules have been known for para-para-β-X-β motifs 62. For anti-anti-β-X-β motifs, there has been a rule called "pretzels" 21, but this rule prohibits both S- and D-types without distinction. c, Connection ending rule. The two types of β-sheet topologies for pairs of para-β-X-β motifs are illustrated, in which the second strands of the two motifs are adjacent and aligned in parallel. The S-type β-sheet topologies are more preferred than the D-type topologies.

and other knot types (Other), respectively. The design NF8 with the R-Trefoil knot, indicated by an arrow, is characterized as the smallest knotted protein with 79 residues. Note that this is an exceptional case for the R-Trefoil knot structures; the minimal size observed in nature is about 140 residues (for the L-Trefoil structure, the smallest one has 82 residues, PDB ID: 2EFV).