DaReUS-Loop: accurate loop modeling using fragments from remote or unrelated proteins

Despite efforts during the past decades, loop modeling remains a difficult part of protein structure modeling. Several approaches have been developed in the framework of crystal structures. However, for homology models, the modeling of loops is still far from being solved. We propose DaReUS-Loop, a data-based approach that identifies loop candidates mining the complete set of experimental structures available in the Protein Data Bank. Candidate filtering relies on local conformation profile-profile comparison, together with physico-chemical scoring. Applied to three different template-based test sets, DaReUS-Loop shows significant increase in the number of high-accuracy loops, and significant enhancement for modeling long loops. A special advantage is that our method proposes a prediction confidence score that correlates well with the expected accuracy of the loops. Strikingly, over 50% of successful loop models are derived from unrelated proteins, indicating that fragments under similar constraints tend to adopt similar structure, beyond mere homology.

that concurrently employs the strength of different energy components, considering short-range, hydrophobic and electrostatic interactions.
Data-based methods are dependent on the geometry of flanking residues and the database used for mining candidates 40 . Flanks are regions before and after the loop to be modeled. For the completion of crystal structures, these methods are shown to generate successful results when similar fragments to the loop of interest exist in the database 41 . ArchPRED 42 considers the secondary structures flanking the missing loop, their relative orientation and the number of missing residues to identify candidate loop conformations. FREAD 43 searches for candidate fragments matching conditions on distances between C α of the flanks. LoopIng 37 is based on Random Forest model and considers sequence and geometry related features to select the candidates. SuperLooper2 44 mines the Loop In Protein (LIP) database 45 , a comprehensive loop database containing all protein segments up to 35 residues from the PDB, to identify fragments matching geometrical criteria between the two last atoms of the main chain of one flank and the two first of the other.
Hybrid loop modeling methods combine ab initio and data-based methods to improve the quality of loop predictions. CODA generates a consensus loop prediction using both ab initio and data-based methods independently 40 . Similar approaches are considered by others to predict complementary determining region (CDR) of antibodies 46,47 . Another recent method is Sphinx, which first performs data-based search to find fragments shorter than the loop of interest and obtains structural informations 41 . Then it applies ab initio methods to generate fragments of correct length.
Most of the existing loop modeling methods are shown to perform successful loop predictions in high-resolution crystal structures with accuracies of about 1-2Å, if the loop is short (3-12 residues) [32][33][34]37,41,43,44 and increasing up to ~4Å for larger sizes (≤20 amino acids) 37,41,43,44 . However, in practical applications, loops of interest are typically non-homologous regions of a homologous template. For instance, data-based methods perform the search considering flank residues. In high-resolution crystal structures, these flanks are perfect. In contrast, flanks derived from homologous templates might represent very large root-mean-square deviations (RMSD) to the native flanks. Very few studies have tackled method assessment in such perturbed situations and their accuracies are about 1-4Å for short loops (3-12 residues) 32,37,43 but decrease significantly (4-9Å) for larger sizes (13-15 amino acids) 43 .
Another challenging, yet unsolved problem is the prediction of long loops: many of existing loop modeling methods have been designed to predict loops of at most 12 residues.
We previously introduced a fast and efficient approach to mine large collections of structures using a Binet-Cauchy kernel, to search for similar fragments without gaps 48 . It was extended to the search for loop candidate given loop flanks, BCLoopSearch 49 . However, according to our early tests, the following bottlenecks need to be tackled. First, to propose a strategy to prune the possibly very large number of candidates. Next, despite the fact that Binet-Cauchy kernel can tolerate some distortion, a sub-optimal geometry of the flanks can lead to failures in returning the right loop conformation. Finally, the accurate scoring of the loops is still an issue.
In this study we propose DaReUS-Loop (Data-based approach using Remote or Unrelated Structures for Loop modeling). DaReUS-Loop tackles the practical application of loop modeling in non-ideal conditions. Considering the flanks, we mine the entire set of protein entries in the PDB and extract similar fragments. Then we prune the set of candidates considering their sequence similarity and conformational profile. Finally, we build complete protein models and rank them. Our scoring schema provides us with a final set of 10 best models.
We evaluated our method on three challenging template-based test sets: CASP11, CASP12 and HOMSTRAD. The large number of results with RMSD less than 2Å suggests the accuracy of our method predicting loops in a homology modeling context. To assess the quality of the results, we compared our approach with two state-of-the-art ab initio methods, Rosetta NGK and GalaxyLoop-PS2, one data-based method, LoopIng and Sphinx, that is a hybrid method. Comparisons represent that our protocol performs equally or better than those other methods. In addition, DaReUS-Loop outperforms the other approaches to predict long loops of at least 15 residues. A special advantage is that our method proposes a prediction confidence index that correlates well with the expected accuracy of the loops. The computing time of our method is substantially less than Rosetta NGK, GalaxyLoop-PS2 and Sphinx. Strikingly, almost all successful loop models are derived from unrelated proteins, indicating that fragments under similar constraints tend to adopt similar structure, beyond mere homology. Figure 1 summarizes the workflow of our approach. Given the input of a gapped structure (PDB format) and the complete sequence to model, a first step is to identify loop candidates from the loop flanks using BCLoopSearch, mining a set of PDB structures. Due to the possibly very large number of candidates, clustering and filtering are applied to reduce the number of candidates. Three types of filters involve loop sequence similarity, local geometry and conformational profile comparison. Finally, models are built and the 10 best scored models are returned.

Results
Effects of the filtering. In this section we report the effect of filtering over the set of all loop candidates retrieved from our dataset for CASP11 test set. The distribution of sequence identity (BLOSUM scores) with respect to loop local RMSD are shown in Fig. 2a. 36% of the candidates have positive BLOSUM scores and 62% of them have local RMSDs of less than 4Å. In total, this step makes the fraction of fragments with RMSDs less than 4Å increase from 49% before filtering up to 62%. Figure 2b depicts the impact of clustering. As expected, it results in a drastic decrease of the number of candidates. It also comes with a slight improvement in terms of the RMSDs. The mean (resp. median) RMSD is of 3.86 (resp. 3.60)Å before clustering and of 3.60 (resp. 3.24)Å after. As an outcome, 70% of the candidates selected have a RMSD value <4Å.   Quality of the predictions. We compared DaReUS-Loop to two state-of-the-art ab initio methods, Rosetta NGK and GalaxyLoop-PS2, one data-based method, LoopIng and a hybrid method, Sphinx on the common sub-set of loops that could be predicted by all the methods (Common ai and Common db , respectively). Overall statistics on the best of top 10 models are shown in Table 1; more detailed results including per-model local and global RMSDs are reported in Table 2. On average, the DaReUS-Loop protocol outperforms Rosetta NGK and GalaxyLoop-PS2 by at least 0.25, 0.36, and 0.33Å, for the CASP11, CASP12, and HOMSTRAD benchmark sets, respectively. Apart from HOMSTRAD, one also notes that the RMSDs are rather close to the best possible values for the CASP11 and CASP12 sets, with a loss of only 0.40 and 0.56Å, respectively. A larger deviation of 0.73Å is observed for the HOMSTRAD set. Looking at the comparisons with data-based methods (Common db set) for DaReUS-Loop, one observes an increase of the flanked RMSD values, i.e. 0.21, 0.34, and 0.15Å for CASP11, CASP12, and HOMSTRAD, respectively compared to the values obtained for the Common ai subset. This results from reducing flank size to only 2 amino acids per loop end, instead of 4. Moreover, DaReUS-Loop outperforms LoopIng for all sets, with a gain of at least 1Å in all cases. Finally, DaReUS-Loop outperforms Sphinx by at least 0.70Å for the CASP11 and CASP12 test sets, while only a slight improvement is observed for the HOMSTRAD test set. In addition, we report the average flanked RMSD values, while selecting the top 10 models using either JSD or DOPE in Table 2. We observed that both scores result in rather similar predictions, however considering the two together, brings improvements.
Considering the performance using only the top models, since DaReUS-Loop is based on both JSD and DOPE, we selected for each loop the top sccoring models by DOPE and the top model scored by JSD, and chose the best out of the two. To keep the comparison fair, we compared our results with the best of top 2 predicted by Rosetta-NGK and Sphinx and results are reported in Supplementary Table S1 -the other methods (GalaxyLoop-PS2 and LoopIng) do not provide the scores of the models. The results show that DaReUS-Loop performs better than Rosetta-NGK and Sphinx in almost all the cases, the only exception being for the HOMSTRAD test set, where Sphinx performs slightly better than DaReUS-Loop -note that the loops of the HOMSTRAD set are, on average shorter than those of the CASP11 and CASP12 sets.

Prediction confidence index.
We now turn to analyzing whether a prediction confidence could be assigned based on the min(JSD) score, which indicates the best fit of any candidate loop in terms of conformational profile. Figure 3 shows a clear trend that lower min(JSD) values are associated with lower RMSDs, with a Spearman correlation of 0.76. From the figure one also observes a clear jump in the range of RMSD values between min(-JSD) of 0.20 and 0.25, and for JSD values more than 0.20, the quality of the correlation appears degraded. This analysis suggests that min(JSD) can be considered as a measure to assess the overall case-by-case loop modeling quality and to detect failures of our protocol. Therefore, for each of the three datasets, a high-confidence subset was selected (CommonHC), discarding any loop target for which the min(JSD) is more than 0.20 (14 loops in CASP11 and 16 loops in CASP12 test sets) Table 1. For the HOMSTRAD set, all loops of the Common subset meet the condition of a JSD less than 0.20, and the results are unchanged. For the CASP11 and CASP12 sets, one clearly sees a decrease of the average RMSDs by more than 0.55Å, and the values appear closer to that obtained for HOMSTRAD. The performance of DaReUS-Loop compared to other methods (Rosetta NGK, GalaxyLoop-PS2, LoopIng and Sphinx) remains almost unaffected.    Loop candidates are selected from remote or unrelated proteins. Figure 6 shows the distribution of the sequence identity between the proteins in which the candidates are selected and the target proteins. For 58% (79 out of 135) of the cases, loop candidates come from proteins with a sequence identity of at most 10%. Considering a sequence identity of at most 20%, this number increases up to 71% (97/135). Only 6% (8/135) of the loop candidates are selected from protein chains with more than 50% sequence identity. We have also analyzed homology in terms of Class Architecture Topology Homology (CATH) classification 5

Discussion
Here, we propose DaReUS-Loop, a data-based approach that identifies loop candidates from remote or unrelated proteins. DaReUS-Loop is able to mine the complete PDB, employing filters based on sequence similarity, clustering, conformational profiles (based on a structural alphabet) and local geometry to narrow down the candidates. A combination of conformational profiles and atomic-distance-dependent potential (DOPE) is then used to select the best candidates. DaReUS-Loop is specifically designed for loop modeling of structures modeled from homologous templates, when no crystal structure is available. We tested DaReUS-Loop on three challenging template-based test sets and compared the results with the state-of-the-art ab initio and data-based loop modeling methods. We also verified that the loops in our benchmarks correspond to surface-exposed loops (see Methods). Results suggest that DaReUS-Loop improves the accuracy of template-based loop prediction by  . It has to be stated that in rare cases the number of possible loop candidates might be very large (several hundreds of thousands), consequently this leads to proportional increase in the computational time. Such increase is mostly due to the computations of MODELLER. It has to be mentioned that we pre-computed the local conformation profiles for all the protein chains in our structure dataset, otherwise the computational cost of this step is 3 minutes for every candidate. LoopIng webserver is very fast and modeling a loop costs on average 1 minute. Whereas several days are needed for Rosetta NGK to generate 500 models, depending on the size of the loop and protein (CPU-time: 120-1200 hours). The computational time of GalaxyLoop-PS2 varies between 1 to 4 hours (CPU-time: 8-32 hours) to generate 5 candidates using GalaxyWEB, depending on the size of loop and protein. The performance of Sphinx web-server depends on the length of the loop to be modeled and varies between 20 minutes up to several hours for long loops.
Until now, very few studies have considered loop modeling of template-based models, which highlights the difficulty of the task. While assessing Looping, the authors reported very little performance differences between modeling native and template-based loops of CASP10 37 , which might be explained by (i) the short length of the studied loops (between 4 and 8 residues), (ii) quality of the models and (iii) considering the best results for the evaluations. Park et al. evaluated their method (GalaxyLoop-PS2) in different environmental conditions (crystal structure, side-chain perturbed, backbone perturbed and template-based models) and results demonstrated far less accuracy in the case of large environmental errors 32 . Rather similar observations are reported in 43 to compare the results of loop modeling on CASP 7 and 8, using template-based models versus crystal structures.
A special advantage is that DaReUS-Loop comes with a prediction confidence score that correlates well with the expected accuracy of the loops. This score, based on the best fit in terms of conformational profile, enables us to decide if the modeling procedure was successful or not, bringing some insight about the quality of the final model. In particular, all high-quality and medium-quality loops modeled by DaReUS-Loop belonged to the high-confidence subset. Moreover, for the high-confidence subset, long loops (≥15 residues) modeled by DaReUS-Loop tend to be more accurate compared to other methods. Modeling long loops has been an unsolved problem, most existing approaches dealing with loops of at most 12 residues. Our protocol tackles this problem and improves the accuracy of modeling long loops, as long as high-confidence loop candidates are available from the database.
For the CASP test sets, we extended the gaps to regions between two secondary structures. Such extension can bring two negative consequences: (i) the loop gets longer (and therefore harder) and (ii) it decreases the chances to find a high-confidence loop candidate. However, the results showed that DaReUS-Loop models long loops with higher accuracy compared to the other methods. On the other hand, we were able to find high-confidence loop candidates in 82% (135/165) of the cases.
Another striking result is that almost all successful loop models are derived from proteins where the homology is remote at best, with low sequence identities and considerable differences in structural classification. In fact, most successful loop models are derived from completely unrelated proteins, with no detectable homology in sequence or structure. The loops themselves have a higher sequence identity, which is expected given our filtering procedure. However, even so, the sequence identities remain quite low, and it is the constraints imposed by the conformational profile (based on the structural alphabet) and by the chemical environment (as measured by the DOPE score) that are the driving force for the selection of the final models. Thus, our results indicate that fragments under similar constraints tend to adopt similar structure, even in the absence of any detectable homology.

Methods
Structure Database. Our database to search for loop candidates consists of the entire set of protein structures available in the Protein Data Bank (PDB). In March 2017, it consisted of 123,417 PDB entries, corresponding to 338,613 chains in total. Each chain was split into segments that correspond to consecutive regions separated by gaps or non-standard residues, but accepting seleno-methionines. This led to a database of 758,143 protein segments.

Template-based test sets.
To assess the performance of our approach, we have used three test sets. The first one (HOMSTRAD) was taken from the study by 32 . It consists of 23 loops with sizes between 6 and 11 residues. The two other ones correspond to the targets of the CASP11 (http://predictioncenter.org/casp11/) and CASP12 (http://predictioncenter.org/casp12/) experiments 51,52 . For each CASP target, templates were identified using HHsearch 53 against the PDB70 database (02-04-2016), considering a maximum sequence identity cutoff of 50% between template and target. In case of multiple, non-overlapping templates, they were combined into a template set. For each target, the template set was aligned to the target using TM-align 54 , and the template set with the highest TM-score was selected. Only targets where this template set had a TM − score > 0.5 were retained. This resulted in 12 targets of CASP 11 (out of 46 targets) and 10 targets of CASP 12 (out of 34 targets). For each target, one model was built by MODELLER 7 using the best template set, with the alignment from TM-align. Then, loops Scientific REPORtS | (2018) 8:13673 | DOI:10.1038/s41598-018-32079-w were identified as regions of 5 to 30 residues connecting secondary structures of at least 4 residues, as defined by DSSP 55 . Loops that correspond to chain breaks in the experimental structure were excluded. This resulted in a collection of 69 loops and 76 loops for the CASP11 and the CASP12 set, respectively.
The average RMSD of the flanks of the template structure compared to that of the experimental structure of the target is of 0.97Å, 1.04Å and 0.93Å for the CASP11, CASP12 and HOMSTRAD sets, respectively. Loop sizes are between 5-29, 5-28 and 5-11 amino acids for the CASP11, CASP12 and HOMSTRAD test sets, respectively.
Loop candidate search. We previously introduced the BCLoopSearch protocol, to mine large protein structure datasets and retrieve loop candidates, given two disjoint fragments (loop flanks) 49 . It is based on a Binet-Cauchy (BC) kernel and a Rigidity score: where X and Y are C α coordinates of the flanks and dataset fragments, respectively and they are centered at the origin. Note that a BC score of 1 indicates a perfect match. Rigidity score R(X, Y) is defined as: where X i and Y i are C α coordinates of the ith residues of the flanks and dataset fragments and ||⋅|| is the euclidean norm. Rigidity score is the maximum variation of intra-distances between: (i) residues and geometric center and (ii) intra-distances between terminal C α . In addition, we also measured the RMSD between query and candidate flanks for the fragments returned.
In total, four cut-offs values related to (i) flank size, (ii) flank BC score, (iii) flank Rigidity and (iv) flank RMSD, have been considered to limit the number of loop candidates. In this study we used: a flank size of 4 residues, Rigidity ≤ 3 and flank RMSD ≤ 4Å. The minimal flank BC score cut-off was set depending on the size of the loop to be modeled: 0.9 for loops of at most 8 residues and 0.8 for longer loops.
For each target protein, prior to the loop modeling homologous proteins with more than 70% chain sequence identity were excluded from our search database.
Candidate filtering. In most cases the number of candidates returned by BCLoopSearch is too large to be tractable, which implies to limit their number. Three filters were sequentially applied in our protocol to this aim: Sequence similarity. The sequence similarity of a loop candidate with the query loop sequence using BLOSUM62 score. Candidates with negative scores were discarded.
Geometrical clustering. We used the python Numpy library to measure the pairwise distances (RMSD) between all the candidates 56 . In addition, we used the python Scipy package to perform hierarchical clustering 57 . A RMSD cut-off of 1Å was used to group similar loop candidates. To consider memory constraints, we applied an iterative clustering over subsets of 25,000 candidates, until at most 25,000 clusters were obtained. Finally, one representative loop candidate with the highest sequence similarity to the query loop was selected for each cluster. The computational time of our clustering protocol is in the range of 1-5 minutes, however it depends directly on the number of candidates detected by BCLoopSearch. In extreme cases, the needed time may increase up to 10-15 minutes.
Local conformation. Previously, Shen et al. have shown that local conformation profiles predicted from sequence and profile-profile comparison can be employed to accurately distinguish similar structural fragments 58 . Consequently, we pre-computed a collection of profiles for all the protein chains in the structure dataset, and for all proteins of the test sets. For each loop candidate, it is thus possible to extract the sub-profiles P and Q, corresponding to the query and candidate loop, and to measure the Jensen Shannon divergence (JS(P, Q)) between these profiles:

KL KL
where M corresponds to 1/2(P + Q) and D KL is the Kullback-Leibler divergence: where P i and Q j are the two profiles corresponding to positions 1 to L on the query and candidate loop sequences. Note that a JSD of 0 indicates a perfect identity of the profiles. This procedure was applied on each loop candidate and those with a JSD > 0.40 were discarded from the remaining set.
Scientific REPORtS | (2018) 8:13673 | DOI:10.1038/s41598-018-32079-w steric clash detection. After modeling the complete structure, models with steric clashes were discarded considering the C α distance between loop residues and other residues of the protein, using a cut-off value of 3Å.
Model building. Model generation was done using a two stage procedure. First the candidate loops were superimposed on the query flanks of the template, then MODELLER was used to generate a model of the un-gapped structure with the correct amino acid sequence.

Model selection.
To rank the models, we considered two scores. The first one is the JSD score (see above) and the second one is the Discrete Optimized Protein Energy (DOPE) score implemented in MODELLER 59 . DOPE is an atomic-distance-dependent statistical potential derived from known protein structures. Our procedure returns a maximum of 10 models per loop, corresponding to the 5 models with the lowest JSD score, and 5 models with the lowest DOPE score. It has to be mentioned that some degrees of overlap may occur among the top 5 models selected by each score. This may lead to smaller number of final models (<10 models).

Loop quality assessment.
To assess the quality of the results, we use the RMSD of the loop candidates main chain heavy atoms (N, C α , C′ and O). Consistently with previous studies 32,36,43 , we use different RMSD values. The local RMSD corresponds to the RMSD measured after performing the best fit superimposition of the loop region only. In the flanked RMSD, the flanks are first superimposed, excluding the loop atoms, and the RMSD is calculated over the loop region. In the global RMSD, the template structure is superimposed on the target structure excluding the loop region, then the RMSD is calculated over the loop of interest. Solvent accessibility of the loops. We measured the solvent accessibility of the loop residues using Naccess 60 . Residues with relative solvent accessibility (RSA) ≤ 20% were considered as buried. Defining a loop as buried if less than 25% of its residues are exposed, no loop in the three test sets is buried. The median percentage of buried residues are of 29, 33 and 17% for the CASP11, CASP12 and HOMSTRAD sets, respectively.
Comparison with other approaches. In this work we compare the performance of our loop modeling protocol with two state-of-the-art ab initio methods -GalaxyLoop-PS2 32 and Rosetta Next-generation KIC (NGK) 31 , one state-of-the-art data-based approach -LoopIng 37 and one hybrid method -Sphinx 41 . The NGK runs were performed using the protocol provided by 31 , using Rosetta energy values to rank the models. GalaxyWEB was used to generate the GalaxyLoop-PS2 results. Since GalaxyWEB returns only 5 models, and does not return scores, we repeated the GalaxyWEB protocol two times to obtain 10 models per loop. Furthermore, GalaxyWEB does not accept loop modeling for loops of size more than 20 amino acids or loops belonging to proteins of more than 500 residues, which made the comparison impossible for 43 loops over the total of 168 (26% of the cases). LoopIng results were obtained using the LoopIng web-server. It can generate 10 models per loop, and returns only the loop regions, supplemented by two residues on each side of the loop. Since we use flanks of 4 amino acids, and to compare our results in a fair manner, we considered a flank size of 2 amino acids for the comparison with LoopIng. Furthermore, the web-server accepts loops of size 4 to 23 amino acids. Consequently, the comparison is not possible for 14 loops over the total of 168 (8% of the cases). We used Sphinx web-server to obtain loop predictions for all the loops in our test sets. Table 3 summarizes the number of loops considered for performance comparisons. We distinguish between ab initio and data-based search methods. Loop subsets that could be predicted by groups of approaches (Common subsets) are identified.