Improving fragment-based ab initio protein structure assembly using low-accuracy contact-map predictions

Sequence-based contact prediction has shown considerable promise in assisting non-homologous structure modeling, but it often requires many homologous sequences and a sufficient number of correct contacts to achieve correct folds. Here, we developed a method, C-QUARK, that integrates multiple deep-learning and coevolution-based contact-maps to guide the replica-exchange Monte Carlo fragment assembly simulations. The method was tested on 247 non-redundant proteins, where C-QUARK could fold 75% of the cases with TM-scores (template-modeling scores) ≥0.5, which was 2.6 times more than that achieved by QUARK. For the 59 cases that had either low contact accuracy or few homologous sequences, C-QUARK correctly folded 6 times more proteins than other contact-based folding methods. C-QUARK was also tested on 64 free-modeling targets from the 13th CASP (critical assessment of protein structure prediction) experiment and had an average GDT_TS (global distance test) score that was 5% higher than the best CASP predictors. These data demonstrate, in a robust manner, the progress in modeling non-homologous protein structures using low-accuracy and sparse contact-map predictions.


SUPPLEMENTARY INFORMATION
Text S1. Contact selection based on effective number of homologous sequences and protein lengths Text S2. Confidence score cut-offs for each contact predictor when selecting contacts Text S3. C-QUARK force field used to guide the REMC simulations Text S4. Well depth in the contact potential Table S1: Average accuracies of all-range and long-range predicted contacts used in C-QUARK Table S2: Average accuracies of the different ranges of predicted contacts used in C-QUARK Table S3: Average satisfaction rates of different ranges of contacts in the first predicted models Table S4: Average TM-scores and GDT_TS scores for the first models generated by C-QUARK and the control versions of the C-QUARK pipeline. Table S5: Average TM-scores, GDT_TS scores, and RMSDs for the first models generated by C-QUARK and the control methods on the 247 test proteins. Table S6: Average TM-scores, GDT_TS scores, and RMSDs for the first models generated by C-QUARK and the control methods on the 59 test proteins with low contact-map prediction accuracy. Table S7: Average TM-scores, GDT_TS scores, and RMSDs for the first models generated by C-QUARK and control methods on the 57 non-redundant test proteins. Table S8: Summary of C-QUARK on the 64 targets compared with the top five servers in CASP13  Table S9: Summary of C-QUARK on the 64 CASP13 targets compared with distance-based folding methods. Table S10: Performance of C-QUARK on 21 CASP13 multi-domain proteins. Table S11: Top L contact prediction accuracy for different predictors and the consensus Table S12: Top L/2 contact prediction accuracy for different predictors and the consensus Table S13: Top L contact prediction accuracy for different predictors on the training proteins Table S14: Top L/2 contact prediction accuracy for different predictors on the training proteins Table S15: Width of the first well (db) in the contact potential at various protein lengths (L) Table S16: Minimum number of contacts at different protein lengths (L) and Nf of MSA Table S17: Confidence score cut-offs of different contact predictors SUPPLEMENTARY FIGURES Figure S1: RMSD comparison between C-QUARK and QUARK. Figure S2: TM-scores of C-QUARK and QUARK at different protein lengths. Figure S3: Accuracies of predicted contacts versus those derived from the final models. Figure S4: Confidence scores versus long-range accuracies by different programs Figure S5: Contact satisfaction rate versus contact prediction accuracy. Figure S6: Effect of contact satisfaction rate on the TM-scores of the first C-QUARK models. Figure S7: A representative case showing contact satisfaction rate during the simulation trajectories. Figure S8: An example illustrating how contact restraints help C-QUARK modeling. Figure S9: TM-score comparison between C-QUARK and the control methods for 247 test proteins and its subsets. Figure S10: Contact accuracy versus Nf for the cases that are not foldable by C-QUARK. Figure S11: Case studies on proteins not folded by C-QUARK although they had reasonable contactmaps. Figure S12: Case study on CASP13 FM target, T0980s1-D1. Figure S13: Dependence of C-QUARK run time on protein length. Figure S14: Boxplot and distribution for TM-scores of the first models produced by C-QUARK on 21 CASP13 multi-domain targets and the corresponding 62 individual domains. Figure S15: TM-score comparison of C-QUARK using all ten contact predictors vs using ResPRE alone. Figure S16: Eleven local movements in the C-QUARK REMC folding simulations. Figure S17: A schematic of the contact potential used in C-QUARK. Figure S18: Correlation of contacts with Nf. Figure S19: Correlation of TM-score with the contact accuracy derived from the final models.

SUPPLEMENTARY TEXTS Text S1. Contact selection based on effective number of homologous sequences (Nf) and protein length
According to the performance shown in Table S13 and S14, based on the training proteins (Dataset S1 in SI), the contact predictors are classified into four categories: i) NeBcon 1 , ResPRE 2 and DeepPLM as "very high", ii) DeepCov 3 , Deepcontact 4 and DNCON2 5 as "high", iii) MetaPSICOV2 6 as "medium, and iv) GREMLIN 7 , CCMpred 8 and FreeContact 9 as "low". We consider the effective number of sequences (Nf) in the corresponding multiple sequence alignment (MSA) and length of the target as criteria when selecting the number of contacts from each of these categories. Here, is defined by where L is the length of the protein and nseq is the total number of sequences in an MSA.
is the sequence identity between sequence i and sequence j in an MSA, and s=0.80 is the sequence identity cut-off .
if ≥ , and 0 otherwise. For instance, at least the top L, L/2, L/4.5 and L/7.5 contacts are selected from the different confidence score categories, regardless of the length of the target, when Nf <50. On the other hand, if Nf is ≥50 and the length of the target is <120, at least the top L/2, L/3, L/5.5 and L/8.5 contacts are selected from these categories. The rational for selecting fewer contacts for the latter condition is twofold: i) small-sized proteins usually have less contacts, and ii) the contact prediction accuracy increases with an increase in the Nf value, as shown in Figs S18A and S18B, and hence fewer high-resolution contacts are necessary as restraints to produce successfully folded models. On the other hand, although the contact prediction accuracy is usually low at relatively lower Nf values, often the predicted contacts are very near to the true contacts. As a result, selecting more contacts in the former condition may capture the overall 3D contact network and possibly be helpful in modeling, since other energy potential terms help to eliminate the wrong contacts anyway as demonstrated in Fig. 4. Finally, when the length of the target is >=120 with Nf >50, more contacts are needed so that the restraints from the contacts can be imposed throughout the sequence. Therefore, at least the top L/1.25, L/2.25, L/4.75 and L/7.75 contacts are selected from the different categories. Table  S16 summarizes the number of contacts selected from the different categories that were obtained based on several trials of training using the training proteins.

Text S2. Confidence score cut-offs for each contact predictor when selecting contacts
In addition to the above mentioned criteria, we also consider an accuracy threshold for each predictor, where contacts with accuracies greater than the threshold are selected from each of the predictors. However, it is not possible to know the accuracy of predicted contacts without prior knowledge of the corresponding native structure. One solution to estimate the accuracy of predicted contacts is based on the confidence score of contacts between residue pairs, since the confidence score has a strong correlation with the accuracy of the contacts, as shown in Fig.  S4. However, due to the variation of scoring schemes in different contact predictors, the correlation is often not linear. Therefore, we choose different confidence score cut-offs for different predictors that correspond to at least a contact accuracy of 0.5, as shown with a dashed line in Fig. S4 for long-range contacts. The confidence cut-offs corresponding to an accuracy of 0.5 are summarized in Table S1 for different range contacts. The consideration of the 0.5 accuracy cut-off is primarily due to a strong linear correlation between the contact accuracy of the final models and the TM-scores of the models from C-QUARK, as shown in Fig. S19, where the PCCs are 0.904 and 0.877 for all-and long-range contacts, respectively. Such strong correlations indicate that selection of predicted contacts with an accuracy of at least 0.5 and the subsequent satisfaction of these contacts may lead to the generation of models with TM-scores 10 of at least 0.5, which is an indication of obtaining a similar fold as the native 11 . All the confidence cut-offs were determined based on the 243 training proteins (Dataset S1 in SI), which are nonhomologous (with <30% sequence identity) to the 247 test proteins discussed in this work.

Text S3. C-QUARK force field used to guide the REMC simulations
In order to guide its REMC simulations, C-QUARK uses the following force field that calculates the total energy of a conformation by summing up 12 energy terms 12 : = 1 + 2 + 3 + 4 ℎ + 5 + 6 ℎ + 7 + 8 + 9 + 10 ℎ + 11 + 12 (S2) Here, the terms account for the backbone atomic pairwise potential ( ), side-chain center pairwise potential ( ), excluded volume ( ), hydrogen bonding ( ℎ ), solvent accessibility ( ), backbone torsion angles ( ℎ ), fragment-based distance profiles ( ), radius of gyration ( ), strand-helix-strand packing ( ), helix packing ( ℎ ), distance between adjacent Cα atoms ( ), and the contact potential ( ). While the first ten terms are used in both QUARK and C-QUARK, the final term, , is unique to the C-QUARK force field and accounts for the contact restraints from the predicted contacts (see Eq. 1 and Fig. S17). In addition to the contact potential term, the 11 th energy term, which factors in the distance between adjacent Cα atoms ( ), is also a newly added term and takes the following form: where , +1 is the Cα-Cα distance between residues i and i+1, and I[ ] is the Iverson bracket, i.e., [ , +1 > 4]=1 if , +1 > 4, and 0 otherwise. This term is designed to penalize backbone breaks between adjacent residue pairs with Cα-Cα distances > 4Å, which can occur after fragment movements. All the weighting parameters in C-QUARK were re-tuned on the training protein set listed in Dataset S1, to appropriately balance the inherent force field with the contact restraints by maximizing the TM-score of the predicted models. As a result, most of the weighting parameters in 1−10 are similar to what was used in QUARK 12 despite the use of different training proteins, showing the robustness of the QUARK force field. It is interesting that the weight ( 7 ) of the distance-profile energy term increased from 0.60 to 3.00 in the C-QUARK force field to enlarge the effect of filtering out false positive contacts. The last parameter 12 is equal to 0.426 when Nf >50, and 0.355 otherwise.

Text S4. Well depth in the contact potential
The depth of the energy potential, , between residue pair i and j in the 3G contact potential is calculated as: where m is the number of contact predictors, ( ) is the confidence score of the predicted contact between residue pair i and j by the mth predictor, and ( 0.5 ) is the confidence score cut-off listed in Table S17, which corresponds to an average accuracy =0.5 for the mth predictor at R ranges (short, medium and long) based on the training proteins. Table S1: Average accuracies of all-range and long-range predicted contacts used in the C-QUARK simulations for alpha, beta and alpha-beta proteins. Additionally, the TM-score and success rate comparison between C-QUARK and QUARK on these proteins is presented, where the first models produced by the programs are considered for the comparison. The values in the parentheses of the fifth column show the p-values from one-sided Student's ttests, indicating the significance of the TM-score comparison with respect to C-QUARK. The values in the parentheses of the sixth and seventh columns represent the percentage of the cases where the models obtained similar folds as the corresponding native structures.                 Table S17: Confidence score cut-offs of different contact predictors that correspond to an accuracy of 0.5 based on the 243 training proteins (Dataset S1). Definition of the accuracy at different confidence score is discussed in Fig.  S4.  Figure S1: RMSD comparison between the first models produced by C-QUARK and QUARK based on the 247 test proteins. Points above the diagonal line indicate models with better quality produced by C-QUARK than QUARK, and vice versa.      Figure S7: A representative case, 1jiwI, to demonstrate the increase in the contact satisfaction rate as the simulation cycles progress during a representative replica from the folding simulations. Satisfaction rates of (A) all contacts and (B) top 1.5L contacts used during the C-QUARK simulations are shown, where satisfaction rates are higher for the latter case. (C) Satisfaction rate of the top 1.5L predicted contacts in the QUARK simulation, where the satisfaction rate of contacts, particularly long-range contacts, is significantly lower compared to that of C-QUARK due to the lack of contact restraints and a contact potential in QUARK. (D) TM-score comparison of the decoys produced during the representative replica of the C-QUARK and QUARK simulations as the cycles progress. The increase in TM-score as the number of cycles increases is partly due to the satisfaction of predicted contacts in C-QUARK. Figure S8: A similar representative case as Fig. S7 from 1jiwI, which illustrates how contact restraints help bring the contacting residues close to each other in 3D space as the folding simulations progress. (A), (B) and (C) represent the structure of the native, and decoys at an initial stage and at the final stage, respectively. (D) The contact-maps derived from the structures of the native protein (grey circle), and the decoys at the initial stage (blue circle in upper left triangle), and at the final stage (blue circle in lower right triangle). Additionally, the red circles in the left triangles represent the predicted contacts used as restraints in the C-QUARK simulations. All structures are colored in spectrum, with blue to red indicating the N-to C-terminal regions. BN and BC represent the beta-strands at the N-and C-termini, respectively. BN and BC are ~13.5 Å apart from each other after the first cycle of the simulation, as shown in (B), while these should be in contact (~4.4 Å) as evident from the native structure and in the contactmaps in (D) (highlighted with a rectangle in the upper left triangle). Restraints from the predicted contacts between the beta-strands help to bring these beta-strands close to each other at a distance of ~4.5 Å after 500 cycles, as shown in (C) and highlighted with a rectangle in the lower triangle of (D). Figure S9: TM-score comparison between the first models produced by C-QUARK and CNS (A), and C-QUARK and DConStruct (B) for the 247 test proteins. TM-score comparison between the first models produced by C-QUARK and QUARK (C) for the 59 test proteins with low accuracy contact-map prediction. TM-score comparison between the first of C-QUARK and trRosetta (D) for 57 test proteins without redundancy to training sets of trRosetta and all contact predictors used in C-QUARK. The dashed lines indicate the TM-score cut-off of 0.5, beyond which models are considered to obtain similar folds as the corresponding native structures. Points above the diagonal line indicate models with better quality by C-QUARK than the control methods, and vice versa. Figure S10: Accuracy of long-range and medium-range predicted contacts used in C-QUARK versus Nf for the cases that were not folded by C-QUARK. The majority of the proteins were not folded due to low contact prediction accuracy. However, four targets: 1fasA, 2vxsA, 3nikA and 4h4nA, as highlighted in the plot, have reasonable accuracies (>0.4) for the medium-and long-range predicted contacts, but are still not foldable by C-QUARK due to incorrect secondary structure prediction, and lack of predicted contacts in loop/coil regions, as demonstrated in the main text and Fig. S11. Figure S11: Example cases that are not folded by C-QUARK although the long-range and medium-range contact prediction accuracy for these proteins are reasonable (>0.4). (A), (B), (C) show the structures of native (red) and best in top five C-QUARK models (blue) for 1fasA, 3nikA and 4h4nA, respectively. (D), (E) and (F) represent the corresponding contact-maps of these proteins, respectively, for the native structure (grey circles), C-QUARK model (blue) and predicted (red) by the contact predictors. Here, contact-maps in the upper left triangles are similar to that in the lower right triangles in each contact plot. There is a mis-classification of the secondary structure predictor at the positions of 14-16 in 1fasA, where the region is classified as a coil instead of a beta-strand, as highlighted with a dashed circle in (A). As a result, the region from first position to position 16 is modeled as a coil instead of a beta sheet, where the yellow region, as highlighted in (A) is modeled as a floppy coil instead of a regular beta strand. In 3nikA, there is a lack of prediction of contacts in the coil region at the N-terminal, while there are supposed to be long range contacts between the residues at N-and that at C-terminal, as observed in the native structure and the corresponding contact-map, highlighted within dashed rectangle in (E). Consequently, the coil region at the Nterminal is not correctly modeled by the C-QUARK model, as highlighted with a dashed circle in (B). In the 4h4nA, while the regions at 27-35 and 51-58 are supposed to be beta-strands, the secondary structure predictor predicted these regions as helices. As a result, C-QUARK generates those models as helices, as shown in in (C).   Here, all 21 multi-domain targets with solved experimental structures were selected for comparison, no specific target was removed from the CASP13 released target list. The horizontal axis indicates the counts of TM-scores and the vertical axis is TM-score of predicted models, where the "minimum", "first quartile (Q1/25th Percentile)", "median (Q2/50th Percentile)", "third quartile (Q3/ 75th Percentile)", and "maximum" of each boxplot are shown by bold red lines accordingly. Figure S15: TM-score comparison of C-QUARK models produced based on contacts from all ten predictors and those from ResPRE, the contact predictor with the highest prediction accuracy, on 109 hard targets. The dashed lines indicate the TM-score cut-off of 0.50, beyond which models are considered to obtain similar folds as the corresponding native structures. Figure S16: Eleven local movements in the C-QUARK REMC folding simulations. These movements can further be divided into three levels: residue level (M1-M4), segmental level (M5-M8), and topology level (M9-M11). Movements M1, M2, and M3 randomly change one bond length, bond angle, and torsion angle of a randomly selected residue. Movement M4 substitutes these three parameters in the selected residue by the clustered values for this residue which most frequently occur in the template fragments at the position. Movement M5 substitutes one fragment in the decoy by another one randomly selected from the position-specific fragment structures. Movement M6 first randomly changes the positions of the backbone atoms in a selected segment and then tries to restrict all the bond lengths and bond angles within the physically allowable region. Movement M7 rotates the backbone atoms of a randomly selected segment around the axis connecting the two ending Ca atoms. Movement M8 shifts the residue numbers in a segment forward or backward by one residue, which means the coordinates of each residue are copied from their preceding or following residue in the segment. In movement M9, one helix is moved closer to another one. In a similar way, one β-pair is formed in movement M10. Movement M11 tries to form a β-turn motif for every 4-mer segment along the query sequence. Figure S17: A schematic of the contact potential, ( ), for a contacting residue pair i and j as defined in Eq. (1). Here, is the width of the first well, which was tuned based on the training proteins (Dataset S1), and is the depth of the energy potential that is proportional to the confidence score of the predicted contact between the residue pair i and j (Eq. S4).

SUPPLEMENTARY FIGURES
is the distance between the residue pair. The units for all the distances are in Å.