Enhancing coevolution-based contact prediction by imposing structural self-consistency of the contacts

Owing to the development of new algorithms and the growth of sequence databases, it has recently become possible to build robust higher-order sequence models from sets of aligned protein sequences. Such models have proven useful in de novo structure prediction, where they are used to find pairs of residues that co-vary during evolution and hence are likely to be in spatial proximity in the native protein. The accuracy of these algorithms, however, drops dramatically when the number of sequences in the alignment is small. We have developed a method, termed CE-YAPP (CoEvolution-YAPP), based on YAPP (Yet Another Peak Processor), which has been shown to solve a similar problem in NMR spectroscopy. By simultaneously performing structure prediction and contact assignment, CE-YAPP uses structural self-consistency as a filter to remove false positive contacts. Furthermore, CE-YAPP solves another problem, namely how many contacts to choose from the ordered list of covarying amino acid pairs. We show that CE-YAPP consistently improves contact prediction from multiple sequence alignments, in particular for proteins that are difficult targets. We further show that the structures determined from CE-YAPP are also in better agreement with those determined using traditional methods in structural biology.

methods are GaussDCA (gDCA [3]), CMAT [4] and PconsC3 [3,5]. Note that PconsC3 uses a machine learning approach that combines plmDCA, gDCA and RaptorX [6]; the latter is provided as a webserver to which we cannot supply our own input multiple sequence alignment. Since supplying the alignment is a requirement for us, in order to prevent the selection bias observed when using overly rich multiple sequence alignments, we chose to omit RaptorX and use CMAT as a replacement. We acknowledge that PconsC3 likely underperforms due to this replacement.
We predicted contacts for the Noumenon data set using the methods described above and performed contact filtering using CE-YAPP. The precision of the predicted contacts before and after applying CE-YAPP is shown in

CE-YAPP parameter selection
Several parameters of the CE-YAPP method were either manually tuned or chosen based on previous work [7]. The parameters in question are (I) the number of input contacts, N_input, (II) the number of time steps during structure calculations, N_steps, (III) the number of repeated simulations, N_repeats, and (IV) D and d_0 (Eq. 2 in main text).
CE-YAPP was generally robust to changes in the following parameters:

Number of input contacts
To select the number of input contacts, we varied X in where N_AA is the number of amino acids. We maximized the mean structural accuracy over 16 repeated simulations and over all proteins in the Noumenon data set. Here, we used the global distance test (GDT) with a single cutoff of 5 Å. In other words, we used a "low resolution" structural accuracy measure that reports the fraction of Cα atoms that are within 5 Å of their positions in the experimental PDB structure. As seen in Fig. S4, there is a peak at X = 1.2, which we chose as the final parameter. It should be noted that CE-YAPP is fairly robust to changes in X, with a mean increase of ∼0.04 GDT(5) going from X = 0.5 to X = 1.2.
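The selection and scoring steps described above can be illustrated with a minimal sketch. It rests on two assumptions that are not stated in the text: that the number of input contacts scales as N_input = X · N_AA (the relation itself is not reproduced here), and that the model and experimental Cα coordinates have already been superimposed, whereas a full GDT calculation also searches over superpositions. The function names `n_input_contacts`, `select_top_contacts` and `gdt_single_cutoff` are hypothetical and are not part of CE-YAPP.

```python
import math


def n_input_contacts(n_aa, x=1.2):
    """Number of top-ranked contacts to keep.

    ASSUMPTION: N_input = X * N_AA, rounded to the nearest integer.
    """
    return int(round(x * n_aa))


def select_top_contacts(scored_pairs, n_aa, x=1.2):
    """Keep the N_input highest-scoring residue pairs.

    scored_pairs: iterable of (i, j, score) tuples, in any order.
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)
    return ranked[:n_input_contacts(n_aa, x)]


def gdt_single_cutoff(model_ca, ref_ca, cutoff=5.0):
    """Fraction of C-alpha atoms within `cutoff` (in Angstrom) of the reference.

    ASSUMPTION: both coordinate lists are already superimposed and list
    the same residues in the same order.
    """
    if len(model_ca) != len(ref_ca) or not ref_ca:
        raise ValueError("coordinate lists must be non-empty and equally long")
    within = sum(
        1 for a, b in zip(model_ca, ref_ca) if math.dist(a, b) <= cutoff
    )
    return within / len(ref_ca)
```

Under these assumptions, scanning X amounts to calling `select_top_contacts` for each candidate value, running the structure calculations, and averaging `gdt_single_cutoff` over repeats and proteins.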

Equilibrium Simulations
To examine the dynamics of λ_i (Eqs. 2 and 3, main text) in an equilibrium framework, we performed Monte Carlo simulations of a 20-amino-acid peptide called GSGS, which natively forms a three-stranded anti-parallel β-sheet. We manually selected two true Cβ-Cβ contacts between each interface of strands, and a single false contact from strand one to strand three were set to 7.0 Å and 1 Å, respectively. To obtain converged statistics, we selected a simulation temperature, T = 2, at which the protein would unfold and refold in a single simulation.
During the simulations, we monitored the structural RMSD with respect to the native structure as well as the five λ-values (Fig. S6). We find that the λ-values for the true contacts (Fig. S6b and c) are highly dynamic; in contrast, the λ-value for the false contact remains close to zero (Fig. S6d). We observe a correlation between the λ-values for the contacts between strands 1 and 2 and the structural RMSD. Indeed, when the RMSD is large (≈8 Å), the λ-values are close to zero (i.e. these contacts are turned off), consistent with the idea that restraints are turned off when they cannot be satisfied by the geometry of the protein. Interestingly, however, when the RMSD is lower (≈3 Å), the λ-values tend to fluctuate between one and zero. To show the correlation between the RMSD and the λ-values more directly, we plot a 2D histogram of the sum of the four λ-values for the true contacts vs. the RMSD (Fig. S7).
We find that low RMSD (≈3 Å) correlates with higher Σ_{i=1}^{4} λ_i, which is consistent with the idea that, given a fairly accurate structure, the true contacts are more likely to be turned on.
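The analysis behind such a 2D histogram can be sketched as follows. This is only an illustration of the binning, not the authors' analysis code: the helper names `lambda_sums` and `histogram2d` are hypothetical, and the sketch assumes that each simulation frame stores its five λ-values in a tuple with the true contacts at known indices.

```python
def lambda_sums(lambda_traj, true_idx):
    """Per-frame sum of the lambda values belonging to the true contacts."""
    return [sum(frame[k] for k in true_idx) for frame in lambda_traj]


def _bin_index(value, edges):
    """Index of the bin containing `value`, or None if it is out of range.

    Bins are half-open [e_k, e_{k+1}), except the last, which also
    includes its right edge.
    """
    if value < edges[0] or value > edges[-1]:
        return None
    for k in range(len(edges) - 1):
        if value < edges[k + 1]:
            return k
    return len(edges) - 2


def histogram2d(xs, ys, x_edges, y_edges):
    """Count (x, y) samples on a grid; counts[i][j] is x-bin i, y-bin j."""
    counts = [[0] * (len(y_edges) - 1) for _ in range(len(x_edges) - 1)]
    for x, y in zip(xs, ys):
        i, j = _bin_index(x, x_edges), _bin_index(y, y_edges)
        if i is not None and j is not None:
            counts[i][j] += 1
    return counts
```

With per-frame RMSD values as `xs` and the output of `lambda_sums` as `ys`, a concentration of counts at low RMSD and high λ-sum would correspond to the correlation described above.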