Enhancing protein backbone angle prediction by using simpler models of deep neural networks

Protein structure prediction is a grand challenge. Prediction of protein structures via representations using backbone dihedral angles has recently achieved significant progress along with the on-going surge of deep neural network (DNN) research in general. However, we observe that in protein backbone angle prediction research, there is an overall trend to employ more and more complex neural networks and then to throw more and more features at them. While more features might add more predictive power to a neural network, we argue that redundant features could rather clutter the scenario, and more complex neural networks might then merely counterbalance the noise. From artificial intelligence and machine learning perspectives, problem representations and solution approaches do mutually interact and thus affect performance. We also argue that comparatively simpler predictors can be reconstructed more easily than more complex ones. With these arguments in mind, we present a deep learning method named Simpler Angle Predictor (SAP) to train simpler DNN models that enhance protein backbone angle prediction. We then empirically show that SAP significantly outperforms existing state-of-the-art methods on well-known benchmark datasets: for some types of angles, the differences are above 3 in mean absolute error (MAE). The SAP program along with its data is available from the website https://gitlab.com/mahnewton/sap.


Scientific Reports | (2020) 10:19430 | https://doi.org/10.1038/s41598-020-76317-6

Input features. As shown in Fig. 2, we use a sliding window of size W: up to ⌊W/2⌋ amino acids at each side of a given amino acid. Depending on the window size, sliding windows can capture short or long range interactions between residues and secondary structures. Some backbone angle prediction methods that use recurrent neural networks (RNNs) and CNNs take whole protein sequences as input to capture interactions across the entire protein. However, in the absence of a firmly known energy function, it is not clear whether very long range interactions are really effective. Any choice between using sliding windows and using entire proteins therefore has to be made on the basis of empirical evaluation. To make this clearer, in any distance-based energy component, e.g. Lennard-Jones or charge-based potentials, the values are in effect zero beyond a certain distance. Moreover, the state-of-the-art backbone angle prediction method SPOT-1D, besides using entire proteins, still uses windowing to capture contact information. Our intent in this work is to explore simple models that can still achieve very good accuracy levels.
While the window size effectively ensures context dependence of assumed local conformations, arguably there is not enough data in the training set, even in the protein data bank, to cover all possible combinations of amino acids (e.g. 20^5) for a given window size (e.g. 5). So the context has to be captured via a 3-state or an 8-state model that can specify the average range of angle values for each amino acid in a given protein. The data deficiency for larger windows spoils the training even further. In this work, for each amino acid, we consider one of the 8 values G, H, I, T, S, E, B, and C to represent predicted 8-state SS and then encode that value using a one-hot vector. The 8-state SS prediction is obtained by running SSpro8 14 on each protein. The training set of SSpro8 comprises 5772 proteins released before August 20, 2013. SSpro8 uses sequence similarity and sequence-based structural similarity in SS prediction and achieves 92% and 79% accuracy respectively on proteins with and without homologs in the PDB. On the one hand, we have already discussed that these highly accurate SS predictions do not necessarily solve the backbone angle prediction problem when high quality protein structures are to be constructed. On the other hand, we note that we have removed all of SSpro8's training proteins from our training, validation, and test sets, using BLAST 28 with e-value 0.01 for this purpose. In this aspect, our method differs from the state-of-the-art backbone angle predictor SPOT-1D, which uses homologous sequences to generate its HMM-based features.
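The one-hot encoding of the 8-state SS labels can be sketched in a few lines. This is a minimal illustration; the ordering of the states below follows the listing in the text but is otherwise our own assumption:

```python
# One-hot encoding of an 8-state secondary structure label.
# The ordering of the states is an illustrative assumption.
SS8_STATES = ['G', 'H', 'I', 'T', 'S', 'E', 'B', 'C']

def one_hot_ss8(label):
    """Return an 8-dimensional one-hot vector for an 8-state SS label."""
    vec = [0.0] * len(SS8_STATES)
    vec[SS8_STATES.index(label)] = 1.0
    return vec
```

For example, `one_hot_ss8('H')` sets only the second component to 1, so exactly one of the 8 inputs fires per residue.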
For each amino acid, we consider 20 values obtained from the PSSM matrix generated by three iterations of PSI-BLAST 28 against the UniRef90 sequence database updated in April 2018. We also use 7PCP (seven physicochemical properties) and ASA (accessible surface area), and experiment with their various combinations. These features are very common in the literature.
In summary, we have 20 + 8 = 28 PSSM and SS features, plus 7 or 1 additional feature values per residue when 7PCP or ASA is used, for each amino acid residue in each protein. This count is then multiplied by the size of the sliding window used. We experiment with sliding windows of sizes 1, 5, 9, 13, 17, and 21, as SPIDER 23 tried sizes up to 21.
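The assembly of per-residue feature vectors into sliding-window inputs can be sketched as follows. This is a generic sketch under our own assumptions; in particular, zero-padding at the protein termini is an illustrative choice, not necessarily SAP's exact handling:

```python
import numpy as np

def window_features(per_residue, w):
    """Stack per-residue feature vectors into sliding windows of size w.

    per_residue: (L, F) array of F features per residue
                 (e.g. 20 PSSM + 8 SS + 7 7PCP = 35 features)
    w: odd window size; each input covers (w - 1) // 2 residues on each side.
    Residues beyond the termini are zero-padded (an illustrative assumption).
    Returns an (L, w * F) array, one row of flattened window features per residue.
    """
    L, F = per_residue.shape
    half = (w - 1) // 2
    padded = np.pad(per_residue, ((half, half), (0, 0)))
    return np.stack([padded[i:i + w].ravel() for i in range(L)])

# e.g. 35 features per residue and window size 5 give 5 * 35 = 175 inputs
X = window_features(np.random.rand(100, 35), 5)
```

With window size 1 the function degenerates to the per-residue features themselves, which matches the smallest setting tried in the experiments.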
Predicted outputs. We consider 4 outputs, one for each of the φ, ψ, θ, and τ angles. Each φ and ψ can be associated with exactly one residue or Cα. A θ angle involves three consecutive Cα atoms, while a τ angle involves four. In one set of experiments, we consider these angles directly, handling their periodicity (−180° to 180°) within the loss function of the DNN used. In another set of experiments, just like the state-of-the-art method SPOT-1D, we use both sine and cosine ratios for each of the 4 angles, and thus use 8 outputs. The trigonometric ratios handle the periodicity issue of the angles, and the tangent values obtained from the sine and cosine values can give the predicted angle within −180° to 180°.

DNN architecture. Figure 3 shows the DNN architecture used in our method. The DNN is in fact a fully connected neural network (FCNN) with three hidden layers, each having 150 neurons. This architecture is similar to those used in SPIDER 23 and SPIDER2 24. SPIDER2, however, uses a series of 3 DNNs, feeding each DNN's output as input to the next DNN. In our experiments, we have used only one DNN with three hidden layers, although we have trialled two and four hidden layers as well and show those results later. The inputs and the outputs of the DNN are on a per-amino-acid basis. Depending on the size of the sliding window and the combinations of 7PCP and ASA, the input layer has different numbers of inputs. The output layer has one output for each angle when we predict an angle directly. However, if we consider the sine and cosine ratios of an angle and later calculate the angle from them, then the output layer has two outputs for each angle.

DNN implementation. The DNN has been implemented in Python using the Keras library and the SGD optimiser with momentum 0.9. The learning rate starts from 0.01; if the loss function does not improve in 3 iterations, the learning rate is reduced by a factor of 0.5 until it reaches 10^−15.
The activation function is linear in the output layer and sigmoid in the input and hidden layers. The kernel initialiser is glorot_uniform. We run our programs on NVIDIA Tesla V100-PCIE-32GB machines.
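A minimal sketch of the described architecture (three hidden layers of 150 neurons, sigmoid hidden activations, a linear output layer, Glorot-uniform initialisation), written here as a plain NumPy forward pass rather than the actual Keras code, so the layer shapes and activations are explicit:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot (Xavier) uniform initialisation."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def build_fcnn(n_inputs, n_outputs, n_hidden=3, width=150, seed=0):
    """Create (weight, bias) pairs for an FCNN with n_hidden layers of `width` neurons."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs] + [width] * n_hidden + [n_outputs]
    return [(glorot_uniform(a, b, rng), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """Sigmoid activations on hidden layers, linear activation on the output layer."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:            # hidden layer
            x = 1.0 / (1.0 + np.exp(-x))   # sigmoid
    return x

# e.g. window size 5 with 35 features per residue -> 175 inputs, 4 angle outputs
net = build_fcnn(175, 4)
y = forward(net, np.random.rand(8, 175))
```

The input width (175 here) is just one example combination of window size and feature set; the paper's actual training additionally uses the SGD momentum and learning-rate schedule described above.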
Benchmark datasets. We briefly describe the dataset used by SPOT-1D 27. This dataset has 12450 proteins that were culled from PISCES 32 in Feb 2017 with the constraints of high resolution (< 2.5 Å), an R-free < 1, and a sequence identity cutoff of 25% according to BlastClust 28. Among those proteins, 1250 proteins deposited after June 2015 were separated into an independent test set, leaving 11200 proteins, which were then randomly divided into a training set (10200 proteins) and a validation set (1000 proteins). Then, some proteins were removed for efficient calculation. This reduced the training, validation, and independent test sets to 10029, 983, and 1213 proteins, respectively. In the SPOT-1D dataset, another independent test set was obtained from the PDB. These proteins were released between January 01, 2018 and July 16, 2018 and solved with resolution < 2.5 Å and R-free < 0.25. In order to minimise evaluation bias associated with partially overlapping training data, proteins with > 25% sequence identity to structures released prior to 2018 were removed. This dataset was also filtered to remove redundancy at a 25% sequence identity cutoff, and another 13 proteins with length > 700 were removed, leaving 250 high-quality, non-redundant targets. For convenience, these two independent test sets are denoted TEST2016 (1213 proteins) and TEST2018 (250 proteins) as they were deposited between June 2015 and Feb 2017 and between Jan 2018 and July 2018, respectively. We use the same dataset used by SPOT-1D 27. However, we have performed additional filtering since it is not precisely clear to us how SPOT-1D handles proteins that have mismatches in the amino acid sequences specified in various data source files (e.g. .t, .pssm, .dssp, and .fasta files).
To be clearer, we have found that for some proteins, the amino acid sequence specified in one data source file has additional residues at the beginning or end compared to that specified in another data source file. For such proteins, we have taken the part common to the amino acid sequences specified in the various source files. However, when there is any mismatch in the middle of any two amino acid sequences specified in two different data source files for the same protein, we have removed the protein from the dataset. Also, we have removed proteins that have X in the secondary structure sequences in their corresponding DSSP files, although we do not use the secondary structure data from the DSSP files in our learning model. As mentioned before, apart from using subsets of features from SPOT-1D, we generate 8-state SS predictions using SSpro8 14. The training set for SSpro8 comprised 5772 proteins released in the PDB before August 20, 2013. In order to avoid over-training with SSpro8 predictions as input to our method, we have removed 3259 proteins from SPOT-1D's proteins using BLAST 28 against SSpro8's training set with e-value 0.01. We show in Table 1 the numbers of proteins and residues in the training, validation, and testing datasets after performing the above-mentioned filtering. As we can see later in Table 5, the remaining dataset after performing the filtering does not degrade the performance of SPOT-1D.
While our main training and test proteins are from the SPOT-1D dataset, for further independent testing, we use the PDB150 34 and CAMEO93 35 datasets. The PDB150 dataset contains 150 proteins released between February 1, 2019 and May 15, 2019. For each protein, PSI-BLAST 28 was applied against the whole CullPDB 32 dataset with e-value smaller than 0.005. The CAMEO93 dataset contains 93 proteins released between February 2020 and March 2020 and has been used by OPUS-TASS in its evaluation. For both datasets, we have applied a 25% sequence similarity cutoff w.r.t. our and SSpro8's training and validation datasets and have also removed proteins having X in their fasta files. For proteins with discontinuity in their amino acid sequences, we have considered the largest segment of each protein so that our sliding window method can still be applied. At the end, we have obtained 71 and 55 proteins from the PDB150 and CAMEO93 datasets respectively; we use them for independent testing of our method and the state-of-the-art method OPUS-TASS and compare their performance.

Results
We compare various settings of SAP to find the best setting for each of the 4 types of angles to be predicted. This comparison helps us understand the impact of various features and encodings. Then, we compare the best settings with the current state-of-the-art predictors. Moreover, we show various other analyses of the results obtained for the best settings.
Calculating absolute errors. For each predicted angle P against the actual angle A, we calculate the difference D = |P − A|. Then, we take AE = min(D, 360 − D) as the absolute error (AE) for that predicted angle. This addresses the periodicity issue that each angle must be in the range −180° to 180°. When angles are predicted directly, we implement the AE calculation within the loss function for training and validation, and also later for testing. When we use sine and cosine ratios, we calculate AE only during testing. In all cases, the angles that are not defined for the amino acids at the beginning or end of the proteins are ignored.
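The AE calculation above, together with the two-argument arctangent decoding of the sine/cosine outputs mentioned earlier, can be written directly from the definitions. This is a generic sketch of these standard formulas, not SAP's actual loss-function code:

```python
import math

def decode_angle(sin_val, cos_val):
    """Recover an angle in degrees from predicted sine and cosine values.

    atan2 resolves the correct quadrant, so the result lies in (-180, 180].
    """
    return math.degrees(math.atan2(sin_val, cos_val))

def absolute_error(predicted, actual):
    """Periodicity-aware absolute error between two angles in degrees.

    Both angles are in [-180, 180]; the error is the shorter way around
    the circle, so it is always at most 180.
    """
    d = abs(predicted - actual)
    return min(d, 360.0 - d)
```

For example, 170° and −170° are only 20° apart around the circle, not 340°, which is exactly the case the min(D, 360 − D) formula handles.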
Determining best settings. We run 96 settings of SAP. All of these settings have the 20 PSSM and 8 SS one-hot features. The 96 settings are obtained by using or not using ASA, by using or not using 7PCP, by using range-based or Z-score-based normalisation for input feature encoding, by using 6 window sizes (1, 5, 9, 13, 17, 21), and by using direct angles or trigonometric ratios to encode the output angles. However, Table 2 presents the performance of 16 settings only, selecting the best window size for each combination of the other parameters. From these results, it appears that window sizes 5 and 9 in most cases lead to better performance. Moreover, prediction of direct angles is better than prediction of trigonometric ratios. While not using ASA appears to be better than using it, in contrast, using 7PCP appears to be better than not using it. Overall, the best SAP setting uses 7PCP, range-based normalisation, direct angle prediction, and window size 5. Henceforth, we use this setting in further analysis. It is worth noting here that in our observation, training a DNN simultaneously for several outputs is not much different from training the DNN separately for each output in terms of the accuracy level obtained for each output. All results presented in Table 2 are for DNNs having 3 hidden layers. The choice of the number of layers was inspired by SPIDER 23. However, in Table 3, we show the performance of the best SAP setting when run with DNNs having 2 and 4 hidden layers. In most cases DNNs having 3 hidden layers obtain the best results (shown in bold in Table 3); where this is not the case, DNNs with 3 hidden layers are a close second (shown in italics in Table 3), with the difference being < 0.09. So for the rest of the paper, we have chosen the DNN with 3 hidden layers as the selected SAP setting.
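The count of 96 settings follows directly from the grid of options enumerated above, as a quick check:

```python
from itertools import product

asa = ['yes', 'no']                  # use ASA or not
pcp = ['yes', 'no']                  # use 7PCP or not
norm = ['range', 'z-score']          # input feature normalisation method
windows = [1, 5, 9, 13, 17, 21]      # sliding window sizes
output = ['direct', 'trig']          # output angle representation

settings = list(product(asa, pcp, norm, windows, output))
# 2 * 2 * 2 * 6 * 2 = 96 settings in total
```

Table 2 then keeps, for each of the 2 × 2 × 2 × 2 = 16 combinations of the non-window parameters, only the best-performing window size.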
Performing cross-validation. When we train a DNN, we specify the validation set. Consequently, the MAE values for the validation set as well as for the testing set for each SAP setting are shown in Table 2. In Table 4, we again show the MAE values but only for the best setting of SAP. However, to check the robustness of SAP, we perform 10-fold cross-validation, where the training and validation sets are merged. The merged proteins are then randomly divided into 10 folds. Then, 9 out of the 10 folds are used in turn for training while the remaining one is used for testing.

Table 2. Performance of SAP settings on 1206 testing proteins. In the table, column ASA denotes whether accessible surface area is used (Yes/No), column 7PCP denotes whether the 7 physicochemical properties are used (Yes/No), column OR denotes whether the output representation is direct angles (D) or trigonometric ratios (R), column NM denotes whether the normalisation method for input feature encoding is [0, 1] range based (R) or Z-score based (Z), and WS denotes the best size of the sliding window. The emboldened cells denote the best performance for each combination of ASA and 7PCP, while the boxed plus emboldened cells in each respective column denote the best performance among all SAP settings.

Comparison with state-of-the-art predictors. We mainly compare the performance of SAP with that of SPIDER2 24, SPOT-1D 27, and OPUS-TASS 6 in Table 5. We have run these systems on the testing dataset used in this work, which is a subset of the SPOT-1D dataset because of our more rigorous filtering. Moreover, we use the 71 and 55 proteins from the PDB150 34 and CAMEO93 35 datasets obtained after the filtering mentioned before. However, we also compare SAP's performance with that of SPIDER2, SPOT-1D, and OPUS-TASS as reported in the respective publications. Below we briefly describe SPIDER2, SPOT-1D, and OPUS-TASS.
38 , which classifies 20 residues into 19 rigid-body blocks depending on their local structures. It also introduces a new constrained/output feature named CSF3 39 , which is a local backbone structure descriptor. Further, it uses a multi-task learning strategy 40 to maximise generalisation of the neural network and an ensemble of neural networks for further improvement.
Since SPOT-1D and OPUS-TASS report their performance on the two subsets TEST2016 and TEST2018 of the testing proteins, we do the same, although we also show the accumulated results for all testing proteins. Notice from Table 5 that SAP significantly outperforms both SPOT-1D and OPUS-TASS in all cases. We have performed t-tests to compare the performances of SPOT-1D and OPUS-TASS with SAP, and the p values are < 0.01 in all cases, indicating that the differences are statistically significant. The differences are particularly large for ψ and τ. These results demonstrate the effectiveness of SAP in enhancing protein backbone angle prediction accuracy.
Although our main results are in Table 5, to test the generality of SAP's performance over other datasets, we have run SAP on the 71 proteins of the PDB150 dataset and the 55 proteins of the CAMEO93 dataset. In Table 6, we also compare SAP's performance with SPOT-1D's performance on the PDB150 proteins and with OPUS-TASS's performance.

Table 6. Performances of SPIDER2, SPOT-1D, OPUS-TASS, and SAP on the filtered PDB150 and CAMEO93 proteins. The emboldened values are the winning numbers for the corresponding types of angles and datasets. OPUS-TASS does not predict θ and τ angles while the other three methods predict all four types of angles.

Comparison on secondary structure groups.

Comparison on amino acid groups.

Using angle ranges from predicted secondary structures. Given the SS predictions and their suggested ranges of φ and ψ values as shown in Table 8.

Comparison of angle distributions. Figure 5 shows the distributions of the actual angles and the predicted values obtained from SAP, OPUS-TASS, SPOT-1D, and SPIDER2. As we can see from the charts, the distribution of values predicted by SAP aligns very well with the distribution of the actual values. The peaks and troughs of the distributions align quite well; even multiple peaks and troughs are captured well. While the peaks of the predicted distributions are larger and narrower than those of the actual distributions, the troughs of the predicted distributions are rather smaller and wider than those of the actual distributions. When SAP's curves are compared with OPUS-TASS's, SPOT-1D's, and SPIDER2's, we see SAP's curves are occasionally closer to the curves for the actual values. We also see that the distributions of the φ and ψ angles for OPUS-TASS and SPOT-1D are almost identical. Notice that the largest peaks of the predicted values are higher than the largest peaks of the actual values.
One noticeable fact is in the θ chart: there are actual values between 0 and 90, although with near-zero probability, and these values are largely not captured by the predictors. Overall, there is a tendency to predict the peak values with probabilities larger than those of the actual values.

The results below are obtained by running all of the systems on our datasets.
Protein structure generation and refinement. Given the improvement in angle prediction accuracy, an interesting question is as follows: "Can predicted angles be directly employed in building accurate protein structures?" The direct answer to this question is yes, provided we reach a very high accuracy level. This is indeed the aim of this study: to enhance the performance gradually to a level that would allow protein structures to be predicted with very high accuracy, which is very challenging. Given the 27 proteins in our TEST2018 set, we have tried to generate entire protein structures from the predicted values obtained from SAP, OPUS-TASS, and SPOT-1D, assuming ω = 180° and standard bond distances. From Fig. 6, we can see very high root mean square distance (RMSD) values for most proteins; only for 2-3 proteins are the RMSD values less than 6 Å, a distance considered to be practically meaningful. Although this is the case with direct structure generation, structure refinement via ab initio structure sampling and evaluation using perturbation techniques would still obtain significant help. This is because, given a prediction ρ and an estimated error ε, one can, with some level of certainty, focus the search within the region [ρ − ε, ρ + ε]. These soft constraints can thus reduce the search space significantly. With more proteins having more dihedral angles predicted with smaller absolute errors, ab initio or refinement search for protein structures would benefit more from SAP's predictions than from OPUS-TASS's or SPOT-1D's.
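The soft-constraint idea can be sketched as restricting ab initio angle sampling to the predicted region. This is a hypothetical helper for illustration only; SAP itself does not perform structure search:

```python
import random

def sample_constrained_angle(rho, eps, rng=random):
    """Sample a dihedral angle uniformly from [rho - eps, rho + eps].

    rho is the predicted angle and eps the estimated error, both in
    degrees; the result is wrapped back into [-180, 180).
    """
    angle = rng.uniform(rho - eps, rho + eps)
    return (angle + 180.0) % 360.0 - 180.0
```

For example, with ρ = 175° and ε = 10°, samples stay within 10° of the prediction around the circle, wrapping past 180° to negative values when needed; instead of searching the full 360° range per angle, the search explores only 2ε of it.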

Comparison on correct prediction per protein.
Following the discussion regarding structure generation and refinement, we compare SAP, OPUS-TASS, and SPOT-1D on what portions of the angles of the proteins are predicted within certain error levels. Figure 7 shows the percentages of proteins that have a given percentage of particular angles with absolute errors at most a given threshold. We choose the threshold values to be 6 and 18 in the charts. Notice that SPOT-1D's and OPUS-TASS's performances are very close in the charts for φ and ψ. Moreover, SAP outperforms the other two methods for all angles at all threshold levels.

Conclusions
Input features and neural network architectures interact with each other when employed in prediction systems. Consequently, simply including more features might cause cluttering, and more complex networks might then be needed to counterbalance the noise. In protein backbone angle prediction research, the existing state-of-the-art prediction method uses ensembles of several types of deep neural networks and a large number of features. In this paper, we present simpler deep neural network models for protein backbone angle prediction. Our models use fewer features and simpler neural networks, but on a standard benchmark dataset they obtain significantly better mean absolute errors than the state-of-the-art predictor. Our program named Simpler Angle Predictor (SAP) along with its data is available from the website https://gitlab.com/mahnewton/sap.
Received: 6 May 2020; Accepted: 23 October 2020

Figure 7. Percentages of proteins (y-axis) that have a given percentage of residues (x-axis) with AE at most a given threshold T, where T is 6 or 18, denoted by T6 and T18. The lower the threshold, the better the prediction quality.