Introduction

Protein structure prediction (PSP) has remained an unsolved problem for the last half century1. The three-dimensional structures of most proteins depend on their amino acid (AA) sequences. The PSP problem is to determine the three-dimensional structure of a given protein just from its amino acid sequence. The difficulties come from the need to search an astronomically large conformation space and from the absence of a highly accurate energy function to evaluate potential protein conformations2.

There are 20 types of amino acids. A protein might have any of the 20 types of amino acids any number of times in any order, subject to stoichiometric constraints3. Each amino acid has three common atoms N, \(C_\alpha\), and C, among others. The C and N atoms of every two consecutive amino acids in a protein form a peptide bond, and thus we obtain the backbone or main chain of the protein. As shown in Fig. 1, protein backbone structures can essentially be represented by dihedral angles \(\phi\), \(\psi\), and \(\omega\), which are respectively defined by taking every four consecutive atoms from the sequence \(C_{i-1}\), \(N_i\), \(C_{\alpha _i}\), \(C_i\), \(N_{i+1}\), \(C_{\alpha _{i+1}}\). Typically \(\omega\) is fixed at \(180^\circ\) for the majority of proteins4, and so only \(\phi\) and \(\psi\) are to be determined. Besides being part of the main chain, each amino acid, starting from its \(C_\alpha\) atom, has a side chain as well. The side chains have their own dihedral angles, but we consider them out of scope for this work. Once backbone structures can be predicted with very high accuracy, side chain angles can be predicted or determined later. Besides the \(\phi\), \(\psi\), and \(\omega\) angles, as shown in Fig. 1, the \(\theta\) and \(\tau\) angles provide an alternative representation for protein backbone structures. While \(\theta\) is a planar angle defined by three consecutive \(C_\alpha\) atoms, \(\tau\) is a dihedral angle defined by four consecutive \(C_\alpha\) atoms. Such a representation is possible because of the nearly constant distance between consecutive \(C_\alpha\) atoms. While \(\phi\) and \(\psi\) are dihedral angles each involving four atoms from two consecutive residues, \(\theta\) and \(\tau\), involving three or four residues, capture more of the local structure in a protein. In this work, we predict all four types of backbone angles \(\phi\), \(\psi\), \(\theta\), and \(\tau\) for each residue in a given protein using deep neural networks (DNNs).

Figure 1

Backbone angles of a protein structure.

Prediction of protein backbone structures is very important since both template-based and template-free protein structure prediction rely strongly on it2,5. From an abstraction-based perspective, protein backbone structure prediction can be viewed as prediction of secondary structures (SSs). Protein secondary structure prediction has achieved significant success over the years through the use of various types of deep neural networks and their ensembles6,7,8,9,10,11,12 and ab initio methods13. For example, SSpro814 achieves 79% accuracy on proteins with no homologs in the Protein Data Bank (PDB) and 92% accuracy on proteins for which homologs can be found in the PDB. However, this progress does not necessarily make backbone angle prediction trivial. With accurate SS predictions, one can obtain narrow ranges (about \(20^\circ\)) of \(\phi\) and \(\psi\) angles, but only for helices and sheets. For coils, \(\phi\) and \(\psi\) can take any value in \([-180^\circ, +180^\circ]\), and coils comprise about 40% of the residues in average proteins15. Moreover, errors in backbone angle prediction in one part of a protein have a cascading effect on the construction of the entire protein structure. Overall, secondary structures, on one hand, are coarse-grained descriptions of protein local structures in three (helices, sheets, and coils) or eight discrete states (including some variants of the three). On the other hand, secondary structures are somewhat arbitrarily defined, with coils essentially having no well-defined structures. In contrast to secondary structures, backbone angles, being continuous variables, can represent protein structures at greater accuracy levels. Moreover, predicted backbone dihedral angles, compared to predicted secondary structures, have been found to be more useful in ab initio structure prediction or refinement by search16,17. Protein backbone angle prediction has improved over the years. A number of methods have been developed to predict \(\phi\) and \(\psi\) as both discrete18,19 and continuous9,20,21,22,23,24,25,26,27 labels.

Protein backbone angle prediction methods in recent years are mostly based on DNNs and their complex variants such as stacked sparse auto-encoder neural networks23, long short-term memory (LSTM) bidirectional recurrent neural networks (BRNNs)6,25,27, and Residual Networks (ResNets)27, along with their ensembles6,27 or layered iterations24. In terms of input features, position specific scoring matrices (PSSM) produced by PSI-BLAST28 have been used by most methods9,23,24,25,27. Moreover, 7 physicochemical properties (7PCP), namely steric parameter (graph shape index), hydrophobicity, volume, polarisability, isoelectric point, helix probability, and sheet probability29, have been used as well9,23,24,25,27. Other input features that have been used include accessible surface area (ASA)23, Hidden Markov Model (HMM) profiles9,27,30 produced by HHBlits31, contact maps27, and PSP196. In order to capture local structures around each given amino acid, sliding windows of various sizes have been used23,24,25. Moreover, to capture the non-local or long-range interactions among amino acids in a protein, the entire protein sequence has been used as features9,24,26, or convolutional neural networks (CNNs)6,30 or LSTM-BRNNs25,27 have been used. In terms of datasets used to evaluate the prediction models, we refer to four datasets: PISCES32, SPOT-1D27,33, PDB15034, and CAMEO9335. The first two datasets have respectively about 5.5K and 12.5K proteins with 1.2M and 2.7M residues. The last two datasets respectively have 150 and 93 proteins and have been used mainly for independent testing.

Given the literature explored above, we observe that in protein backbone angle prediction research there is an overall trend to employ more and more complex neural networks and to throw more and more features at those networks. While more features might add more predictive power to a neural network, we argue that redundant features rather clutter the input, and more complex neural networks are then needed just to counterbalance the noise. Similar results have been reported in other research areas. For example, in a Nature article on seismic aftershock prediction by deep learning methods36, a simple two-parameter logistic regression (that is, one neuron) is shown to obtain the same performance as a 13,451-parameter DNN. From artificial intelligence and machine learning perspectives, problem representations and solution approaches interact with each other and thus affect performance. Nevertheless, we also argue that comparatively simpler predictors can be reconstructed more easily than more complex ones. With these arguments in mind, we present a deep learning method named Simpler Angle Predictor (SAP) to train simpler DNN models that enhance protein backbone angle prediction. We then empirically show that SAP significantly outperforms the existing state-of-the-art methods SPOT-1D and OPUS-TASS6 on well-known benchmark datasets: for \(\psi\) and \(\tau\), the differences are above 3 in mean absolute error (MAE). With ensembles of several types of DNNs and many input features, SPOT-1D and OPUS-TASS are very complex prediction methods compared to SAP, which uses just a fully connected DNN and a few input features. The SAP program along with its data is available from the website https://gitlab.com/mahnewton/sap.

Methods

In this section, we describe the deep learning model proposed in this paper and the datasets used in this work.

Input features

As shown in Fig. 2, we use a sliding window of size W: up to \(\lfloor \tfrac{W}{2} \rfloor\) amino acids on each side of a given amino acid. Depending on the window size, sliding windows can capture short or long range interactions between residues and secondary structures. Some backbone angle prediction methods that use recurrent neural networks (RNNs) and CNNs take whole protein sequences as input to capture interactions across the entire protein. However, in the absence of a firmly known energy function, it is not clear whether very long range interactions are really effective. So any choice between using sliding windows and using entire proteins has to be made based on empirical evaluation. To make this clearer, in any distance-based energy component, e.g. Lennard–Jones or charge-based potentials, the values are effectively zero beyond a certain distance. Moreover, if we look at the state-of-the-art backbone angle prediction method SPOT-1D, we see that, besides using entire proteins, it still uses windowing to capture contact information. Our intent in this work is to explore simple models that can still achieve very good accuracy levels.

Figure 2

Sliding window of size 5: two residues on each side of a given residue.
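As a minimal illustration of this windowing, the sketch below (in Python, with assumed function and variable names) zero-pads the per-residue feature matrix and concatenates the features of the W residues centred at each position; the zero-padding at the termini is an assumption made only for illustration.

```python
import numpy as np

def window_features(features, window_size):
    """Concatenate the features of the `window_size` residues centred at each
    position of a protein.

    features    : (L, F) array with one F-dimensional feature vector per residue.
    window_size : odd window size W; W // 2 residues are taken on each side.

    Positions near the protein termini are zero-padded.
    Returns an (L, W * F) array, one input vector per residue.
    """
    half = window_size // 2
    length, num_feats = features.shape
    padded = np.zeros((length + 2 * half, num_feats), dtype=features.dtype)
    padded[half:half + length] = features
    # For residue i, take rows i .. i + W - 1 of the padded matrix.
    return np.stack([padded[i:i + window_size].reshape(-1)
                     for i in range(length)])

# Example: a toy protein of 10 residues, 28 features each, window size 5.
toy = np.random.rand(10, 28)
print(window_features(toy, 5).shape)   # (10, 140)
```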

While the window size effectively ensures context dependence of the assumed local conformations, arguably there is not enough data in the training set, or even in the protein data bank, to cover all possible combinations of amino acids (e.g. \(20^5\)) for a given window size (e.g. 5). So the context has to be captured via a 3-state or an 8-state model that can specify the average range of angle values for each amino acid in a given protein. The data deficiency for larger windows spoils the training even further. In this work, for each amino acid, we consider one of the 8 values G, H, I, T, S, E, B, and C to represent the predicted 8-state SS and then encode that using a one-hot vector. The 8-state SS prediction is obtained by running SSpro814 on each protein. The training set of SSpro8 comprises 5772 proteins released before August 20, 2013. SSpro8 uses sequence similarity and sequence-based structural similarity in SS prediction and achieves respectively 92% and 79% accuracy on proteins with and without homologs in the PDB. On one hand, we have already discussed that these highly accurate SS predictions do not necessarily solve the backbone angle prediction problem when high quality protein structures are to be constructed. On the other hand, we note that we have removed all of SSpro8’s training proteins from our training, validation, and test sets, using BLAST28 for this purpose with e-value 0.01. In this aspect, our method differs from the state-of-the-art backbone angle predictor SPOT-1D, which uses homologous sequences to generate its HMM-based features.
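A minimal sketch of the 8-state one-hot encoding might look as follows; the ordering of the states is an assumption made here only for illustration, since only the consistency of the mapping matters.

```python
import numpy as np

SS8_STATES = "GHITSEBC"   # assumed ordering of the 8 states

def one_hot_ss(ss_string):
    """Encode a predicted 8-state SS string as an (L, 8) one-hot matrix."""
    encoding = np.zeros((len(ss_string), len(SS8_STATES)))
    for i, state in enumerate(ss_string):
        encoding[i, SS8_STATES.index(state)] = 1.0
    return encoding

print(one_hot_ss("CHHHC").shape)   # (5, 8)
```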

For each amino acid, we consider 20 values obtained from the PSSM matrix generated by three iterations of PSI-BLAST28 against the UniRef90 sequence database updated in April 2018. We also use 7PCP (seven physico-chemical properties) and ASA, and experiment with their various combinations. These features are very common in the literature.

In summary, we have \(20 + 8 = 28\) PSSM and SS features, plus 7 and/or 1 additional feature values when 7PCP and/or ASA are used, for each amino acid residue in each protein. This number is then multiplied by the size of the sliding window used. We experiment with sliding windows of sizes 1, 5, 9, 13, 17, and 21, as SPIDER23 tried windows up to size 21.

Predicted outputs

We consider 4 outputs, one for each of the \(\phi\), \(\psi\), \(\theta\), and \(\tau\) angles. Each \(\phi\) and \(\psi\) can be associated with exactly one residue or \(C_\alpha\). A \(\theta\) angle involving \(C_{\alpha _{i-1}}, C_{\alpha _i}, C_{\alpha _{i+1}}\) is associated with \(C_{\alpha _i}\). Similarly, a \(\tau\) angle involving \(C_{\alpha _{i-1}}, C_{\alpha _i}, C_{\alpha _{i+1}}, C_{\alpha _{i+2}}\) is associated with \(C_{\alpha _i}\). In one set of experiments, we consider these angles directly, handling their periodicity (\(-180^\circ\) to \(180^\circ\)) within the loss function of the DNN used. In another set of experiments, just like the state-of-the-art method SPOT-1D, we use both sine and cosine ratios for each of the 4 angles, and thus use 8 outputs. The trigonometric ratios handle the periodicity issue of the angles, and the arctangent of the predicted sine and cosine values gives the predicted angle within \(-180^\circ\) to \(180^\circ\).
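The sketch below illustrates, with assumed function names, the conversion between direct angles and the sine/cosine representation, with the angle recovered via the two-argument arctangent.

```python
import numpy as np

def angles_to_trig(angles_deg):
    """Map angles in degrees to (sin, cos) pairs, avoiding the wrap-around at +/-180."""
    rad = np.deg2rad(angles_deg)
    return np.stack([np.sin(rad), np.cos(rad)], axis=-1)

def trig_to_angles(sin_cos):
    """Recover angles in degrees within (-180, 180] from predicted (sin, cos) pairs."""
    return np.rad2deg(np.arctan2(sin_cos[..., 0], sin_cos[..., 1]))

print(trig_to_angles(angles_to_trig(np.array([-170.0, 45.0, 179.0]))))   # [-170.  45.  179.]
```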

DNN architecture

Figure 3 shows the DNN architecture used in our method. The DNN is in fact a fully connected neural network (FCNN) with three hidden layers, each having 150 neurons. This architecture is similar to that used in SPIDER23 and SPIDER224. SPIDER2, however, uses a series of 3 DNNs, feeding a previous DNN's output as input to the next DNN. In our experiments, we use only one DNN with three hidden layers, although we have also trialled two and four hidden layers and show the results later. The inputs and the outputs of the DNN are on a per amino acid basis. Depending on the size of the sliding window and the combination of 7PCP and ASA used, the input layer has different numbers of inputs. The output layer has one output for each angle when we predict an angle directly. However, if we consider the sine and cosine ratios of an angle and consequently calculate the angle later, then the output layer has two outputs for each angle.

Figure 3

The fully connected deep neural network used in our method. It has three hidden layers, each having 150 neurons. The numbers of inputs and outputs could vary depending on the combinations of features used (e.g. PSSM plus SS and combinations of 7PCP and ASA) and the representation of the output angles (Direct Angles vs Trigonometric Ratios).
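A minimal sketch of such a fully connected network in Keras is given below; the function name and the example input size (window size 5 with PSSM, 8-state SS, and 7PCP features) are illustrative assumptions rather than the exact implementation.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_sap_like_model(num_inputs, num_outputs):
    """A sketch of the fully connected architecture described above.

    num_inputs  : window size x features per residue, e.g. 5 * (20 + 8 + 7) = 175
                  when PSSM, 8-state SS, and 7PCP are used with window size 5
                  (an illustrative assumption).
    num_outputs : 4 for direct angles (phi, psi, theta, tau), or 8 when each
                  angle is represented by its sine and cosine.
    """
    return keras.Sequential([
        keras.Input(shape=(num_inputs,)),
        layers.Dense(150, activation="sigmoid", kernel_initializer="glorot_uniform"),
        layers.Dense(150, activation="sigmoid", kernel_initializer="glorot_uniform"),
        layers.Dense(150, activation="sigmoid", kernel_initializer="glorot_uniform"),
        layers.Dense(num_outputs, activation="linear", kernel_initializer="glorot_uniform"),
    ])

model = build_sap_like_model(num_inputs=175, num_outputs=4)
model.summary()
```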

DNN implementation

The DNN has been implemented in Python using the Keras library and the SGD optimiser with momentum 0.9. The learning rate starts from 0.01 and, if the loss function does not improve in 3 iterations, the learning rate is reduced by a factor of 0.5 until it reaches \(10^{-15}\). The activation function is linear in the output layer and sigmoid in the input and hidden layers. The kernel initialiser is glorot_uniform. We run our programs on NVIDIA Tesla V100-PCIE-32GB machines.
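The sketch below illustrates, under assumptions, how this optimisation setup can be expressed in Keras; the loss shown is the periodic absolute error used for direct-angle outputs (see “Calculating absolute errors”), and the batch size and number of epochs are placeholders.

```python
import tensorflow as tf
from tensorflow import keras

def periodic_mae_loss(y_true, y_pred):
    """Assumed loss for direct-angle outputs: the periodic absolute error
    min(D, 360 - D) with D = |prediction - target|, averaged over outputs."""
    diff = tf.abs(y_true - y_pred)
    return tf.reduce_mean(tf.minimum(diff, 360.0 - diff))

# A stand-in for the fully connected network sketched above.
model = keras.Sequential([
    keras.Input(shape=(175,)),
    keras.layers.Dense(150, activation="sigmoid"),
    keras.layers.Dense(150, activation="sigmoid"),
    keras.layers.Dense(150, activation="sigmoid"),
    keras.layers.Dense(4, activation="linear"),
])

model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              loss=periodic_mae_loss)

# Halve the learning rate when the monitored loss stops improving for
# 3 epochs, down to a floor of 1e-15, as described above.
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                              patience=3, min_lr=1e-15)

# x_train, y_train, x_val, y_val are assumed to be prepared as described in
# "Input features" and "Predicted outputs"; epochs and batch size are placeholders.
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, batch_size=256, callbacks=[reduce_lr])
```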

Benchmark datasets

We briefly describe the dataset used by SPOT-1D27. This dataset has 12450 proteins that were culled from PISCES32 in February 2017 with the constraints of high resolution (\(< 2.5\) Å), R-free \(< 1\), and a sequence identity cutoff of 25% according to BlastClust28. Among those proteins, 1250 proteins deposited after June 2015 were separated into an independent test set, leaving 11200 proteins, which were then randomly divided into a training set (10200 proteins) and a validation set (1000 proteins). Then, some proteins were removed to allow efficient calculation. This reduced the training, validation, and independent test sets to 10029, 983, and 1213 proteins, respectively. In the SPOT-1D dataset, another independent test set was obtained from the PDB. These proteins were released between January 01, 2018 and July 16, 2018 and solved at resolution \(< 2.5\) Å with R-free \(< 0.25\). In order to minimise evaluation bias associated with partially overlapping training data, proteins with \(>25\%\) sequence identity to structures released prior to 2018 were removed. This dataset was also filtered to remove redundancy at a 25% sequence identity cutoff, and another 13 proteins with length \(> 700\) were removed, leaving 250 high-quality, non-redundant targets. For convenience, these two independent test sets were denoted as TEST2016 (1213 proteins) and TEST2018 (250 proteins) as they were deposited between June 2015 and February 2017 and between January 2018 and July 2018, respectively.

We use the same dataset as SPOT-1D27. However, we have performed additional filtering since it is not precisely clear to us how SPOT-1D handles proteins that have mismatches in their amino acid sequences as specified in various data source files (e.g. .t, .pssm, .dssp, and .fasta files). To be clearer, we have found that for some proteins, the amino acid sequence specified in one data source file has additional residues at the beginning or end compared to that specified in another data source file. For such proteins, we have taken the part common to the amino acid sequences specified in the various source files. However, when there is any mismatch in the middle of any two amino acid sequences specified in two different data source files for the same protein, we have removed the protein from the dataset. Also, we have removed proteins that have X in the secondary structure sequences of their corresponding DSSP files, although we do not use the secondary structure data from the DSSP files in our learning model. As mentioned before, apart from using subsets of features from SPOT-1D, we generate 8-state SS predictions using SSpro814. The training set for SSpro8 comprised 5772 proteins released in the PDB before August 20, 2013. In order to avoid over-training with SSpro8 predictions as input to our method, we have removed 3259 proteins from SPOT-1D's proteins using BLAST28 against SSpro8's training set with e-value 0.01. We show in Table 1 the numbers of proteins and residues in the training, validation, and testing datasets after performing the abovementioned filtering. As we can see later in Table 5, the remaining dataset after performing the filtering does not degrade the performance of SPOT-1D.

Table 1 Numbers of proteins and residues in training, validation, and testing datasets.

While our main training and test proteins are from the SPOT-1D dataset, for further independent testing, we use the PDB15034 and CAMEO9335 datasets. The PDB150 dataset contains 150 proteins released between February 1, 2019 and May 15, 2019. For each protein, PSI-BLAST28 was applied against the whole CullPDB32 dataset with e-value smaller than 0.005. The CAMEO93 dataset contains 93 proteins that were released between February 2020 and March 2020 and has been used by OPUS-TASS in its evaluation. For both datasets, we have applied a 25% sequence similarity cutoff w.r.t. our and SSpro8's training and validation datasets and also have removed proteins having X in their fasta files. For proteins with discontinuity in their amino acid sequences, we have considered the largest segment of each protein so that our sliding window method can still be applied. In the end, we have obtained 71 and 55 proteins from the PDB150 and CAMEO93 datasets respectively, and we use them for independent testing of our method and the state-of-the-art method OPUS-TASS and compare their performance.

Results

We compare various settings of SAP to find the best setting for each of the 4 types of angles to be predicted. This comparison helps us understand the impact of various features and encodings. Then, we compare the best settings with the current state-of-the-art predictors. Moreover, we show various other analyses of the results obtained for the best settings.

Calculating absolute errors

For each predicted angle P against the actual angle A, we calculate the difference \(D = |P - A|\). Then, we take \(\mathrm{AE} = \min (D, 360^\circ - D)\) as the absolute error (AE) for that predicted angle. This addresses the periodicity issue that each angle must be in the range \(-180^\circ\) to \(180^\circ\). When angles are predicted directly, we implement the AE calculation within the loss function for training and validation, and also later for testing. When we use sine and cosine ratios, we calculate the AE only during testing. In all cases, the angles that are not defined for the amino acids at the beginning or end of a protein are ignored.
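For evaluation outside the loss function, the same calculation can be sketched in a few lines of Python; the function name is an assumption for illustration.

```python
import numpy as np

def absolute_angle_error(pred_deg, true_deg):
    """Periodic absolute error between angles in [-180, 180] degrees:
    AE = min(D, 360 - D) with D = |P - A|."""
    d = np.abs(np.asarray(pred_deg) - np.asarray(true_deg))
    return np.minimum(d, 360.0 - d)

# A prediction of 179 degrees against an actual angle of -179 degrees is only
# 2 degrees off once periodicity is taken into account.
print(absolute_angle_error([179.0], [-179.0]))        # [2.]
print(absolute_angle_error([30.0], [10.0]).mean())    # 20.0 (MAE over one angle)
```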

Determining best settings

We run 96 settings of SAP. All of these settings have 20 PSSM and 8 SS one-hot features. The 96 settings are obtained by using or not using ASA, by using or not using 7PCP, by using range-based or Z-score based normalisation for input feature encoding, by using 6 window sizes (1, 5, 9, 13, 17, 21), and by using direct angles or trigonometric ratios to encode the output angles. However, Table 2 presents the performance of 16 settings only, selecting the best window size for each combination of the other parameters. From these results, it appears that window sizes 5 and 9 in most cases lead to better performance. Moreover, prediction of direct angles is better than that of trigonometric ratios. While not using ASA appears to be better than using it, using 7PCP appears to be better than not using it. Overall, the best SAP setting uses 7PCP, range-based normalisation, direct angle prediction, and window size 5. Henceforth, we use this setting in further analysis.
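For reference, the two input normalisation options compared here can be sketched as follows; fitting the column statistics on the training set only, and the small epsilon guarding against constant columns, are assumptions made for illustration.

```python
import numpy as np

def range_normalise(x, col_min, col_max):
    """[0, 1] range-based normalisation with column-wise minima and maxima
    (assumed here to be computed on the training set)."""
    return (x - col_min) / (col_max - col_min + 1e-12)

def zscore_normalise(x, col_mean, col_std):
    """Z-score normalisation with training-set column means and standard deviations."""
    return (x - col_mean) / (col_std + 1e-12)

train = np.random.rand(1000, 35) * 10    # toy training feature matrix
test  = np.random.rand(10, 35) * 10
print(range_normalise(test, train.min(axis=0), train.max(axis=0)).shape)    # (10, 35)
print(zscore_normalise(test, train.mean(axis=0), train.std(axis=0)).shape)  # (10, 35)
```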

Table 2 Performance of SAP settings on 1206 testing proteins. In the table, column ASA denotes whether accessible surface area is used (Yes/No), column 7PCP denotes whether 7 physicochemical properties are used (Yes/No), column OR denotes output representation is in direct angles (D) or trigonometric ratios (R), column NM denotes normalisation method for input feature encoding is [0,1] range based (R) or Z-score based (Z), WS denotes the best size of the sliding window. Note that the emboldened cells denote the best performance for each combination of ASA and 7PCP while the boxed plus emboldened cells in each respective column denote the best performance among all SAP settings.

It is worth noting here that in our observation, training a DNN simultaneously for several outputs is not much different from training the DNN separately for each output in terms of the accuracy level obtained for each output.

All results presented in Table 2 are for DNNs having 3 hidden layers. The choice of the number of layers was inspired by SPIDER23. However, in Table 3, we show the performance of the best SAP setting when run with DNNs having 2 and 4 hidden layers. In most cases DNNs having 3 hidden layers obtain the best results (shown in bold in Table 3); where this is not the case, DNNs with three hidden layers are a close second (shown in italics in Table 3), with the difference being < 0.09. So for the rest of the paper, we have chosen the DNN with 3 hidden layers as the selected SAP setting.

Table 3 Performance of the best SAP setting when the numbers of hidden layers in the DNNs are varied.

Performing cross-validation

When we train a DNN, we specify the validation set. Consequently, the MAE values for the validation set as well as for the testing set for each SAP setting are shown in Table 2. In Table 4, we again show the MAE values but only for the best setting of SAP. However, to check the robustness of SAP, we perform 10-fold cross-validation, where the training and validation sets are merged. The merged proteins are then randomly divided into 10 folds. Then, 9 out of 10 folds are used in turn for training while the remaining one is used for testing. Table 4 shows the MAE value and the standard deviation of MAEs (SDMAE) for each type of angle to be predicted. As one can see, the small differences between MAE values and the small SDMAE values observed in the table show the consistency and robustness of SAP.
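A minimal sketch of this protocol is shown below; the `train_and_evaluate` helper is hypothetical and stands for training a fresh SAP model on one set of proteins and returning its per-angle MAEs on another.

```python
import numpy as np
from sklearn.model_selection import KFold

def ten_fold_cv(proteins, train_and_evaluate, seed=0):
    """Merge the training and validation proteins, split them into 10 random
    folds, and use 9 folds for training and 1 for testing in turn.

    proteins           : list of per-protein (features, angles) pairs.
    train_and_evaluate : hypothetical helper that trains a fresh model on the
                         first argument and returns per-angle MAEs on the second.
    """
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    maes = []
    for train_idx, test_idx in kfold.split(proteins):
        train = [proteins[i] for i in train_idx]
        test = [proteins[i] for i in test_idx]
        maes.append(train_and_evaluate(train, test))
    maes = np.array(maes)
    return maes.mean(axis=0), maes.std(axis=0)   # per-angle MAE and SDMAE
```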

Table 4 Average performance of the best setting of SAP after 10-fold cross validation is performed.

Comparison with state-of-the-art predictors

We mainly compare the performance of SAP with that of SPIDER224, SPOT-1D27, and OPUS-TASS6 in Table 5. We have run these systems on the testing dataset used in this work, which is a subset of the SPOT-1D dataset because of our more rigorous filtering. Moreover, we use the 71 and 55 proteins from the PDB15034 and CAMEO9335 datasets obtained after performing the filtering mentioned before. However, we also compare SAP's performance with that of SPIDER2, SPOT-1D, and OPUS-TASS as reported in the respective publications. Below we briefly describe SPIDER2, SPOT-1D, and OPUS-TASS.

  1. SPIDER2 is similar to SAP in that both use a similar FCNN and similar features. SPIDER2 uses three DNNs of its precursor SPIDER23 in a series where the output of a previous DNN is fed as input to the next DNN in the series. Like SAP, SPIDER uses an FCNN with 3 hidden layers, each with 150 neurons. However, SPIDER uses a stacked sparse auto-encoder for weight initialisation and 0-1 range normalisation for input values. SPIDER's input features are PSSM, 3-state predicted SS, ASA, and 7PCP, and the outputs are represented by trigonometric ratios. The window size is 21 in SPIDER and 17 in SPIDER2. SPIDER and SPIDER2 use the PISCES32 dataset, which has 5840 proteins.

  2. SPOT-1D is a recent protein backbone angle predictor. It uses an ensemble of 9 long short-term memory (LSTM) bidirectional recurrent neural networks (BRNNs) and Residual Networks (ResNets). SPOT-1D's input features are PSSM, Hidden Markov Model (HMM) profiles, 7PCP, and contact maps. SPOT-1D obtains its predicted contact maps from SPOT-Contact33. SPOT-1D then uses windowing of the predicted contact maps. Further, SPOT-1D generates HMM profiles that include information about homologous sequences. For this, SPOT-1D uses HHBlits31 with the Uniprot sequence profile database from October 2017. SPOT-1D's inputs are mapped into the range [0, 1] and the outputs are represented by trigonometric ratios. SPOT-1D's dataset is a superset of SAP's dataset.

  3. OPUS-TASS is the current state-of-the-art protein backbone angle predictor and predicts \(\phi\) and \(\psi\) only. Its architecture consists of CNN layers, LSTM layers, and Transformer37 layers. It uses an input feature named PSP1938, which classifies the 20 residue types into 19 rigid-body blocks depending on their local structures. It also introduces a new constrained/output feature named CSF339, which is a local backbone structure descriptor. Further, it uses a multi-task learning strategy40 to maximise the generalisation of the neural network and an ensemble of neural networks for further improvement.

Since SPOT-1D and OPUS-TASS report their performance on two subsets of the testing proteins, namely TEST2016 and TEST2018, we do the same, although we also show the accumulated results for all testing proteins. Notice from the table that SAP significantly outperforms both SPOT-1D and OPUS-TASS in all cases. We have performed t-tests to compare the performances of SPOT-1D and OPUS-TASS with SAP, and the p values are \(< 0.01\) in all cases, indicating that the differences are statistically significant. The differences are particularly large for \(\psi\) and \(\tau\). These results demonstrate the effectiveness of SAP in enhancing protein backbone angle prediction accuracy.
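As an illustration of such a significance test, the sketch below runs a paired t-test over per-protein MAE values with SciPy; the pairing over proteins and the toy numbers are assumptions made only for illustration, not the reported results.

```python
import numpy as np
from scipy import stats

# Per-protein MAE values of SAP and a competing method on the same testing
# proteins (toy numbers for illustration only).
sap_mae   = np.array([18.2, 21.5, 17.9, 25.3, 19.8])
other_mae = np.array([22.4, 24.1, 20.6, 28.9, 23.0])

# Paired t-test over the same proteins (the pairing is an assumption).
t_stat, p_value = stats.ttest_rel(sap_mae, other_mae)
print(t_stat, p_value)
```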

Although our main results are in Table 5, to test the generality of SAP's performance on other datasets, we have run SAP on the 71 proteins of the PDB150 dataset and the 55 proteins of the CAMEO93 dataset. In Table 6, we also compare SAP's performance with SPOT-1D's performance on the PDB150 proteins and with OPUS-TASS's performance on the CAMEO93 proteins. The performances of the various methods are rather mixed here. We have performed t-tests to compare the performances of SPOT-1D and OPUS-TASS with SAP, and the p values are \(< 0.05\) in all cases, indicating that the differences are statistically significant.

Table 5 Performances of SPIDER2, SPOT-1D, SAP, and OPUS-TASS on our testing dataset and its subsets TEST2016 and TEST2018. The emboldened values are the winning numbers for the corresponding types of angles and datasets. OPUS-TASS does not predict \(\theta\) and \(\tau\) angles while the other three methods predict all four types of angles.
Table 6 Performances of SPIDER2, SPOT-1D, OPUS-TASS, and SAP on filtered PDB150 and CAMEO93 proteins. The emboldened values are the winning numbers for the corresponding types of angles and datasets. OPUS-TASS does not predict \(\theta\) and \(\tau\) angles while the other three methods predict all four types of angles.

Comparison on protein length groups

In Table 7, we compare the performance of SAP, OPUS-TASS, SPOT-1D, and SPIDER2 when our testing proteins are grouped based on their lengths, i.e. the number of amino acids each protein has. This is to observe how SAP's performance varies as the protein length increases. From the table, we see that for all four types of angles, SAP's prediction accuracy gradually decreases, with minor exceptions, as the protein length increases. When protein lengths are 300 or below (with a minor exception for \(\theta\)), the MAE values are less than the overall MAE values, and for protein lengths above 300, the MAE values are greater than the overall MAE values. From the \(\Delta\)MAE values (i.e. how far from SAP's MAE) of OPUS-TASS, SPOT-1D, and SPIDER2, we see that the performance difference increases with protein length, which essentially means that, relative to OPUS-TASS, SPOT-1D, and SPIDER2, SAP's performance gets better as proteins get longer.

Table 7 Performance of SAP, OPUS-TASS, SPOT-1D, and SPIDER2 when our testing proteins are grouped based on their lengths. In the table, \(\Delta\)MAE of a system (e.g. OPUS-TASS, SPOT-1D or SPIDER2) is its MAE minus the MAE of SAP. As such, the greater the value of \(\Delta\)MAE, the worse the performance of the system w.r.t. the performance of SAP.

Comparison on secondary structure groups

Table 8 (Left) shows the residue distribution over the testing proteins when the residues are grouped by their SS types. Types C, E, H, S, and T are the most frequent groups. Figure 4 (Top Four) shows the MAE values of SAP, OPUS-TASS, SPOT-1D, and SPIDER2 when the residues are grouped by their SS types. From the charts, the frequent SS type H appears to have the best MAE values, while the other frequent SS types C, E, and S have significantly worse MAE values than the overall MAE values.

Table 8 Residue distribution over the testing proteins when residues are grouped on their (Left) SS and (Right) AA types. Also, on the left, typical ranges suggested for the torsion angles \(\phi\) and \(\psi\) for various secondary structures41.
Figure 4

Performance of SAP, OPUS-TASS, SPOT-1D, SPIDER2 on the testing proteins when residues are grouped based (Top Four) on their SS types and (Bottom Four) on their AA types. In the charts, y-axis shows MAE values and x-axis shows SS or AA types. The dashed horizontal line in each chart shows the overall MAE value for SAP.

Comparison on amino acid groups

Table 8 (Right) shows the residue distribution over the testing proteins when the residues are grouped by their AA types. Types A, D, E, G, I, K, L, P, R, S, T, and V are the most frequent groups, each having at least 4.5% of the residues. Figure 4 (Bottom Four) shows the MAE values of SAP, OPUS-TASS, SPOT-1D, and SPIDER2 when the residues are grouped by their AA types. From the charts, the frequent AA types A, E, I, and L appear to have the best MAE values for all 4 types of angles. Among the other AA types, C, D, and G have worse MAE values than the overall MAE values for some types of angles.

Using angle ranges from predicted secondary structures

Given the SS predictions and their suggested ranges of \(\phi\) and \(\psi\) values as shown in Table 8 (Left), particularly for helices (G, H, I) and sheets (B, E), one might just use the mid values of the respective ranges as the predicted values and expect an MAE of about 10 for the respective SS type. When we do that for the residues that belong to SS types G, H, and I, we get MAE values of 27.71, 9.12, and 22.04 respectively for \(\phi\), and 18.71, 8.83, and 21.17 for \(\psi\). In contrast, the MAE values for SAP predictions are 12.40, 5.43, and 11.34 respectively for \(\phi\) for SS types G, H, and I, and 16.08, 6.40, and 15.16 for \(\psi\). The situation worsens for sheets, i.e. SS types B and E. These results clearly show that just achieving higher accuracy in SS prediction would not be sufficient for backbone angle prediction.

Comparison of angle distributions

Figure 5 shows the distributions of the actual angles and of the predicted values obtained from SAP, OPUS-TASS, SPOT-1D, and SPIDER2. As we can see from the charts, the distribution of the values predicted by SAP aligns very well with the distribution of the actual values. The peaks and troughs of the distributions align quite well; even multiple peaks and troughs are captured well. While the peaks of the predicted distributions are larger and narrower than those of the actual distributions, the troughs of the predicted distributions are rather smaller and wider than those of the actual distributions. When SAP's curves are compared with OPUS-TASS's, SPOT-1D's, and SPIDER2's, we see that SAP's curves are occasionally closer to the curves for the actual values. We also see that the distributions of the \(\phi\) and \(\psi\) angles for OPUS-TASS and SPOT-1D are almost identical. Notice that the largest peaks of the predicted values are higher than the largest peaks of the actual values. One noticeable fact is in the \(\theta\) chart: there are actual values between 0 and 90, although with almost zero probability, and these values are not well captured by the predictors. Overall, there is a tendency to predict the peak values with probabilities larger than those of the actual values.

Figure 5

Distributions of actual angles of testing proteins and predictions of SAP, OPUS-TASS, SPOT-1D, and SPIDER2.

Protein structure generation and refinement

Figure 6

RMSD values for SAP, SPOT-1D, and OPUS-TASS on TEST2018 proteins.

Given the improvement in angle prediction accuracy, an interesting question is as follows: “Can predicted angles be directly employed in building accurate protein structures?” The direct answer to this question is yes, if we reach a very high accuracy level. Gradually enhancing the performance to a level at which protein structures can be predicted with very high accuracy is indeed the aim of this line of research, but it is very challenging. Given the 27 proteins in our TEST2018 set, we have tried to generate entire protein structures from the predicted values obtained from SAP, OPUS-TASS, and SPOT-1D, assuming \(\omega =180^\circ\) and standard bond distances. From Fig. 6, we can see very high root mean square distance (RMSD) values for most proteins; only for 2–3 proteins are the RMSD values less than 6 Å, a distance considered to be practically meaningful. Although this is the case for direct structure generation, structure refinement via ab initio structure sampling and evaluation using perturbation techniques would still obtain significant help. This is because, given a prediction \(\rho\) and an estimated error \(\epsilon\), with some level of certainty, one can focus the search within the region \([\rho -\epsilon , \rho +\epsilon ]\). These soft constraints can thus reduce the search space significantly. With more dihedral angles of more proteins predicted with smaller absolute errors, ab initio or refinement search for protein structures would benefit more from SAP's predictions than from OPUS-TASS's or SPOT-1D's.
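For reference, the sketch below shows one standard way (the NeRF construction) to grow a backbone from predicted \(\phi\)/\(\psi\) angles with \(\omega\) fixed at \(180^\circ\); the bond lengths and angles are approximate textbook values and the placement of the first three atoms is arbitrary, so this is an illustrative sketch rather than the exact procedure used for Fig. 6.

```python
import numpy as np

def place_atom(a, b, c, bond_length, bond_angle_deg, torsion_deg):
    """Place atom D so that |C-D| = bond_length, angle(B,C,D) = bond_angle, and
    dihedral(A,B,C,D) = torsion (the standard NeRF construction)."""
    theta, chi = np.deg2rad(bond_angle_deg), np.deg2rad(torsion_deg)
    bc = (c - b) / np.linalg.norm(c - b)
    n = np.cross(b - a, bc)
    n /= np.linalg.norm(n)
    m = np.cross(n, bc)
    d_local = np.array([-bond_length * np.cos(theta),
                        bond_length * np.sin(theta) * np.cos(chi),
                        bond_length * np.sin(theta) * np.sin(chi)])
    return c + d_local[0] * bc + d_local[1] * m + d_local[2] * n

def build_backbone(phi_deg, psi_deg, omega_deg=None):
    """Grow N, CA, C atoms residue by residue from phi/psi (and omega, fixed at
    180 degrees by default). Bond lengths and angles are approximate standard
    values assumed here for illustration."""
    n_res = len(phi_deg)
    if omega_deg is None:
        omega_deg = [180.0] * n_res
    b_len = {"N-CA": 1.458, "CA-C": 1.525, "C-N": 1.329}
    b_ang = {"N-CA-C": 111.0, "CA-C-N": 116.6, "C-N-CA": 121.9}
    # Arbitrary (rough) placement of the first residue's N, CA, C atoms.
    coords = [np.array([0.0, 0.0, 0.0]),                    # N_1
              np.array([b_len["N-CA"], 0.0, 0.0]),          # CA_1
              np.array([b_len["N-CA"] + 0.5, 1.4, 0.0])]    # C_1
    for i in range(1, n_res):
        n_prev, ca_prev, c_prev = coords[-3], coords[-2], coords[-1]
        # N_i set by psi of residue i-1, CA_i by omega, C_i by phi of residue i.
        n_i = place_atom(n_prev, ca_prev, c_prev, b_len["C-N"],
                         b_ang["CA-C-N"], psi_deg[i - 1])
        ca_i = place_atom(ca_prev, c_prev, n_i, b_len["N-CA"],
                          b_ang["C-N-CA"], omega_deg[i - 1])
        c_i = place_atom(c_prev, n_i, ca_i, b_len["CA-C"],
                         b_ang["N-CA-C"], phi_deg[i])
        coords += [n_i, ca_i, c_i]
    return np.array(coords)   # shape (3 * n_res, 3)

backbone = build_backbone(phi_deg=[-60.0] * 10, psi_deg=[-45.0] * 10)
print(backbone.shape)   # (30, 3)
```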

Comparison on correct prediction per protein

Following the discussion regarding structure generation and refinement, we compare SAP, OPUS-TASS, and SPOT-1D on what portions of the angles of the proteins are predicted within certain error levels. Figure 7 shows the percentages of proteins that have a given percentage of particular angles with absolute errors at most a given threshold. We choose the threshold values to be 6 and 18 in the charts. Notice that SPOT-1D's and OPUS-TASS's performances are very close in the charts for \(\phi\) and \(\psi\). Moreover, SAP outperforms the other methods for all angles at all threshold levels.
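The per-protein quantity plotted in Fig. 7 can be sketched as follows; the function names and the toy error values are assumptions made only for illustration.

```python
import numpy as np

def fraction_within_threshold(abs_errors, threshold):
    """Fraction of a protein's residues whose absolute angle error is at most
    the given threshold (6 or 18 degrees in Fig. 7)."""
    return np.mean(np.asarray(abs_errors) <= threshold)

def percent_proteins_at_least(per_protein_errors, threshold, residue_percent):
    """Percentage of proteins for which at least `residue_percent` per cent of
    residues have AE <= threshold, i.e. one point on a curve in Fig. 7."""
    fracs = [fraction_within_threshold(e, threshold) for e in per_protein_errors]
    return 100.0 * np.mean(np.array(fracs) >= residue_percent / 100.0)

# Example with two toy proteins' phi errors.
errors = [np.array([2.1, 5.7, 30.4, 11.2, 4.0]), np.array([1.0, 7.5, 3.2, 19.0])]
print(percent_proteins_at_least(errors, threshold=6, residue_percent=50))   # 100.0
```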

Figure 7

Percentages of proteins (y-axis) that have a given percentage of residues (x-axis) with AE at most a given threshold T, where T is 6 or 18, denoted by T6 and T18. The lower the threshold, the stricter the measure of prediction quality.

Conclusions

Input features and neural network architectures interact with each other when employed in prediction systems. Consequently, simply including more features might cause clutter, and more complex networks might then be needed to counterbalance the noise. In protein backbone angle prediction research, the existing state-of-the-art prediction methods use ensembles of several types of deep neural networks and a large number of features. In this paper, we present simpler deep neural network models for protein backbone angle prediction. Our models use fewer features and simpler neural networks, but on a standard benchmark dataset they obtain significantly better mean absolute errors than the state-of-the-art predictors. Our program named Simpler Angle Predictor (SAP) along with its data is available from the website https://gitlab.com/mahnewton/sap.