Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields

Protein secondary structure (SS) prediction is important for studying protein structure and function. When only sequence (profile) information is used as the input feature, the best current predictors obtain ~80% Q3 accuracy, a figure that has not improved in the past decade. Here we present DeepCNF (Deep Convolutional Neural Fields) for protein SS prediction. DeepCNF is a deep learning extension of Conditional Neural Fields (CNF), which integrate Conditional Random Fields (CRF) with shallow neural networks. DeepCNF models not only the complex sequence-structure relationship, through a deep hierarchical architecture, but also the interdependency between adjacent SS labels, so it is much more powerful than CNF. Experimental results show that DeepCNF obtains ~84% Q3 accuracy, ~85% SOV score, and ~72% Q8 accuracy on the CASP and CAMEO test proteins, greatly outperforming currently popular predictors. As a general framework, DeepCNF can be used to predict other protein structure properties such as contact number, disorder regions, and solvent accessibility.


S.1 Training on the JPred dataset
To show that our method does not overfit and to compare it with JPred, we trained seven DeepCNF models on the JPred training data (http://www.compbio.dundee.ac.uk/jpred4/about.shtml), which contains 1338 training proteins. We divided this dataset into 7 groups according to the JPred cross-validation sets and each time used 6 of them to train one DeepCNF model.
We use the same model architecture and hyperparameters as determined by our previous method. That is, our DeepCNF model has 5 hidden layers, each with 100 neurons. The window size is set to 11, and the regularization factor is set to 50.
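
The 7-group training scheme can be sketched as below. This is only an illustration: the protein IDs are placeholders, the round-robin group assignment stands in for the actual JPred cross-validation sets, and we assume the group left out of training is the one used for evaluation. The names `leave_one_group_out` and `HYPERPARAMS` are hypothetical and do not come from the DeepCNF code.

```python
# Hyperparameters as stated in the text (the key names are our own).
HYPERPARAMS = {
    "hidden_layers": 5,
    "neurons_per_layer": 100,
    "window_size": 11,
    "regularization": 50,
}


def leave_one_group_out(groups):
    """Yield (train, held_out) pairs, leaving out one group at a time."""
    for k, held_out in enumerate(groups):
        # Training set = every protein from the other groups.
        train = [p for j, g in enumerate(groups) if j != k for p in g]
        yield train, held_out


# Placeholder IDs; the real assignment follows the JPred cross-validation sets.
protein_ids = [f"prot{i:04d}" for i in range(1338)]
groups = [protein_ids[k::7] for k in range(7)]

splits = list(leave_one_group_out(groups))
for train, held_out in splits:
    # One DeepCNF model would be trained on `train` with HYPERPARAMS here.
    print(len(train), len(held_out))
```

Each of the seven iterations trains on six groups, so every protein is left out of training exactly once.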

S.8 Segment OVerlap (SOV) Score
The Segment Overlap score (SOV), proposed by Zemla et al. (Zemla, Venclovas, Fidelis, & Rost, 1999), measures the overlap between observed and predicted secondary structure segments rather than per-residue accuracy. Predictions that have high per-residue accuracy but deviate from the experimental segment length distributions receive lower SOV scores (Im, 2008). The SOV score ranges from 0 to 1, with 1 indicating perfect overlap.
A brief description of SOV, following (Zemla, et al., 1999), is as follows. To calculate SOV, the (predicted) secondary structure of a protein sequence is parsed into segments such that each segment has a single secondary structure type. Let $S_1$ be the observed secondary structure and $S_2$ the predicted secondary structure. For each type $i \in \{H, E, C\}$, $S(i)$ is the set of segment pairs $(s_1, s_2)$ of type $i$, where $s_1$ is from $S_1$, $s_2$ is from $S_2$, and $s_1$ and $s_2$ overlap in at least one residue. That is, $S(i) = \{(s_1, s_2) : s_1 \cap s_2 \neq \emptyset \text{ and } s_1, s_2 \text{ have type } i\}$. In contrast, $S'(i) = \{s_1 : \forall s_2,\ s_1 \cap s_2 = \emptyset \text{ and } s_1, s_2 \text{ have type } i\}$.
Then the segment overlap score between S1 and S2 is calculated as follows.
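
For reference, the definition published by Zemla et al. (1999), rescaled to the 0 to 1 range used here, is $\mathrm{SOV} = \frac{1}{N} \sum_{i} \sum_{(s_1, s_2) \in S(i)} \frac{\mathrm{minov}(s_1, s_2) + \delta(s_1, s_2)}{\mathrm{maxov}(s_1, s_2)} \cdot \mathrm{len}(s_1)$, with $N = \sum_{i} \big[ \sum_{S(i)} \mathrm{len}(s_1) + \sum_{S'(i)} \mathrm{len}(s_1) \big]$ and $\delta(s_1, s_2) = \min\{\mathrm{maxov} - \mathrm{minov},\ \mathrm{minov},\ \lfloor \mathrm{len}(s_1)/2 \rfloor,\ \lfloor \mathrm{len}(s_2)/2 \rfloor\}$, where minov is the length of the actual overlap of $s_1$ and $s_2$ and maxov is the total extent covered by the two segments. The Python sketch below is one way to compute this under those definitions; the function names `segments` and `sov` and the three-letter state alphabet are our own illustrative choices, not code from DeepCNF.

```python
from itertools import groupby


def segments(ss):
    """Parse an SS string into (type, start, end) runs; end is exclusive."""
    out, pos = [], 0
    for state, run in groupby(ss):
        n = len(list(run))
        out.append((state, pos, pos + n))
        pos += n
    return out


def sov(observed, predicted, states="HEC"):
    """Segment overlap score in [0, 1] between two equal-length SS strings."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    obs_segs, pred_segs = segments(observed), segments(predicted)
    total, norm = 0.0, 0
    for state in states:
        obs = [s for s in obs_segs if s[0] == state]
        pred = [s for s in pred_segs if s[0] == state]
        for _, a1, b1 in obs:
            len1 = b1 - a1
            matched = False
            for _, a2, b2 in pred:
                minov = min(b1, b2) - max(a1, a2)  # actual overlap length
                if minov <= 0:
                    continue  # no shared residue, so the pair is not in S(i)
                matched = True
                maxov = max(b1, b2) - min(a1, a2)  # total extent of the pair
                len2 = b2 - a2
                delta = min(maxov - minov, minov, len1 // 2, len2 // 2)
                total += (minov + delta) / maxov * len1
                norm += len1  # contribution of S(i) to N
            if not matched:
                norm += len1  # contribution of S'(i) to N
    # norm is 0 only if no observed residue belongs to `states`.
    return total / norm if norm else 0.0


perfect = sov("HHHHCCCEEEE", "HHHHCCCEEEE")
split_helix = sov("CCHHHHHHCC", "CCHHHCHHCC")
print(perfect, split_helix)
```

In the second call the prediction matches 9 of 10 residues but breaks the observed helix into two segments, so its SOV falls well below its per-residue accuracy of 0.9.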