Prediction of structural features and application to outer membrane protein identification

Protein three-dimensional (3D) structures provide insightful information in many fields of biology. One-dimensional properties derived from 3D structures such as secondary structure, residue solvent accessibility, residue depth and backbone torsion angles are helpful to protein function prediction, fold recognition and ab initio folding. Here, we predict various structural features with the assistance of neural network learning. Based on an independent test dataset, protein secondary structure prediction generates an overall Q3 accuracy of ~80%. Meanwhile, the prediction of relative solvent accessibility obtains the highest mean absolute error of 0.164, and prediction of residue depth achieves the lowest mean absolute error of 0.062. We further improve the outer membrane protein identification by including the predicted structural features in a scoring function using a simple profile-to-profile alignment. The results demonstrate that the accuracy of outer membrane protein identification can be improved by ~3% at a 1% false positive level when structural features are incorporated. Finally, our methods are available as two convenient and easy-to-use programs. One is PSSM-2-Features for predicting secondary structure, relative solvent accessibility, residue depth and backbone torsion angles, the other is PPA-OMP for identifying outer membrane proteins from proteomes.


Results
A pipeline of our methods. A pipeline of our methods was constructed and is clearly presented in Fig. 1. The pipeline consists of three modules: (1) prediction of structural features, (2) identification of OMPs, and (3) modeling of 3D structures for potential OMPs. In the first step, a query sequence is threaded by PSI-BLAST through the NCBI NR database for three iterations with an e-value threshold of Figure 1. A pipeline of our methods. The pipeline consists of three modules, prediction of structural features, identification of OMPs, and modeling of 3D structures for potential OMPs. First, a target protein is iteratively threaded through the local NCBI NR database for three iterations to generate sequence profiles. Profiles are then fed into the trained neural networks to predict structurally features. Second, the target protein is searched against an OMP sequence database by using PPA-OMP with a scoring function incorporating sequence profiles and predicted structural features. The target protein is judged to be an OMP or not by the significance of the top alignment. Third, the target protein is searched against a structurally known OMP database by PPA-OMP program if the target protein is predicted to be an OMP. The 3D structural models of the target are built using the alignment by PPA-OMP with the assistance of MODELLER 23 program. Because PPA-OMP is used to search a sequence database and a structurally known database in this pipeline, PPA-OMP is used twice in this flow chart.
0.001 to generate sequence profiles. Then, structural features can be generated by PSSM-2-Features with the sequence profiles provided as input. In the second step, PPA-OMP local alignment method 22 is used to search the query sequence against an OMP sequence database, and the query sequence is determined to be an OMP or not by top alignments. Finally, the query sequence is searched against a structurally known OMP database. The 3D models for the query sequence are built using PPA-OMP alignment with the assistance of MODELLER 23 if it is predicted to be an OMP. It should be pointed out here that global alignment is used to align as many residues as possible in the 3D model building.

Comparison of protein secondary structures calculated by different methods.
Protein secondary structures were derived from PDB structures. Nevertheless, defining secondary structures from PDB structures is not an exact process due to the fact that different programs have their own definitions. In fact, different programs have their strengths in assigning α -helix, β -strand or coil. Therefore, using different programs to derive the secondary structures in the benchmark dataset always has a certain level of bias in the results. The contents of secondary structures on SCOPe_TEST1073 by the DSSP and STRIDE programs were summarized in Fig. 2. As shown the Figure, the most proportion of assignments was coil by the two programs. STRIDE defined slightly more α -helix and β -strand than that of DSSP (0.397 versus 0.379, 0.214 versus 0.204), while DSSP defined more coil than STRIDE (0.415 versus 0.388). Overall, the two programs had 94.7% agreement in their assignments.
Overall performance of PSSM-2-Features. Protein secondary structure prediction. Our method (PSSM-2-Features) was trained on PDB_TRAIN6675 dataset and tested on SCOPe_TEST1073 dataset. Table 1 shows the results of protein secondary structure prediction on the SCOPe_TEST1073 dataset and cross-validation result on PDB_CS6001 dataet. We relied mainly on the SCOPe_TEST1073 dataset to assess different methods.
Among the four measures (Q 3 , Q H , Q E and Q C ), Q 3 is the most comprehensive parameter to assess the performance of secondary structure prediction. The Q 3 accuracy of PSSM-2-Features is slightly lower than 80%. When protein secondary structure was assigned by STRIDE, the Q H , Q E and Q C values of PSSM-2-Features are 0.869, 0.728 and 0.764, respectively. Similar results were obtained when protein secondary structure was assigned by DSSP.
It should be noted that overfeeding of models is avoided by removing similar sequences from training datasets. For example, protein sequences in the PDB_TRAIN6675 dataset were not similar to the sequenes in the SCOPe_TEST1073 dataset at the sequence level (BLAST e-value > 0.001). Similarly, the identity between any two sequences is lower than 30% in PDB_CS6001 dataset.
Phi angle, relative solvent accessibility and residue depth prediction. Table 2 shows the input features and the optimized window sizes for each structural property in the PSSM-2-Features. The fitness of amino acids in three types of secondary structures is given in Table S1 (supplementary file 1) and used in protein secondary structure prediction of the PSSM-2-Features. The mean absolute errors (MAE) of RSA, RD and Phi predictions were summarized in Table 3. Of the four structural features, RD gave the lowest error value (MAE = 0.062). RSA was the most difficult to be predicted and we obtained a MAE value of 0.164 on SCOPe_TEST1073 dataset. This result may suggest that solvent accessibility is probably less conserved than other properties (e.g., secondary structure) in the protein families, which is consistent with that reported by ROST and Sander 24 . On the other hand, the Pcc scores of RD, Phi and RSA predictions are 0.597, 0.546 and 0.690 on SCOPe_TEST1073 dataset. It is interesting to learn that the Pcc value of RSA is higher than that by RD and Phi although the MAE value of RSA is not better than them. To further investigate the results, the Pcc scores between predicted and actual properties (i.e., RSA and RD) were calculated. The distributions of the Pcc scores for the proteins on the SCOPe_TEST1073 dataset for RSA and RD are shown in Figs 3 and 4, respectively. Overall speaking, the results by our predictors are relatively accurate. The reasons of our algorithm's effectiveness relying on the factors as: (1) the input features (e.g., PSSM) are informative, and (2) parameters of neural networks4 are highly optimized. Here, we can learn that the structural features are strongly related to evolutionary information (i.e., PSSM profile). Furthermore, we used SS_RI measure (i.e., Eq 12) to estimate the reliability of protein secondary structure prediction for each residue. In our benchmark result, if the score of SS_RI > 0.35, it yields a predicted result with a false positive rate of less than 1%.
Comparison with state-of-the-art methods. Protein secondary structure prediction. PSSpred, Psipred, SABLE and SPINE-X for protein secondary structure prediction were installed in our local computers  Table 2. Input features and optimized window sizes for the training of structural properties. a There are one or two numbers in the column of number of hidden layers. If there are two numbers in, the two numbers are nodes in the first and second networks. Generally speaking, we use the second neural network to refine the prediction by the first neural network. b PSSM, PSFM, FT and CS stand for position-specific scoring matrix, position-specific frequency matrix, amino acid's fitness score to secondary structure and conservation score, respectively. and test proteins were directly fed into them. Interestingly, all methods tested here are NN-based predictors. Again, we relied mainly on the SCOPe_TEST1073 dataset to assess different methods. When protein secondary structures were assigned by the STRIDE program, PSSpred, Psipred and SPINE-X resulted in Q 3 accuracies greater than 80%. SABLE and PSSM-2-Features generated Q 3 accuracies lower than 80%. In contrast to other methods, the Q E of the PSSpred is very high (74.6%), and this is probably because the PSSpred used seven neural networks to make a consensus prediction. When protein secondary structures were assigned by the DSSP program, the PSSpred generated a Q 3 accuracy slightly higher than 80%, while the Q 3 scores of Psipred, SPINE-X and PSSM-2-Features were slightly lower than 80%. Similarly, the Q E of the PSSpred is the highest (75.1%) compared with other three methods. It is quite clear that the benchmark results were slightly different when different programs were used to derive protein secondary structure. Similar results were obtained when different methods were tested on PDB_CS6001 dataset.   The Q E scores of the Psipred, the SPINE-X and PSSM-2-Features are relatively lower when compared with Q H and Q C and this can be ascribed to the fact that the formation of β -strand is strongly influenced by long-range interactions 25 , which is hard to be predicted. In 1998, Gromiha and Selvaraj reported a similar result and they found the prediction of all-α proteins was better than that of all-β proteins 26 .
Phi angle, relative solvent accessibility and residue depth prediction. SPINE-X can also be used to predict phi angle. As reported by Singh et al. 27  As to the RD prediction, we can not directly compare it with other methods due to the unavailability of their standalone programs and web servers 8,28,29 . Here, we compared their performance using the data reported in the literatures. Yuan-Wang method 8

RSA versus RD.
We also investigated the correlation between RSA and RD (Fig. 5). RSA measures to what extent an amino acid is accessible to a solvent while RD measures how deeply a given residue is buried. It is not surprising to learn that the Pcc score between RSA and RD is − 0.574, which suggests RSA is negatively correlated and complementary with RD.  HHomp c  1400  1435  1437  1441  1442  1445  1449  1454  1455  1459   PPA-OMP c  1314  1389  1452  1541  1564  1634  1667  1706  1728  1741   Control-PPA c  1166  1291  1310  1319  1323  1346  1362  1393  1404  1417   Table 4. Comparison of receiver operator characteristics table for different methods. a Here, false positives correspond to those non-OMPs that are predicted as OMPs. b The numbers in this line show various thresholds of false positives. c The numbers in these lines correspond to true positives that can be identified by methods tested here. Some structural features are correlated to a lesser degree, such as SS and RSA. Therefore, it is reasonable that the performance of RSA prediction can be slightly improved by using secondary structure based on this context. Benchmark of outer membrane protein identification. We used the same library and test dataset (i.e., R-dataset) as HHomp to assess our OMP identification method. The performance of OMP identification was compared via ROC analysis. We paid more attention to the performance at < 1% (e.g., 50 false positive instances in the R dataset), which is considered a critical threshold in practical applications. As can be seen from Fig. 6 and Table 4 PPA-OMP correctly recognizes 1,314 OMPs before including the first 5 false positives, and, HHomp can detect 1,400 OMPs at the same level. At a 1% false positive rate control, PPA-OMP can correctly recognize 1,741 OMPs, and the number is slightly higher than that identified by HHomp (1,459 OMPs). The RSA, RD and Phi-based terms are informative. This can be clearly demonstrated by a 3% lower sensitivity when these terms were removed from the PPA-OMP method (i.e., Control-PPA). In other words, the Control-PPA method was constructed by using sequence profiles and secondary structure terms. The significance of PPA-OMP alignment scores can also be calculated from the R-dataset. It is estimated that a predicted result is at less than the 1% and 5% false positive rates if the alignment scores are higher than 20 and 15, respectively.
Benchmark experiment on β-class globular proteins. Since all β -class globular proteins and OMPs share similar 3D structures, it is very necessary to benchmark the performance of PPA-OMP for excluding globular proteins. Here, we randomly selected one protein from each family from the SCOPe_40% dataset. Thus, we compiled a dataset called Beta-G822, which contains 822 all β -class globular proteins. These 822 β -class proteins were directly fed into PPA-OMP, which was constructed using 496 consensus OMPs as the library. There are 34 of these 822 β -class proteins that were predicted to be OMP with scores higher than 99% confident level. This prediction accuracy is very high (95.8%). This result may be attributed to the fact that all β -class globular proteins and OMPs were grouped into different homologous families although they share similar 3D structures. For example, OMPs were grouped into f.5 fold (Outer membrane efflux proteins (OEP)), f.4.2 superfamily (Outer membrane phospholipase A (OMPLA)), f.4.1.2 family (Outer membrane enzyme PagP), etc., while all β -class globular proteins were grouped into b class (i.e., b.*.*.*, where * is a wild symbol). Meanwhile, PPA-OMP is a profile-to-profile alignment method and it can discriminate different protein families. PPA-OMP therefore can accurately exclude β -class globular proteins from OMPs. The prediction results for these proteins are publicly available at http://genomics.fzu.edu.cn/OMP/benchmarks/globular_beta_proteins.tar.bz2.
Protein structure prediction of OMPs. The algorithms described in this work were seamlessly incorporated into our OMP prediction web server (http://genomics.fzu.edu.cn/OMP/). The web server can accept a single protein, either in plain text or in FASTA format. A multiple FASTA formatted input is also acceptable. The number of proteins is up to 50 in each multiple FASTA input. To build 3D structures of potential OMPs, we compiled a library (http://genomics.fzu.edu.cn/OMP/3DLibrary/) consisting of 154 structurally known protein chains. Each target protein, either assumed or predicted to be an OMP, is threaded through the library with the PPA-OMP alignment algorithm. The final models are built by the alignments between target and identified templates with the assistance of MODELLER 23 . The generated models are reliable only if the query is an OMP. The performance of PPA-OMP is further exemplified in the protein structure prediction of 2O5PA ( Figure S1 in supplementary file 2). Although both 2O5PA and 1XKWA are from Pseudomonas aeruginosa and are structural homologs, they share a weak sequence similarity. When we searched 2O5PA against the sequences of the OMP library using PPA-OMP, 1XKWA was one of top hits (other top templates were closely homologous proteins). The model built by using the 1XKWA was RMSD of 2.79 Å and TM-score value of 0.798. As reported by Xu and Zhang 30 , the model is reliable if the TM-score value is higher than 0.5. Therefore, the model for 2O5PA by PPA-OMP is high-quality and can be used for further analysis. There are only a few non-redundant structurally known OMPs (~30) 16,20,31 , and we therefore did not benchmark the performance of protein structure prediction of the PPA-OMP using large scale datasets.

Proteome-wide OMP identification in E. coli.
We utilized the complete proteome of E. coli to test the performance of our method in a real application. In our previous work 20 , we collected a known OMP dataset consisting of 122 proteins from the E. coli proteome by retrieving the annotations from NCBI, PSORTdb 32 , OMPdb 33 databases and via SPARK-X 34 fold recognition tool. In this work, the 4,126 sequences of E. coli were directly fed into PPA-OMP and 111 proteins were predicted to be potential OMPs with a false positive rate control of 1%. There were 76 out of the 111 proteins in the known E. coli OMP dataset. Therefore, these 76 predicted OMPs should be regarded as true positives with high confidence. Details of the 76 identified OMPs and their predicted 3D models are available at http://genomics. fzu.edu.cn/OMP/attachments/. Six computational algorithms (i.e., SOSUI, amino acid, dipeptide, motif, SVM-based methods and a method called "New approach") were used in TMBETA-GENOME (http:// tmbeta-genome.cbrc.jp/annotation/) to annotate OMPs. Here, we assumed a protein is predicted to be OMP with high confidence if at least five computational methods of TMBETA-GENOME predict it as OMP. And we obtained 182 such proteins from TMBETA-GENOME for E. coli K12 proteome (http:// genomics.fzu.edu.cn/OMP/benchmarks/TMBETA-GENOME_182_OMPs.txt. Interestingly, 70 out of these 76 proteins are included in those 182 proteins predicted by TMBETA-GENOME database.
In the remaining 35 proteins, there exist 9 proteins whose subcellular localizations are annotated as 'unknown' or 'this protein may have multiple localization sites' in the PSORTdb 32 database. It should be clearly pointed out that some of these 9 proteins may be OMPs that are not experimentally identified yet. To validate the real types of them may need experimental work or further bioinformatics analysis. The other remaining 26 hits are clearly annotated as non-OMPs based on subcellular localization information in the PSORTdb database, suggesting that they are very likely to be false positives.
In fact, it is estimated that 96 ~ 98% protein sequences in the E. coli proteome are non-OMPs 31,35 , and it is therefore reasonable to have 30 ~ 40 false positives at a 1% false positive rate.

Discussions
Taken together, we can clearly draw the following conclusions: (1) accurate predictors for protein structural properties, including SS, RSA, RD and Phi, can be built by combining neural network and effective features, (2) these predicted structural features can be applied to improve the identification of OMPs. In this work, we proved these two conclusions by developing two computational tools, namely PSSM-2-Features and PPA-OMP. PSSM-2-Features was designed for predicting structural features, and PPA-OMP program was for identifying OMPs from proteome wide sequences.
There are two factors, the informative input features and highly optimized neural networks, making the PSSM-2-Features relatively accurate. The effectiveness of OMP identification by PPA-OMP can be attributed to the fact that the accuracies of the developed predictors are relatively high and most β -barrel OMPs are relative by common ancestry 16 .
It should be noted that an obvious drawback of our methods, in contrast to other algorithms that only use the target sequence information, is that predicted structural features may be inaccurate when sequence profiles contain non-homologous proteins. Thus, the performance of both structural feature prediction and OMP identification will be affected in such cases. However, incorporation of predicted structural features will, on average, significantly improve prediction performance, in many realistic applications. Therefore, our web server and programs, most probably, will be useful for researchers in the biological community.

Materials and Methods
Data collection and preprocessing. To build reliable models for structural features, it is essential to compile large and non-redundant datasets for training and testing. We collected 6,675 proteins from the PDB 36 database. The set of the 6,675 proteins was named PDB_TRAIN6675 and used as a training dataset. Furthermore, we compiled a test dataset called SCOPe_TEST1073, which consists of 1,073 proteins from SCOPe (SCOP extended, version 2.03) database 37 , to benchmark the performance of structural feature prediction. These proteins in SCOPe_TEST1073 share low similarity to the proteins in the PDB_TRAIN6675 (BLAST e-value > 0.001). Both PDB_TRAIN6675 and SCOPe_TEST1073 datasets are non-redundant. The PDB_TRAIN6675 was constructed by removing highly similar sequences at a cutoff of 40% identity via CD-HIT 38 . The SCOPe_TEST1073 contains 1,073 superfamilies (i.e., each protein in the dataset is a representative superfamily of SCOPe database). To critically benchmark OMP identification, R-dataset (compiled by Remmert et al. 16 ), which contains 2,164 OMPs from the TransportDB 39 database and 5,000 non-OMPs randomly selected from the SCOP 40 database (version 1.69), was downloaded from ftp://ftp.tuebingen.mpg.de/pub/protevo/HHomp/benchmark/. It should be mentioned that those proteins in the training dataset (i.e., PDB_TRAIN6675) share low similarity to the proteins of both SCOPe_TEST1073 and R-dataset at the sequence level (BLAST e-value > 0.001). We further clustered sequences in PDB_TRAIN6675 dataset with the cutoff of 30% sequence identity via BLASTCluster program in the BLAST package. The highly similar sequences with sequence identity > 30% were removed and we obtained 6,001 sequences from PDB_TRAIN6675 dataset in this step. The set of these 6,001 sequences was named PDB_CS6001. The 30-fold cross-validation is used to benchmark prediction on PDB_CS6001 dataset. More details of constructing the datasets are available at supplementary file 3. All datasets and benchmark results in this study can be downloaded at http://genomics.fzu.edu.cn/OMP/ benchmarks/.
Neural network learning. We used neural networks (NNs) to train the predictors for structural features in this work. Prediction performance of NN-based predictors mainly depends upon two factors. One is how much information is contained in input features, the other is the architecture of NNs. A NN is generally composed of three components, i.e., an input layer, one or more hidden layers and an output layer. The training process is to obtain the optimized weights connecting different layers. The training algorithm used in this work was implemented via Encog framework (https://code.google.com/p/ encog-java/downloads/list). The learning rate of 0.001 and momentum of 0.85 were found to be effective. The sigmoid activation function (1/(1 + e −x )) was applied to hidden and output layers. The architecture and parameters were specifically optimized for each feature predictor. The procedures to develop these algorithms (i.e., predictors for SS, RSA, RD and BTA) were through similar steps: (i) sequence profile generation, (ii) encoding construction and (iii) optimization of parameters of NNs.
Protein secondary structures. STRIDE 9 and DSSP 10 were used to derive protein secondary structure from 3D coordinates. The STRIDE program utilizes both hydrogen bond energy and main chain dihedral angles, to derive secondary structures for structurally known proteins while the DSSP program mainly depending on hydrogen bond energy. The states obtained by STRIDE and DSSP are G (3-10 helix), H (α -helix), I (PI-helix), T (turn), E (extended conformation), B (isolated β -bridge), and C (coil). These seven states are reduced by the following transformations: H, G and I -> H (α -helix), E and B -> E (β -strand), and other states -> C (coil). Different results obtained by STRIDE and DSSP are compared and discussed in this paper. We trained the predictor for secondary structure using a similar way to Psipred. Briefly, two standard feed-forward back-propagation NNs are used. The first NN contains two hidden layers and the second NN has only a single hidden layer. The output layer contains three nodes with each node standing for one secondary structure type. The sigmoid activation function is used and the three secondary structures are therefore encoded as H (0, 0, 1), E (0, 1, 0), and C (1, 0, 0) in the output layer. The generated three secondary structure probabilities of the first neural network are fed into the second neural network that again produces probabilities (i.e., final probabilities).
Residue solvent accessibility. The relative solvent accessibility (RSA) of an amino acid in a protein measures to the extent of the amino acid accessible to a solvent (usually water) surrounding the protein.
In general, hydrophobic amino acids are buried inside the protein while hydrophilic amino acids are on or near the surface. The DSSP program was used to calculate accessible surface areas for all residues in our datasets. The obtained accessible surface areas were then divided by the surface areas of amino acids to get RSA. The surface areas of twenty amino acids were obtained from the reference 41. Because of unusual bond angles, sequence lengths and distorted geometry in real proteins, RSA values can sometimes exceed 100%. We directly set the values to 100% for such cases.

Residue depth.
In contrast to the solvent accessibility, residue depth (RD) measures the degree of inaccessibility of a given residue buried inside a protein. The concept of RD supplements the information provided by RSA. The RD values of proteins were calculated by EDTSurf 42 program. The values output by EDTSurf lie in [2.8, 9.8], where a higher value corresponds to a deeper region where a residue locate. Using the same method as in FFAS-3D 43 , the RD values were normalized to the range of 0 ~ 1 by an equation as follow where dv(i) is the value output by EDTSurf, and it is absolute residue depth value for the residue i. RD(i) is the relative residue depth value for residue i. A model was trained to predict RD scores of proteins.
Backbone torsion angles. We only trained and used a predictor for Phi angle according to the fact that the predicted result of Psi is not very satisfactory (data not shown). The Phi angles range from − 180˚ to 180˚. The angles were transformed to the range of 0˚ to 360˚ by keeping the angles between 0˚ and 180˚ unchanged, and adding 360˚ for angles between − 180˚ and 0˚. The angles were then linearly normalized to the range of 0 ~ 1 by dividing by 360˚.

Fitness of amino acids in secondary structures.
We also analyzed the fitness of amino acids in α -helix, β -strand, and coil on PDB_TRAIN6675 dataset and applied it to protein secondary structure prediction. Three probability values for each amino acid appearing in the three types of secondary structures were derived as where NA i is the number of the ith residue type in the dataset. NS j,i is the number of the jth secondary structure type in the ith residue type. i is in the range of 1 to 20, representing 20 amino acids, and j ranges between 1 and 3, standing for 3 types of secondary structures. The obtained score FT(i,j) is the fitness score between the ith residue type and the jth secondary structure type. The probabilities of three secondary structure types for any residue sum up to 1.

Input features.
Each target residue is represented by the features of its sequence or structural characteristics. The features for training are usually referred to as input vectors or input features. The construction of input features is mainly based on three observations. First, sequence profiles are important for structurally relative properties. Second, it is helpful to consider adjacent residues of the target residue. Last but not least, some structural properties are somewhat correlated (e.g., SS and RSA). For a query sequence, its sequence profile can be generated by using the PSI-BLAST 21 to search NCBI NR database for three iterations with an e-value cutoff of 0.001. There are two types of sequence profiles generated using the option '-Q' . One is a position specific scoring matrix/profile (PSSM), the other is a position specific frequency matrix/profile (PSFM). Both of the profiles are used in this work. For each residue, a sliding window containing 2n + 1 residue long (i.e., window size = 2n + 1) fragment profiles centered at the target residue is excised from the sequence profile and fed into NNs. where entropy(i) is the entropy value of residue i using Eq. 3 and CS(i) is the conservation score for residue i. If a position is very conserved (i.e., only one type of amino acids found in this position), the entropy will be equal to 0. The entropy is close to 2.996 if the residue is highly variable (i.e., the frequencies of twenty amino acids are equal in the position). The CS value lies in (0,1], where a higher score corresponds to a more conserved state for the residue. In addition, an extra unit per amino acid is used to indicate whether the residue spans either the N or C terminus of the protein chain. For region spanning the N or C terminus, the feature values are set to zeros and the value of the additional bit is set to 1, otherwise the value of the bit is set to 0. We carefully selected input features for each specific structural property. For RD and Phi predictions, we used a sliding window containing PSSM profile, PSFM profile, conservation score, and an extra unit per amino acid, indicating whether the residue spans either the N or C terminus of the protein chain. For SS prediction, in addition to the features used by the former two predictors, we further used the fitness of amino acids in SS. For RSA prediction, we used the input features that were used in RD, and, the three probabilities of SS prediction were also employed. It should be clearly noted that the predicted SS used in RSA prediction is not a sliding window, but just the probabilities of three types of SS for the target residue. A sliding window containing secondary structures has also been examined, but no improvement was observed. Dynamic programming for outer membrane protein identification. The dynamic programming algorithm was implemented using the procedure described in the book of Durbin et al. 44  where Profile(i,j) is a simple dot-product profile-to-profile alignment score. SS_Sim(i,j) is a measure of secondary structure similarity. The term ∑ ∆ = , w k k ij k 2 4 is used to calculate the differences of structural properties between the target and template sequences. The shift parameter is introduced to avoid the alignment of unrelated residues in the local regions. We will explain the details of each term in the following sections. The statistical significance of alignment scores is calculated using the same way as our previous work 45 (See supplementary file 5 for details).