RPI-Bind: a structure-based method for accurate identification of RNA-protein binding sites

RNA-protein interactions play crucial roles in multiple biological processes and are significantly influenced by the structures and sequences of both the protein and the RNA molecules. In this study, we first analyzed RNA-protein interacting complexes and identified sequence and structural properties of their interfaces, which reveal the diverse nature of the binding sites. Based on these observations, we built a three-step prediction model, RPI-Bind, for identifying RNA-protein binding regions from the sequences and structures of both proteins and RNAs. The three steps are 1) the prediction of RNA binding regions on the protein, 2) the prediction of protein binding regions on the RNA, and 3) the simultaneous prediction of interacting regions on both RNA and protein, using the results from steps 1) and 2). Compared with existing methods, most of which employ only sequences, our model significantly improves the prediction accuracy at each of the three steps. In particular, our model outperforms catRAPID by >20% at the third step. These results indicate the importance of structure in RNA-protein interactions and suggest that RPI-Bind is a powerful theoretical framework for studying them.


Supplemental materials

Section 1. Data set
A total of 1,342 protein-RNA complexes were extracted from the Nucleic Acid Database (NDB) 1 (as of January 2015), and their corresponding structures, solved by X-ray crystallography at a resolution better than 3.5 Å, were downloaded from the Protein Data Bank (PDB) 2 . Polypeptides (proteins with sequence length <25 amino acids) and polyribonucleotides (RNAs with sequence length <10 nucleotides) were excluded because of their low information content. Further, to avoid homology bias, a strict criterion was used.
First, a non-redundant dataset was obtained by clustering the protein and RNA sequences with the CD-HIT program 3 at 40% identity and 80% similarity. Then another common tool, the BLASTCLUST program 4 , was used to ensure that no clear homologues remained in the non-redundant dataset. In this way, we removed protein and RNA chains with sequence identity above 25% and 60%, respectively. Additionally, similar protein chains interacting with similar RNA chains were considered homologous RNA-protein pairs. Such pairs were clustered, and a representative pair was selected from each cluster.
Since the PDB data contain both biological interactions and crystal contacts, it was necessary to distinguish between them, because crystal contacts can cause erroneous identification of interactions. Here, only RNA-protein pairs containing more than 5 interacting residues/nucleotides were kept as biological interactions. Finally, 172 non-redundant RNA-protein pairs were obtained (Supplemental Table S1).
Further, we used the following criteria to define protein-RNA contacts: (i) hydrogen bonds with a maximum donor-acceptor distance ≤ 3.35 Å and a maximum hydrogen-acceptor distance ≤ 2.7 Å 5,6 , and (ii) van der Waals forces, with hydrophobic, electrostatic and distance-based interactions ≤ 3.9 Å 7,8 . As a result, we obtained a total of 28,780 contacts, consisting of 9,077 RNA binding sites and 5,692 protein binding sites.
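The distance part of these criteria can be illustrated with a minimal sketch. It assumes atoms are given as (residue or nucleotide identifier, coordinate) pairs, and it applies only the 3.9 Å cutoff; the hydrogen-bond criterion would additionally require donor/acceptor typing, which is not shown. The function name `find_contacts` and the toy coordinates are hypothetical and are not taken from the original pipeline.

```python
from math import dist

VDW_CUTOFF = 3.9  # Å, the distance cutoff quoted in the text


def find_contacts(protein_atoms, rna_atoms, cutoff=VDW_CUTOFF):
    """Return sorted (residue_id, nucleotide_id) pairs that have at least
    one atom pair within `cutoff` Å. Hypothetical helper for illustration."""
    contacts = set()
    for res_id, p_xyz in protein_atoms:
        for nt_id, r_xyz in rna_atoms:
            if dist(p_xyz, r_xyz) <= cutoff:
                contacts.add((res_id, nt_id))
    return sorted(contacts)


# Toy coordinates (assumed, in Å): two atoms of ARG10, one of GLY11
protein_atoms = [
    ("ARG10", (0.0, 0.0, 0.0)),
    ("ARG10", (1.5, 0.0, 0.0)),
    ("GLY11", (10.0, 0.0, 0.0)),
]
rna_atoms = [("U5", (3.0, 0.0, 0.0)), ("A6", (20.0, 0.0, 0.0))]

contacts = find_contacts(protein_atoms, rna_atoms)
# Residues and nucleotides involved in at least one contact
binding_residues = sorted({res for res, _ in contacts})
binding_nucleotides = sorted({nt for _, nt in contacts})
```

A residue or nucleotide that appears in at least one contact pair would then be counted as a binding site. For real structures, a spatial index would replace the quadratic loop.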
Residues other than binding sites are non-binding sites. However, a region of a protein or RNA may be non-binding not because of its sequence or structure, especially its local structure, but simply because of its distance from the partner molecule in 3D space. Therefore, treating all residues outside the binding sites as negatives may underestimate the roles of structure and sequence in the binding regions. In this work, we mainly focus on the role of local structure and sequence, and we ask whether binding sites differ from their neighboring non-binding sites within the same local structural and sequence environment. We therefore specifically selected the neighboring non-binding sites as negative samples. Under the above criteria, residues or nucleotides that fail to form hydrogen bonds or van der Waals, hydrophobic and electrostatic interactions are defined as non-binding sites. A sliding window of length five, centered on each binding site, was then used to search both sides of the binding site. The non-binding sites within the window are the "neighbor non-binding sites". In this way, we collected 9,801 RNA non-binding sites and 3,078 protein non-binding sites.
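The window-based selection of negative samples can be sketched as follows, assuming each chain is encoded as a list of 0/1 labels (1 = binding site). The function name and the toy labels are hypothetical.

```python
def neighbor_nonbinding(labels, window=5):
    """Return indices of non-binding positions that fall inside a window
    of length `window` centered on any binding position."""
    half = window // 2
    n = len(labels)
    neighbors = set()
    for i, is_binding in enumerate(labels):
        if not is_binding:
            continue
        # Scan the window around binding site i, clipped to the chain ends
        for j in range(max(0, i - half), min(n, i + half + 1)):
            if not labels[j]:
                neighbors.add(j)
    return sorted(neighbors)


# Toy chain: positions 4 and 5 are binding sites
labels = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]
negatives = neighbor_nonbinding(labels)
```

Here positions 2, 3, 6 and 7 are selected as negatives, while non-binding positions farther from the interface are ignored.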

Section 2. Structure representations
In this study, protein blocks (PBs) were used to represent protein local conformations (PLCs). PBs are a structural alphabet used to describe every region of the protein backbone. The alphabet comprises 16 structural fragments, each 5 residues long and corresponding to eight dihedral angles (φ, ψ). The structural differences between PLCs depend on the side-chain length, position, and proximity of the dihedral angle to the protein backbone. Each PLC has a unique side-chain conformation; therefore, the PLCs can represent the protein side chains. We used PDB-2-PB 9 to retrieve the PBs for each protein in our data set (Supplemental Table S2).
The PLCs and the RNA local conformations (RLCs) are similar to protein and RNA secondary structures, but they contain more structural states than regular secondary structures. The PLCs and RLCs can therefore be used to predict potential binding sites not only on the bound conformations of proteins and RNAs but also on their unbound conformations.

PLCs and RLCs binding preferences
PLCs and RLCs interface preferences were calculated for the non-redundant dataset of protein-RNA complexes. These preferences give a measurement of the possibility of

Structural features of proteins and RNAs
In this study, interacting RNA-protein pairs were measured using interaction propensities expressed as log-odds values. Here we considered the structure preference between every three consecutive amino acids (a triplet) and one contacting RNA nucleotide.
We adopted this interaction propensity measurement, because neighboring PLCs/RLCs play an important role in determining the RPI.
The triplet interaction propensities S(i, j) for all four possible combinations were calculated using the following formula:
S(i, j) = ∑ f_{p,r}(i, j)
We calculated four types of triplet log-odds values from the RPI sequences and structures, as follows: (i) amino acid triplets with nucleotides, (ii) nucleotide triplets with amino acids, (iii) PLC triplets with RLCs and (iv) RLC triplets with PLCs. In this work, we used all four triplet log-odds matrices for binding site prediction.
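Because the formula above is only partially preserved, the following sketch does not reproduce it. It shows one common way to compute a log-odds propensity for triplet/partner pairs, namely the logarithm of the observed pair frequency over the frequency expected if triplet and partner were independent. The choice of background (the contact set itself), the function name and the toy data are all assumptions.

```python
from collections import Counter
from math import log


def triplet_log_odds(pairs):
    """pairs: list of (triplet, partner) contacts, e.g. ("RKG", "U").
    Returns log(observed / expected) for every observed pair, with the
    expectation taken from the marginal frequencies (an assumed background)."""
    pairs = list(pairs)
    total = len(pairs)
    pair_counts = Counter(pairs)
    triplet_counts = Counter(t for t, _ in pairs)
    partner_counts = Counter(p for _, p in pairs)
    scores = {}
    for (t, p), c in pair_counts.items():
        observed = c / total
        expected = (triplet_counts[t] / total) * (partner_counts[p] / total)
        scores[(t, p)] = log(observed / expected)
    return scores


# Toy contacts between amino acid triplets and nucleotides (hypothetical)
observed_pairs = [("RKG", "U"), ("RKG", "U"), ("RKG", "A"), ("LVA", "A")]
scores = triplet_log_odds(observed_pairs)
```

A positive score marks a pair seen more often than expected by chance, a negative score a pair seen less often. The same routine would apply unchanged to the other three combinations (nucleotide triplets with amino acids, PLC triplets with RLCs, RLC triplets with PLCs).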

Section 3. Sequence features of proteins
Each amino acid residue is represented by six descriptors including sequence mutual interaction propensities, physicochemical characteristics, hydrophobic index, relative accessible surface area, conservation score and side-chain pKa values, as follows: The sequence mutual interaction propensities were calculated as triplet-log-odds values 15  in a protein being mutated of corresponding amino acid type during the evolution process.

Section 4. Sequence features of RNAs
Each nucleotide in step 2 is also represented by three sequence features: mono-, di- and tri-nucleotide composition. For the mono-nucleotide composition, we calculated the composition of each of the four nucleotides (A, C, G, and U) in each window sequence separately. For the di-nucleotide composition, the composition of two consecutive nucleotides (AA, AC, AG, AU ...) in each window sequence was calculated separately, giving a total of 16 numerical values. For the tri-nucleotide composition, we calculated the composition of three consecutive nucleotides (AAA, AAC, AAG …) in each window sequence separately.
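The three composition features can be sketched as below. The normalization (counts divided by the number of k-mer positions in the window) and the example window are assumptions, since the text does not state them.

```python
from itertools import product


def kmer_composition(window_seq, k):
    """Return the frequency of every possible k-mer over A, C, G, U
    in `window_seq`, in lexicographic k-mer order."""
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    n_positions = len(window_seq) - k + 1
    for i in range(n_positions):
        kmer = window_seq[i:i + k]
        if kmer in counts:  # skip k-mers with non-standard symbols
            counts[kmer] += 1
    denom = max(n_positions, 1)  # avoid division by zero on short windows
    return [counts[km] / denom for km in kmers]


window = "AUGGCUA"  # hypothetical window sequence
mono = kmer_composition(window, 1)  # 4 values
di = kmer_composition(window, 2)    # 16 values
tri = kmer_composition(window, 3)   # 64 values
features = mono + di + tri
```

Concatenating the three vectors gives 4 + 16 + 64 = 84 composition values per window.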

Section 5. Machine learning methods
The Random Forest (RF) approach is a popular machine learning technique used for dealing with various biological problems 20 . Here, we applied the RF classifier implemented in the RF package in R to perform binding site prediction for a given protein chain, RNA chain and protein-RNA pair. Two parameters, ntree (the number of trees to grow) and mtry (the number of variables randomly selected as candidates at each node), were optimized using a grid search; ntree ranged from 500 to 2500 in steps of 500, and mtry ranged from 1 to 40 in steps of 1. In addition, RF can rank features by their contribution to the performance of the predictive model. Permutation importance is frequently used in the RF method to measure the relative importance of features: the importance score of a feature is the average decrease in model accuracy on the out-of-bag samples when that feature is randomly permuted. Both the RF classifier and the permutation importance analysis were implemented with the RF package in R.
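The idea behind permutation importance can be illustrated with a model-agnostic sketch in plain Python. Unlike the RF implementation described above, which uses the out-of-bag samples of each tree, this simplified version permutes features on a single fixed evaluation set; the toy classifier and data are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average drop in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the data with only this column permuted
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical classifier that uses feature 0 only."""
    return int(row[0] > 0.5)


X = [[0.1, 5.0], [0.9, 3.0], [0.2, 4.0], [0.8, 1.0], [0.3, 2.0], [0.7, 6.0]]
y = [0, 1, 0, 1, 0, 1]
importances = permutation_importance(toy_predict, X, y)
```

Feature 1 is ignored by the toy classifier, so shuffling it never changes the accuracy and its importance is exactly zero, whereas shuffling feature 0 degrades the predictions.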
We also compared the performance of the RF method with other machine learning methods, including the Support Vector Machine (SVM) and the Neural Network (NN). For the SVM, we used the radial basis function (RBF) as the kernel function, and two parameters, the regularization parameter C and the kernel width parameter γ, were optimized using a grid search. Good parameters were identified from exponentially growing sequences of (C, γ) (for example, C = 2^-5, 2^-4, …, 2^10 and γ = 2^-10, 2^-9, …, 2^5). A standard feed-forward neural network was used, with a sigmoid transfer function and a single hidden layer of 10 neurons. All possible connections were allowed between the input units and the hidden-layer neurons, as well as between the hidden-layer neurons and the final output units. The backpropagation algorithm was applied in training the NN, with random initial weights. The learning rate was set to 0.0001 and the weight decay to -0.001.
SVM and NN were implemented in the Python module scikit-learn.
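The exponential (C, γ) grid can be enumerated as in the sketch below. The scoring function is a hypothetical stand-in for a cross-validated accuracy. In practice it would train and evaluate an RBF-kernel SVM for each parameter pair, which is not shown here.

```python
from itertools import product
from math import log2

# Exponentially growing grids quoted in the text
C_GRID = [2.0 ** e for e in range(-5, 11)]       # 2^-5 ... 2^10
GAMMA_GRID = [2.0 ** e for e in range(-10, 6)]   # 2^-10 ... 2^5


def grid_search(score_fn, c_values=C_GRID, gamma_values=GAMMA_GRID):
    """Evaluate `score_fn(C, gamma)` on every grid point and return
    (best_score, best_C, best_gamma)."""
    best = None
    for c, gamma in product(c_values, gamma_values):
        score = score_fn(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best


def toy_score(c, gamma):
    """Hypothetical score surface peaking at C = 2^2, gamma = 2^-2;
    a real run would return cross-validated accuracy instead."""
    return -((log2(c) - 2) ** 2 + (log2(gamma) + 2) ** 2)


best_score, best_c, best_gamma = grid_search(toy_score)
```

The grid holds 16 × 16 = 256 candidate pairs. A coarse exponential grid like this is often followed by a finer search around the best point.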

Section 6. Performance evaluation
Five-fold cross validation was used to evaluate the performance of our models. The data set was divided into 5 subsets of equal size. Each subset was used for testing, while the

Supplemental Tables
Supplemental Table S1. 172 non-redundant RNA-protein pairs Supplemental