Rsite2: an efficient computational method to predict the functional sites of noncoding RNAs

Noncoding RNAs (ncRNAs) represent a big class of important RNA molecules. Given the large number of ncRNAs, identifying their functional sites is becoming one of the most important topics in the post-genomic era, but available computational methods are limited. For the above purpose, we previously presented a tertiary structure based method, Rsite, which first calculates the distance metrics defined in Methods with the tertiary structure of an ncRNA and then identifies the nucleotides located within the extreme points in the distance curve as the functional sites of the given ncRNA. However, the application of Rsite is largely limited because of limited RNA tertiary structures. Here we present a secondary structure based computational method, Rsite2, based on the observation that the secondary structure based nucleotide distance is strongly positively correlated with that derived from tertiary structure. This makes it reasonable to replace tertiary structure with secondary structure, which is much easier to obtain and process. Moreover, we applied Rsite2 to three ncRNAs (tRNA (Lys), Diels-Alder ribozyme, and RNase P) and a list of human mitochondria transcripts. The results show that Rsite2 works well with nearly equivalent accuracy as Rsite but is much more feasible and efficient. Finally, a web-server, the source codes, and the dataset of Rsite2 are available at http://www.cuialb.cn/rsite2.

Scientific RepoRts | 6:19016 | DOI: 10.1038/srep19016 sites have been released based on the above datasets 27 . These methods use diverse strategies, e.g. machine learning 28 , pattern recognition 29 , phylogenetic motif identification [30][31][32] , structure network analysis 33 , text mining 34 , and geometric analysis [35][36][37][38] . However, considering currently limited information about ncRNA functional sites and low sequence conservation of ncRNAs 39 , computational methods like machine learning and phylogenetic motif identification are not suitable so far to predict ncRNA functional sites. As for geometric analysis, it finds that the residues close to the protein centroid 38 and the residues that resides on the protein surface 35 could be the functional elements. These residues are normally located within the local extreme points of residue distance curve. We suppose that it could be applicable for ncRNA as well. In a previous study 40 , we have developed a tertiary geometry based computational method, Rsite, to identify ncRNA functional sites and have displayed its reliable accuracy. The major limitation of Rsite is that it works based on ncRNA tertiary structures. Up to now for most of ncRNAs, their experimentally derived tertiary structures are not available. In addition, it is still very hard to identify the tertiary structures of RNAs especially large RNA molecules (> = 100nt) using computational prediction [41][42][43] . The application of Rsite is therefore largely restricted.
In this study, we first revealed a strong correlation between the nucleotide distances derived from the RNA tertiary structures and those derived from the RNA secondary structures. This makes it reasonable to develop secondary structure geometry based computational methods for ncRNA functional site prediction. Compared with ncRNA tertiary structures, secondary structures can be determined and predicted more simply and easily. Based on the above observation, here we developed a secondary structure based computational method, Rsite2, to predict the functional sites of ncRNAs. Finally, we confirmed that Rsite2 has a nearly equivalent accuracy as the tertiary structure based method (Rsite) but is much more feasible and efficient.

Results
ncRNA secondary structures and tertiary structures show significantly positive correlation. With a non-redundant list of RNA-containing 3D structures from the Nucleic Acid Database (NDB, http://ndbserver.rutgers.edu/) 44,45 that records in total 1419 PDB IDs of the equivalence class representatives, we obtained 1034 biological assemblies of them from the Protein Data Bank (PDB, http://www.rcsb.org/pdb/ home/home.do) 19 . After a screening procedure (see Methods), there left 203 ncRNA molecules for further 2D-3D structure correlation analysis. Almost all of the 203 ncRNAs are shorter than 200nt except a few rRNA. In the case of the Euclidean distances from nucleotides to molecular centroid, the overall Spearman correlation coefficient is 0.79 (p = 0) (Fig. 1a). In addition, for each specific ncRNA molecule, we calculated the q values using the R (v3.2.0) package 'qvalue' 46 (v1.99.1). It can be seen that 122 (60%) of the 203 ncRNAs exhibit significantly strong correlations (Spearman rho > 0.6, q < 1e-5) (Fig. 1b) and 173 (85%) ncRNA molecules demonstrate significantly positive correlations (q < 0.01) between 2D and 3D structures. As shown above, it is reasonable to predict ncRNA functional sites on the basis of secondary structure-derived distance metrics. And we further proved the feasibility in the following sections.
Predicting the functional sites of the tRNA (Lys). Firstly, we downloaded the secondary structure of the tRNA (Lys) in PostScript format from RNA STRAND (http://www.rnasoft.ca/strand/) 47 . We next calculated and plotted the distance curve (Fig. 2). As a result, Rsite2 identified 10 possible functional sites in total. The result shows that only 1 of the 7 known functional sites is missed (Table 1; Fig. 3). As a result, Rsite2 has a sensitivity of 86% (6/7) and a positive predictive value (PPV) of 60% (6/10) for the prediction of the functional sites of tRNA (Lys).
Predicting the functional sites of the Diels-Alder ribozyme. In the same way as the above section, we calculated and plotted the distance curve of the Diels-Alder ribozyme (Fig. 4). As a result, Rsite2 predicted 7 putative functional sites for the Diels-Alder ribozyme. We successfully predicted all of the 3 known functional sites ( Table 2; Fig. 5). The result shows that Rsite2 has a sensitivity of 100% (3/3) and a PPV of 43% (3/7).
Predicting the functional sites of the S. cerevisiae RNase P. Moreover, we applied Rsite2 to RNAase P whose secondary structure and functional sites were experimentally identified. For doing so, we downloaded the secondary structure of RNase P from RNA STRAND 47 . And then we calculated the distance curve (Fig. 6). As a result, Rsite2 predicted 25 putative functional sites for the RNase P. Rsite2 successfully hit 13 of the 17 functional protein binding sites derived from RNase footprinting assays 48 (Table 3; Fig. 7). It shows that Rsite2 achieves a sensitivity of 76% (13/17) and a PPV of 52% (13/25).
Predicting the functional sites of human mitochondrial transcripts. Recently, by digital RNase footprinting, Liu et al. 49 mapped the functional segments of human mitochondrial transcripts that participate in interactions with proteins. We wondered whether these regions can be predicted by Rsite2. For this purpose, we first retrieved the transcript sequences (Feb. 2009 GRCh37/hg19) on the mitochondrial chromosome (chrM) using the UCSC Table Browser data retrieval tool 50 (http://genome.ucsc.edu/). RNAfold 51 was then used to predict the secondary structures of these mitochondrial RNAs. Finally, Rsite2 predicted the putative functional sites for each of them. After careful comparison, we found that Rsite2 successfully hits 32 (74%) of the 43 functional segments in the mitochondrial RNAs from the UCSC database 50 . Due to the low ratio of footprint's length to RNA's length and the large number of predicted sites, and because in default Rsite2 merges sites no more than 2 nucleotides apart regardless of RNA's length, PPVs range from 1% (1/72) to 44% (4/9) and the overall PPV is 7% (54/763) (Supplementary File 1).

Discussion
It is increasingly needed to develop computational methods for the identification of ncRNA functional sites. We previously developed an ncRNA tertiary structure based method, Rsite. However it is largely limited because the tertiary structures of most ncRNAs are not available. In this study, we revealed that there is a significantly positive correlation between ncRNA secondary structures and tertiary structures. This observation gives clues that secondary structure-derived distance metrics could be used to predict their functional sites. Based on this observation, we then presented a secondary structure based computational method to predict candidate functional sites of ncRNAs. Since we are not able to assess Rsite2 systematically for the reason that there are not standard datasets that we can employ as benchmarks at the present stage, we showed the efficiency and accuracy of Rsite2 by applying it to three ncRNAs and a set of mitochondrial RNAs. Compared with the first version of Rsite, Rsite2 was improved in the following two aspects. Firstly, Rsite2 runs on the RNA secondary structures. It is known that  RNA secondary structures are much easier and more feasible to be obtained by wet-lab or computational experiments than tertiary structures. This makes Rsite2 much more feasible and efficient. In addition, it would provide great convenience to the functional site prediction of large ncRNA, of which we just know about the primary sequence and the secondary structure, and facilitate the investigation of ncRNA functions. Secondly, Rsite2 provides the users with not only the source codes but also a web-server, which makes it very easy for users to predict the functional sites of their ncRNAs. With simple single-stranded chains, ncRNAs have the ability to form certain common structural units which can serve as building blocks for subsequent elaborate construction of tertiary structures. So it makes sense that overall secondary structures of ncRNAs are positively correlated with their tertiary structures. However, due to the disparity between some predicted secondary structures and known secondary structures, their 2D-3D structure correlations are weak or even negative. For instance, an RNA pseudoknot with 3D domain swapping (PDB ID: 387D), whose length is 26, shows significantly negative correlation (Spearman rho = − 0.82, p = 2e-07) if the secondary structure generated by RNAfold 51 is used. Conversely, the correlation turns into a significantly positive one (Spearman rho = 0.69, p = 9e-05) once replaced with known 2D structure taken from RNA STRAND (http:// www.rnasoft.ca/strand/) 47 .
Nonetheless, ncRNA secondary structure cannot perfectly substitute for tertiary structure in light of the inevitable information loss. It is quite difficult to precisely depict all of the more sophisticated tertiary structural motifs via secondary structural motifs. The more structural features there are unique to 3D-level, the more information loss comes from the replacement with 2D structure. Furthermore, we expect that taking other features such as sequence conservation and structure conservation into account would improve the current method.
From another perspective, the divergences between ncRNA secondary structure and tertiary structure might be hints for functional sites too. It is totally reasonable that nucleotides which seem apart in ncRNA secondary structure contact with each other in 3D space, forming a pocket that maybe catalyze chemical reactions, or a domain resembling other shapes, like a clamp. Besides, we hypothesize that, during the building process of tertiary structure, nucleotides folded inside out or outside in are candidates for functional sites. Further research about this is required.  Table 2. The known functional sites (FSs) and the Rsite2 hits of the Diels-Alder ribozyme.
The understanding of RNA folding principles is at an early stage 52,53 . When more details about RNA folding become available, we expect Rsite2 will be improved at that time. Finally, although limitations exist in the current method, we believe that the secondary structure based method (Rsite2) is an alternative solution for the prediction of the functional sites of ncRNAs.

Methods
The tertiary structure data of ncRNAs. To generally analyze the correlation between ncRNA secondary structure and tertiary structure, we first acquired the latest non-redundant PDB ID list of RNA-containing tertiary structures (Release 1.89, 2014-12-05) from NDB website (http://ndbserver.rutgers.edu/) 44,45 , and then downloaded from PDB server (http://www.rcsb.org/pdb/home/home.do) 19 the corresponding biological assemblies of the equivalence class representatives. As a result, we obtained the tertiary structures of 1034 ncRNAs.
Predicting the secondary structures of ncRNAs. From the above structure files of ncRNAs, we extracted the nucleotide sequences using an in-house python program. We then predicted the secondary structures of these ncRNAs using RNAfold 51 (v1.8.5), an easy-to-use tool for RNA secondary structure prediction. It produced for each ncRNA sequence one PostScript file, from which we further fetched out the 2D coordinates of each ncRNA molecule.
Calculating nucleotide location in ncRNA structures. We first obtained the 3D and the 2D coordinates from the PDB files and the PostScript files, respectively. We then calculated their molecular centroids based on the 3D and 2D coordinates. We presented two metrics to calculate the location of each nucleotide in an ncRNA. The first metric, nucleotide distance to centroid (NDC), is the Euclidean distances between any one nucleotide and the centroid. And we computed the sum of distances for each nucleotide between it and all the rest nucleotides in one RNA molecule as the second metric, called nucleotide distance sum (NDS). Both types of distances can quantitatively evaluate the location of one nucleotide. The outermost nucleotide has the biggest distance value and the innermost the smallest.   Correlation analysis between the 2D and the 3D structures of ncRNAs. We performed Spearman correlation analysis for the 2D and the 3D nucleotide locations calculated using the above two metrics. Before the above process, we filtered out the ncRNAs with multiple RNA chains, those no longer than 20nt, those whose coordinates are incomplete, and those whose nucleotide composition does not cover all four kinds of bases.
Identifying ncRNA functional sites. We used three ncRNAs (tRNA (Lys), Diels-Alder ribozyme and RNase P) and a set of human mitochondria transcripts with well-annotated functional sites to validate the accuracy of Rsite2. For a given ncRNA, Rsite2 first calculates the two metrics of nucleotide location defined above based on the 2D coordinates derived from the predicted secondary structure. Then, Rsite2 utilizes the algorithms of Rsite 40 to predict functional sites. In Results, the distance metric is NDC in default.

Algorithms of Rsite in brief.
For an ncRNA with n nucleotides, we can get a distance curve (D) of length n, of which D(i) denotes the location of the ith nucleotide, and smooth it with a Gaussian filter. Then we identify the nucleotides that are the extreme points in the distance curve as the putative functional sites. And multiple points close (< = 2nt in default for their sequence positions) to each other are integrated into one functional site.