RAMPred: identifying the N1-methyladenosine sites in eukaryotic transcriptomes

N1-methyladenosine (m1A) is a prominent RNA modification involved in many biological processes. Accurate identification of m1A sites is invaluable for better understanding the biological functions of m1A. However, limitations in experimental methods have hindered progress towards the identification of m1A sites. As a complement to experimental methods, a support vector machine-based method called RAMPred is proposed to identify m1A sites in the H. sapiens, M. musculus and S. cerevisiae genomes for the first time. In this method, RNA sequences are encoded using nucleotide chemical properties and nucleotide composition. RAMPred achieves promising performance in jackknife tests, cross cell line tests and cross species tests, indicating that it has high potential to become a useful tool for identifying m1A sites. For the convenience of experimental scientists, a web-server based on the proposed model was constructed and is freely accessible at http://lin.uestc.edu.cn/server/RAMPred.

Results and Discussion
m1A site identification. In statistical prediction, three cross-validation methods, i.e., the independent dataset test, the sub-sampling (or n-fold cross-validation) test, and the jackknife test, are often used to evaluate the anticipated success rate of a predictor. Among the three methods, the jackknife test is deemed the least arbitrary and most objective [11]. Therefore, the jackknife test has been increasingly adopted by researchers to examine the quality of various computational models [12-16], and it was used here to examine the performance of the proposed model. In the jackknife test, each sample in the training dataset is in turn singled out as an independent test sample, and all model parameters are calculated from the remaining samples, without the one being identified.
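The jackknife (leave-one-out) procedure described above can be sketched as follows. This is only an illustration: the nearest-centroid classifier is a toy stand-in chosen to keep the example self-contained (the paper itself uses an SVM), and the function names and the small dataset are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def fit_centroids(X: List[Vector], y: List[int]) -> Dict[int, List[float]]:
    """Toy stand-in classifier: store the mean feature vector of each class."""
    centroids: Dict[int, List[float]] = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(model: Dict[int, List[float]], x: Vector) -> int:
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(sorted(model), key=lambda label: dist(model[label]))


def jackknife_predictions(X: List[Vector], y: List[int],
                          fit: Callable, predict: Callable) -> List[int]:
    """Leave-one-out: hold out each sample once and train on all the others."""
    preds = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]   # every sample except the i-th
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        preds.append(predict(model, X[i]))
    return preds


# Tiny made-up dataset, purely illustrative.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
y = [0, 0, 0, 1, 1, 1]
preds = jackknife_predictions(X, y, fit_centroids, predict_centroid)
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
```

Because the model is refitted once per sample, the cost grows linearly with the dataset size, which is why n-fold cross-validation is often substituted when many models have to be compared.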
By encoding RNA sequences using nucleotide chemical properties and nucleotide composition, each 41-nt sequence in the dataset was converted to a 164 (4 × 41)-dimensional vector (see Materials and Methods) and used as the input of the SVM. The model thus obtained is called RAMPred (RNA N1-adenosine methylation predictor). The jackknife test results of RAMPred for identifying m1A sites in the H. sapiens, M. musculus and S. cerevisiae genomes are listed in the first four columns of Table 1. In addition, in order to objectively evaluate the performance of RAMPred in identifying m1A sites, the receiver operating characteristic curves and precision-recall curves for H. sapiens, M. musculus and S. cerevisiae were plotted and are shown in Fig. 2. The AUROC and AUPRC values of RAMPred are provided in the last two columns of Table 1. As can be seen from Table 1 and Fig. 2, the prediction accuracies of RAMPred were high for identifying m1A sites in all three species.
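As a rough sketch of how a 41-nt sequence becomes a 164-dimensional vector, the following code assigns three binary chemical-property coordinates and one cumulative nucleotide density to every position. Several details are assumptions rather than the published definition: which class receives the value 1 for each property, the per-position ordering of the four values, and the function name `encode_rna` are all hypothetical.

```python
from typing import List

# ASSUMPTION: which member of each property pair is coded as 1.
PURINES = {"A", "G"}      # x: ring structure (two rings -> 1)
WEAK_HBOND = {"A", "U"}   # y: hydrogen bond (weak -> 1)
AMINO = {"A", "C"}        # z: chemical functionality (amino -> 1)
ALPHABET = "ACGU"


def encode_rna(seq: str) -> List[float]:
    """Encode an RNA sequence as 4 values per position: (x, y, z, density)."""
    seq = seq.upper().replace("T", "U")   # accept DNA-style input as well
    counts = {nt: 0 for nt in ALPHABET}
    vector: List[float] = []
    for i, nt in enumerate(seq, start=1):
        if nt not in counts:
            raise ValueError(f"unexpected character {nt!r} at position {i}")
        counts[nt] += 1
        x = 1.0 if nt in PURINES else 0.0
        y = 1.0 if nt in WEAK_HBOND else 0.0
        z = 1.0 if nt in AMINO else 0.0
        # Density: frequency of this nucleotide within the prefix of length i.
        density = counts[nt] / i
        vector.extend([x, y, z, density])
    return vector


example = encode_rna("AUA")
```

For the short input "AUA", the second adenosine gets a density of 2/3 because two of the first three nucleotides are A; a 41-nt input yields 4 × 41 = 164 values.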
The chemical properties and nucleotide composition may play different roles in the prediction of m1A sites. In order to investigate the contribution of each feature to m1A site identification, we built a series of models and validated them on the benchmark dataset. Their predictive accuracies obtained from the jackknife test for identifying m1A sites in the H. sapiens, M. musculus and S. cerevisiae genomes are shown in Fig. 3. Among the four kinds of features (namely ring structure, hydrogen bond, chemical functionality and nucleotide composition), the model based on the ring structure yields the highest accuracy. However, its accuracy is lower than that obtained by combining all four features (Fig. 3). These results indicate that ring structure makes the largest contribution to m1A site identification in the current method, and that the other three features (hydrogen bond, chemical functionality and nucleotide composition) play complementary roles in the prediction.
In addition, to ensure that the predictive accuracy of RAMPred is not sensitive to the selection of negative data, we repeated the random sampling procedure ten times. Each time, a prediction model was built based on the positive dataset and the newly generated negative dataset. To save computational time, the other nine models were evaluated with the 10-fold cross-validation test; the four metrics defined in Eq. 4 are reported in Supplementary Tables S1-S3 for H. sapiens, M. musculus and S. cerevisiae, respectively. We found that the predictive accuracy is not affected by the selection of negative data, demonstrating the reliability and robustness of the model proposed in this study.
Comparison with other classifiers. To the best of our knowledge, there is no published computational method for identifying m1A sites. We therefore could not compare RAMPred with existing results to confirm that it is superior to other methods. However, to further test its performance, the predictive results of […] [18]. In iRNA-Methyl, the RNA sequence was formulated with the "pseudo dinucleotide composition" (PseDNC) [19-21], into which three RNA physical-chemical properties (i.e., enthalpy, entropy and free energy) were incorporated [18]. To demonstrate the effectiveness of nucleotide chemical properties and nucleotide composition for m1A site identification, a PseDNC-based SVM model was also developed. The 10-fold cross-validation test results of the PseDNC-based SVM model in identifying m1A sites on the same benchmark dataset are given in Supplementary Table S5, from which we can see that RAMPred outperforms the PseDNC-based SVM model. All these results indicate that RAMPred can be effectively used to identify m1A sites.
Cross cell line and cross species validation. m1A is a dynamic modification that responds to certain stress conditions, and its level varies among different tissues [2]. Since the training dataset of RAMPred was collected from different species and cell lines (see Materials and Methods), it is interesting to see to what extent a model trained using the data from one tissue or species recognizes the m1A sites from other tissues or species. To examine this point, we trained cell line-specific and species-specific models based on the m1A site data from different cell lines and species, and then validated them on the independent datasets from other cell lines or species. The cross cell line and cross species independent test results are given in Fig. 4.
It was found that the mammalian models trained using data from the H. sapiens and M. musculus genomes can accurately identify each other's m1A sites. Although the performances of the mammalian models for identifying m1A sites in the yeast genome are acceptable, they are lower than that obtained by the model trained on the yeast data. This result indicates that the construction of a species-specific predictor for identifying m1A sites is necessary for yeast. It was also found that the cross cell line prediction performances are satisfactory and comparable to the intra-cell line performances in the three human cell lines (i.e., HeLa, HEK293 and HepG2) and two mouse cell lines (i.e., liver and MEFs), indicating that there is no need to construct cell line-specific models to identify m1A sites in mammalian genomes.
Web-Server and User Guide. To enable applications of the proposed method and for the convenience of the community, a freely accessible online web-server called RAMPred was established. A step-by-step guide on how to use the RAMPred web-server is given as follows.
Firstly, browse the web server at http://lin.uestc.edu.cn/server/RAMPred and you will see the top page of RAMPred on your computer screen, as shown in Fig. 5. Click on the Read Me button to see a brief introduction about the predictor and the caveats when using it. Click on the Data button to download the benchmark datasets used to train RAMPred. Click on the Citation button to find the relevant papers that document the detailed development and algorithm of RAMPred.
Secondly, select the organism or species by checking the corresponding open circle. To get the anticipated prediction accuracy, the species button must be consistent with the source of the query sequences: if the query sequences are from H. sapiens, check the 'H. sapiens' button; if from M. musculus, check the 'M. musculus' button; if from S. cerevisiae, check the 'S. cerevisiae' button. Either type or copy/paste the query RNA sequences into the input box at the center of Fig. 5. The input sequences should be in FASTA format. For examples of RNA sequences in FASTA format, click the Example button right above the input box. The predicted results will be shown on the computer screen after clicking on the Submit button.
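For readers preparing input offline, the FASTA format expected by the input box can be checked with a few lines of code. The sketch below is a generic FASTA reader, not the server's own parser; the function name `parse_fasta`, the record names and the sequences are hypothetical.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())   # sequences may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical query with two records; the second sequence is wrapped.
query = """>query_1
GGAUACC
>query_2
ACGU
acgu
"""
records = parse_fasta(query)
```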

Conclusions
By using nucleotide chemical properties and nucleotide composition, we developed, for the first time, a support vector machine-based model to identify m1A sites in the H. sapiens, M. musculus and S. cerevisiae genomes. The jackknife test results on the rigorous benchmark datasets demonstrate that the proposed method, RAMPred, is very promising for identifying m1A sites in the three eukaryotic genomes.
To identify the key features for m1A site identification, we compared the predictive results obtained using different kinds of parameters and found that the ring structure makes the largest contribution to m1A site identification. This result holds for all three genomes and is consistent with the following fact. N1-methylation on RNA adenosine occurs at the Watson-Crick interface and is catalyzed by methyltransferases that need to recognize and bind to specific genomic regions [22]. Therefore, the nucleotide ring structure could facilitate the π-cation/π-π/van der Waals contacts between methyltransferases and the RNA sequence.
In order to rigorously evaluate its performance, we also tested the proposed method by performing cross cell line and cross species validations. It is encouraging to see that the cross cell line performances are good, indicating that our method is stable for identifying m1A sites in mammalian genomes. We also noticed that the performances of the mammalian-based models for identifying yeast m1A sites are lower than that of the yeast-specific one, and vice versa.
As an epigenetic modification, RNA methylation is a complicated process. Besides sequence context and nucleotide chemical properties, other factors may also be helpful for m1A site identification. For example, it has been demonstrated that m1A correlates with elevated translation, is enriched in the 5′-untranslated region and coding sequence, and is overrepresented in the start codon upstream of the first splice site [2,10]. In addition, high-resolution experimental data with quantitative information about m1A modification are also highly desirable, which would aid the representation of the sequence context surrounding the m1A sites. For a better understanding of the biological function of N1-methylation on RNA adenosine, we will combine all these factors and develop new models to improve the predictor's performance for m1A site identification in future work.

Materials and Methods
Datasets. Based on the MeRIP-seq technique, Dominissini and colleagues obtained the m1A peaks in the H. sapiens, M. musculus and S. cerevisiae genomes [2]. By mapping these peaks to the H. sapiens (hg19), M. musculus (mm10) and S. cerevisiae genomes, respectively, we obtained m1A site-containing sequences for these three genomes. Preliminary trials showed that the predictive results were most promising when the sequences in the benchmark dataset were 41 nt long with the m1A in the center. Accordingly, we focus on RNA sequences of 41 nucleotides in the current study.
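The 41-nt window with the candidate adenosine in the center (20 flanking nucleotides on each side) can be extracted as in the sketch below. How sites closer than 20 nt to a transcript end were handled is not stated in the text, so skipping them is an assumption, and `centered_windows` is a hypothetical helper name.

```python
from typing import List, Tuple


def centered_windows(seq: str, center: str = "A",
                     flank: int = 20) -> List[Tuple[int, str]]:
    """Return (0-based position, window) for each `center` nucleotide that
    has `flank` nucleotides on both sides; window length is 2 * flank + 1."""
    seq = seq.upper()
    windows = []
    # ASSUMPTION: positions too close to either end are simply skipped.
    for i in range(flank, len(seq) - flank):
        if seq[i] == center:
            windows.append((i, seq[i - flank:i + flank + 1]))
    return windows


# A made-up sequence with a single adenosine exactly in the middle.
toy = "C" * 20 + "A" + "G" * 20
sites = centered_windows(toy)

# Smaller flank, to make the behaviour easy to inspect by eye.
small = centered_windows("GGAUACC", flank=2)
```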
To overcome redundancy and reduce homology bias, sequences with more than 80% sequence similarity were removed using the CD-HIT program [23]. After this screening procedure, we obtained 6,366, 1,064 and 483 m1A site-containing sequences and deemed them the positive samples of H. sapiens, M. musculus and S. cerevisiae, respectively. If the sequence identity cutoff were set to a lower percentage, such as 40%, the results would be more objective and reliable. However, we did not use such a stringent criterion in this study because the currently available data do not allow it: the number of samples would be too small to have statistical significance.
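The effect of an 80% identity cutoff can be illustrated with a simplified greedy filter. This is not the CD-HIT algorithm: it assumes equal-length, ungapped sequences and compares them position by position, and the helper names and toy sequences are hypothetical.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this simplified measure needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def remove_redundant(seqs: List[str], threshold: float = 0.8) -> List[str]:
    """Greedily keep a sequence only if its identity to every sequence kept
    so far does not exceed `threshold` (i.e. drop those with > 80% identity)."""
    kept: List[str] = []
    for s in seqs:
        if all(identity(s, k) <= threshold for k in kept):
            kept.append(s)
    return kept


# Toy 10-nt sequences; the second is 90% identical to the first.
toy_seqs = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC", "AAAAAAAACC"]
non_redundant = remove_redundant(toy_seqs)
```

The greedy order matters: the first sequence encountered in a group of near-duplicates is the one retained, so the input order should be fixed if results need to be reproducible.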
The negative samples in each species were obtained by choosing 41-nt sequences in which the central adenosine was not detected as methylated by the MeRIP-seq technique. In this way, a great number of negative samples could be obtained for each species, far more than the number of positive samples. In machine-learning problems, imbalanced datasets can significantly affect the performance evaluation of learning methods. To balance the numbers of positive and negative samples in model training, we randomly picked 6,366, 1,064 and 483 sequences to form the negative samples for H. sapiens, M. musculus and S. cerevisiae, respectively. To demonstrate the robustness of the proposed model, we repeated the random sampling procedure ten times and obtained ten random negative datasets for downstream training and prediction for each species.
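The balanced negative sampling, repeated ten times, can be sketched as below. The fixed random seed and the function name `sample_negative_sets` are assumptions added for reproducibility of the illustration; the pool of candidates is made up.

```python
import random
from typing import List, Sequence


def sample_negative_sets(negatives: Sequence[str], n_positive: int,
                         n_repeats: int = 10, seed: int = 0) -> List[List[str]]:
    """Draw `n_repeats` independent negative sets, each the same size as the
    positive set, by sampling without replacement from the full pool."""
    rng = random.Random(seed)   # ASSUMPTION: seeded only for reproducibility
    return [rng.sample(list(negatives), n_positive) for _ in range(n_repeats)]


# Hypothetical pool of 50 candidate negatives and 5 positive samples.
pool = [f"neg_{i}" for i in range(50)]
negative_sets = sample_negative_sets(pool, n_positive=5)
```

Each of the ten sets is then paired with the same positive set to train and evaluate one model, so that the spread of the resulting metrics reflects the sensitivity to the choice of negatives.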
According to the work of Dominissini and colleagues [2], the m1A site-containing sequences in H. sapiens were from three cell lines, namely, the HeLa (cervical adenocarcinoma), HepG2 (hepatocellular carcinoma) and HEK293 (embryonic kidney) cell lines, and those in M. musculus were from two cell lines, namely, primary mouse embryonic fibroblasts (MEFs) and liver cell lines. To further validate the performance of the proposed method, we also built cell line-specific datasets for H. sapiens and M. musculus, respectively. The numbers of positive and negative samples in the cell line-specific datasets are shown in Fig. 6. All the data are available at http://lin.uestc.edu.cn/server/RAM/data.
Representation of RNA sequences. Motivated by their success in identifying post-transcriptional RNA modifications [24,25], nucleotide chemical properties and nucleotide composition were used to represent RNA sequences for identifying m1A sites in the present work. Below is a brief description of how RNA sequences are encoded using nucleotide chemical properties and nucleotide composition.
Scientific Reports | 6:31080 | DOI: 10.1038/srep31080
RNA is composed of four nucleotides, namely, adenine (A), guanine (G), cytosine (C) and uracil (U). These four bases have different chemical properties. In terms of ring structure, adenine and guanine are purines that have two rings, while cytosine and uracil are pyrimidines that have one ring. When forming secondary structures, guanine and cytosine form strong hydrogen bonds, whereas adenine and uracil form weak hydrogen bonds. In terms of chemical functionality, adenine and cytosine can be classified into the amino group, while guanine and uracil belong to the keto group.
In order to include these chemical properties in the RNA encoding, three coordinates (x, y, z) were used to represent the chemical properties of the four nucleotides, each assigned a value of 1 or 0 [24,26]. The x coordinate stands for the ring structure, y for the hydrogen bond, and z for the chemical functionality. Hence, each nucleotide in an RNA sequence can be encoded by (x_i, y_i, z_i) [24,25]. For the nucleotide composition, l is the sequence length, |N_i| is the length of the i-th prefix string {n_1, n_2, …, n_i} in the sequence, and q ∈ {A, C, G, U}. Therefore, by integrating nucleotide chemical properties and nucleotide composition, a sequence of length l is encoded by a (4 × l)-dimensional vector. An example of encoding an RNA sequence using nucleotide chemical properties and nucleotide composition is shown in Fig. 7.
Support Vector Machine. The support vector machine (SVM) is a powerful and popular method for pattern recognition and is widely used in bioinformatics [18,27-29]. The basic idea of SVM is to transform the input data into a high-dimensional feature space and then determine the optimal separating hyperplane. In the current study, the LibSVM package 3.18 (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) was used to implement the SVM. Owing to its effectiveness and speed in the training process, the radial basis function (RBF) kernel was used to obtain the classification hyperplane. The regularization parameter C and the kernel parameter γ were optimized using a grid search approach.
In Eq. 4, TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively. The ROC (receiver operating characteristic) curve [30], which plots the true positive rate (sensitivity) against the false positive rate (1 − specificity), was also used to evaluate the performance of the current method.
The best possible prediction method would yield a point with the coordinate (0, 1), representing 100% sensitivity and a false positive rate of 0 (100% specificity). Therefore, the (0, 1) point is also called a perfect classification. A completely random guess would give a point along the diagonal from the point (0, 0) to (1, 1). The area under the ROC curve, also called AUROC, is often used to indicate the performance quality of a binary classifier: an AUROC of 0.5 is equivalent to random prediction, while an AUROC of 1 represents a perfect one. To examine the performance of the proposed predictor when restricted to low false positive rates, the precision-recall curve was also plotted, which plots precision (the fraction of TP among all predicted positives) against recall (sensitivity). The area under the precision-recall curve (AUPRC) was used as the corresponding summary measure.
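The evaluation measures discussed above can be computed as in the following sketch. The text refers to four metrics built from TP, TN, FP and FN without reproducing their formulas here; the sketch assumes they are the standard sensitivity (Sn), specificity (Sp), accuracy (Acc) and Matthews correlation coefficient (MCC). AUROC is computed through its rank interpretation, i.e. the probability that a randomly chosen positive scores higher than a randomly chosen negative. All function names and numbers are hypothetical.

```python
import math
from typing import Dict, Sequence


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """ASSUMED metric set: standard Sn, Sp, Acc and MCC definitions."""
    sn = tp / (tp + fn)                         # sensitivity (recall)
    sp = tn / (tn + fp)                         # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)       # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positive and
    negative scores; ties count as half a win."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUROC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up counts and scores, purely illustrative.
metrics = confusion_metrics(tp=8, tn=9, fp=1, fn=2)
perfect = auroc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])
chance = auroc([0.9, 0.3, 0.8, 0.4], [1, 1, 0, 0])
```

The pairwise AUROC is quadratic in the number of samples, which is acceptable for a sketch; sorting-based implementations are preferable for datasets of several thousand sequences.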