Introduction

Methylation of adenosine at the N6 position (N6-methyladenosine, m6A) is the most prevalent form of RNA modification and is found in all three domains of life1. The formation of m6A is catalyzed by an evolutionarily conserved, multi-component enzyme2. Unlike adenosine-to-inosine editing, m6A does not alter the coding capacity of transcripts3,4. However, m6A has been shown to be associated with a number of key biological processes, including mRNA splicing, export, stability and immune tolerance5,6,7. Moreover, m6A has been reported to be closely correlated with mammalian brain development8, and a regulatory role of m6A in cell division has been reported in plants9.

By using high-throughput techniques such as MeRIP-Seq8 and m6A-seq10, the distribution of m6A has been characterized in the human and mouse transcriptomes8. The experimental results revealed that m6A sites tend to occur near the stop codon, in the 3′ UTR and within long internal exons8,11. The nonrandom distribution of m6A sites across the genome is highly conserved from yeasts to humans11,12, suggesting that m6A modification is both fundamental and important for organisms. The experimental results also demonstrated that the m6A sites identified in yeast harbor the RGAC (R = A/G) consensus motif12, reminiscent of the mammalian RRACU (R = A/G) motif11. Similar to epigenetic DNA and histone modifications, m6A modification is dynamic and reversible: the m6A patterns change in different cell types10 or when cells are stressed12.

The experimental methods have yielded encouraging results and have advanced research on the distribution of m6A in the transcriptome. However, the resolution of both m6A-seq and MeRIP-Seq is low, only ~24 nt (nucleotides) around the methylated adenosine11, so these methods cannot pinpoint which adenosine residue is actually modified. In addition, current experimental methods are both costly and time consuming. It is therefore necessary to develop new methods for studying the distribution and function of m6A. As complements to experimental techniques, computational methods can speed up genome-wide m6A detection.

However, to the best of our knowledge, there is no computational tool available for the discovery of m6A. In the present study, we propose a support vector machine based method to identify the m6A sites in the Saccharomyces cerevisiae genome. By using the nucleotide chemical property and accumulated nucleotide frequency information, the sequence-order effects and nucleotide physicochemical properties are integrated together in the proposed model. In the jackknife test, an overall accuracy of 78.15% is achieved in identifying the m6A sites in the benchmark dataset. For the convenience of the experimental scientists, a web-server for the proposed model is provided at http://lin.uestc.edu.cn/server/m6Apred.php.

Results

Nucleotide preference

To understand the nucleotide preference surrounding m6A sites, we computed sequence logos of the 10 upstream and 10 downstream nucleotides in the benchmark dataset using WebLogo13. As shown in Fig. 1, besides the well-known consensus motif RGAC (R = A/G) located at positions −2 to +1 relative to the m6A site (position 0)12, strong nucleotide preferences were also observed in both the upstream and downstream sequences surrounding the m6A site. Adenines are favored at positions −4, −3 and −2, whereas uracils are favored at positions +2 to +4. In contrast, except for the RGAC (R = A/G) motif located at positions −2 to +1, no clear nucleotide preference was observed surrounding the unmethylated adenosines.
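The position-specific preferences summarized in a sequence logo can be approximated by counting nucleotides at each offset around the central adenosine. The following Python sketch is only an illustration: the function name `position_frequencies` and the toy windows are assumptions, and the information-content scaling that WebLogo applies is not reproduced.

```python
from collections import Counter


def position_frequencies(seqs, flank=10):
    """Return {offset: {nucleotide: fraction}} for equal-length windows
    centred on the candidate adenosine (offset 0)."""
    width = 2 * flank + 1
    for s in seqs:
        if len(s) != width:
            raise ValueError("every sequence must be %d nt long" % width)
    table = {}
    for col in range(width):
        counts = Counter(s[col] for s in seqs)
        total = sum(counts.values())
        table[col - flank] = {nt: counts.get(nt, 0) / total for nt in "ACGU"}
    return table


# Toy 5-nt windows (flank=2); these are NOT sequences from the benchmark dataset.
toy = ["GGACU", "AGACA", "AGACU", "GGACC"]
freqs = position_frequencies(toy, flank=2)
```

With real data, `flank=10` would give the 21 columns shown in Fig. 1, and the same function could be applied separately to the positive and negative samples to compare them.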

Figure 1

Sequence logo of the 10 upstream and 10 downstream nucleotides surrounding m6A sites.

m6A sites identification

Three cross-validation methods, the sub-sampling (or K-fold cross-validation) test, the independent dataset test and the jackknife test, are often used to evaluate the quality of a predictor. Among the three methods, the jackknife test is deemed as the least arbitrary and most objective14 and hence has been widely recognized and increasingly adopted by investigators to examine the quality of various predictors15,16,17. Accordingly, the jackknife test was used to examine the performance of the model proposed in the current study. In the jackknife test, each sample in the training dataset is in turn singled out as an independent test sample and all the properties are calculated without including the one being identified.
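The jackknife (leave-one-out) procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names are hypothetical, and a nearest-centroid classifier stands in for the SVM used in the study so that the example stays dependency-free.

```python
def jackknife_accuracy(samples, labels, train, predict):
    """Leave-one-out: hold out each sample once, train on the rest,
    and score the prediction made for the held-out sample."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Stand-in classifier (nearest centroid), NOT the SVM used in the study.
def train_centroids(xs, ys):
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda y: dist2(model[y]))


# Toy, well-separated data; not taken from the benchmark dataset.
toy_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.8, 1.0], [0.9, 1.1]]
toy_y = [0, 0, 0, 1, 1, 1]
acc = jackknife_accuracy(toy_x, toy_y, train_centroids, predict_centroid)
```

Because the model is retrained once per sample, the cost grows linearly with the dataset size, which is acceptable for a benchmark of this scale.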

To compare the contributions of the features to m6A site identification, we first performed predictions using the individual nucleotide chemical properties and their combinations. The predictive results are reported in Table 1. Among the three kinds of nucleotide chemical properties, the hydrogen bond yields the highest predictive accuracy (71.32%), indicating that it makes the largest contribution to m6A site identification. However, the predictive accuracies obtained by using each kind of nucleotide chemical property alone are all lower than that obtained by using all three kinds together (Table 1).

Table 1 The predictive results by using different features for m6A identification.

Considering the observed nucleotide preference surrounding the m6A sites (Fig. 1) and the above results, the accumulated nucleotide frequency and nucleotide chemical property were combined to encode the sequences in the training dataset. Hence, each 21-bp long sequence in the dataset was represented by an 84 (4×21)-dimensional vector (see Methods) and used as the input of SVM to train the model for identifying m6A sites. In the jackknife test, the proposed model obtained an accuracy of 78.15% with a sensitivity of 79.21% and specificity of 77.04% (Table 1). The predictive accuracy thus obtained is higher than that obtained by merely using nucleotide chemical properties (75.87%), indicating that nucleotide frequency contributes slightly to the identification of m6A sites.

As the performance of the proposed model may depend on the threshold, three thresholds (high, medium and low) were selected from the jackknife test, corresponding to specificity values of 95%, 90% and 85%, respectively, similar to a recent work18. The predictive performances of the proposed model at these thresholds are reported in Table 2. In addition, to illustrate graphically how the performance of the model changes as its discrimination threshold varies, the ROC curve was plotted in Fig. 2, and an AUROC of 0.84 was obtained.
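One way to derive a score cutoff for a fixed specificity is to read it off the sorted scores of the negative samples. The sketch below is an assumption about how such thresholds could be obtained, not the procedure documented for the web-server; the function names and toy scores are hypothetical.

```python
import math


def threshold_for_specificity(negative_scores, target_sp):
    """Smallest cutoff t such that calling a site positive when score > t
    leaves at least target_sp of the negatives classified as negative."""
    ordered = sorted(negative_scores)
    # Number of negatives that must fall at or below the cutoff
    # (the small epsilon guards against floating-point round-up).
    need = math.ceil(target_sp * len(ordered) - 1e-9)
    need = min(max(need, 1), len(ordered))
    return ordered[need - 1]


def sensitivity_at(positive_scores, cutoff):
    """Fraction of positives whose score exceeds the cutoff."""
    return sum(s > cutoff for s in positive_scores) / len(positive_scores)


# Toy probability scores; not results from the study.
neg = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70]
pos = [0.90, 0.80, 0.60, 0.50, 0.30]
cutoff = threshold_for_specificity(neg, 0.80)
sn = sensitivity_at(pos, cutoff)
```

Raising the target specificity moves the cutoff upward and can only lower or preserve the sensitivity, which is the trade-off summarized in Table 2.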

Table 2 Performance of the proposed model at different thresholds on jackknife test.
Figure 2

A graphical illustration to show the performance of the model by means of the ROC curve.

The vertical coordinate is the true positive rate (Sn) while horizontal coordinate is the false positive rate (1-Sp). The area under the ROC curve (AUROC) is 0.84.

To ensure that the predictive accuracy is not sensitive to the selection of negative data, we repeated the random sampling procedure ten times and obtained ten random samples of negative datasets for downstream training and prediction. The predictive results of these models for identifying m6A sites in the jackknife test are reported in Supplementary Table S1. We found that the predictive accuracy is not affected by the selection of negative data.

In addition, the proposed model was evaluated on the independent testing dataset (see Methods). On this dataset, which has a positive-to-negative ratio of 1:10, the proposed model obtained an accuracy of 75.73% with a sensitivity of 53.89% and a specificity of 79.07% for identifying m6A sites. The precision-recall curve, which plots the corresponding precision-recall pairs over a range of threshold values, is shown in Supplementary Figure S1. These results demonstrate the reliability of the model developed in this study.

Comparison with other classifiers

To further demonstrate the power of the proposed method, we also did some comparative calculations as described below.

First, based on the sequence similarity principle, we used the classic sequence similarity search tool BLAST19 to conduct the jackknife test on the same benchmark dataset. The results thus obtained are given in Table 3, from which we can see that the accuracy (Acc) obtained by BLAST is about 10% lower than that of the proposed model for m6A identification.

Table 3 Comparison of different classifiers for m6A identification.

Second, we compared the predictive results of the proposed method with those of four other commonly used classifiers, i.e., Naïve Bayes20, Logistic Function21, RBFNetwork22 and Random Forest23, as implemented in WEKA24. The jackknife test results of the different classifiers for identifying m6A sites in the benchmark dataset are listed in Table 3. The sensitivity, specificity, accuracy and AUROC of the proposed SVM model are all higher than those of Naïve Bayes, Logistic Function and Random Forest. Although the specificity of the proposed method is lower than that of RBFNetwork, its sensitivity, accuracy and AUROC are all higher than those of RBFNetwork. Hence, these results suggest that the proposed method is promising and has the potential to become a useful tool for m6A identification.

Web-server

To enable applications of the proposed model and for the convenience of experimental scientists, an online predictor was created. A step-by-step guide on how to use it is provided as follows:

Step 1. Open the web server at http://lin.uestc.edu.cn/server/m6Apred.php and you will see the top page on your computer screen, as shown in Fig. 3. Click on the Read Me button to see a brief introduction about the predictor and the caveat when using it.

Figure 3

A semi-screenshot for the top page of the web-server at http://lin.uestc.edu.cn/server/m6Apred.php.

Step 2. Select a threshold (All, High, Medium or Low, as reported in Table 2) by clicking the corresponding open circle. Either type or copy/paste the query RNA sequences into the input box at the center of Fig. 3. The input sequences should be in FASTA format. A sequence in FASTA format consists of a single initial line beginning with a greater-than symbol (“>”) in the first column, followed by lines of sequence data. The words right after the “>” symbol in the initial line are optional and are only used for identification and description. All lines should be no longer than 120 characters and usually do not exceed 80 characters. A sequence ends when another line starting with “>” appears; this indicates the start of another sequence. Example sequences in FASTA format can be seen by clicking on the Example button right above the input box.
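For readers preparing input programmatically, the FASTA layout described in Step 2 can be parsed as follows. This Python sketch is illustrative only; `parse_fasta` is a hypothetical helper, not part of the web-server, and the upper-casing of sequence data is an added convenience.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new '>' line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Toy input; the sequences are invented for illustration.
records = list(parse_fasta(">seq1 demo\nACGU\nGGAC\n>seq2\nuuaa\n"))
```

Sequence data split over several lines is joined into one string per record, so the 80- or 120-character line limit does not affect the parsed result.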

Step 3. Click on the Submit button to see the predicted result. For example, if the query RNA sequences in the Example window are used as the input, the following results will be shown on the screen. The outcome for the 1st query example is: the ‘A’ at position 11 is methylated with a probability of 0.92 and the ‘A’ at position 32 is also methylated with a probability of 0.92. The outcome for the 2nd query example is: the ‘A’ at position 11 is unmethylated with a probability of 0.96. All these results are consistent with the experimental observations.

Step 4. Click on the Data button to download the datasets used to train and test the model.

Step 5. Click on the Citation button to find the relevant paper that reports the detailed development and algorithm of the model.

Caveats

Each input query sequence must be 21 bp or longer and must contain only the valid characters ‘A’, ‘C’, ‘G’ and ‘U’.
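The two caveats can be checked before submission with a short routine such as the one below. It is a sketch under assumptions: `validate_query` is a hypothetical helper, and how the server itself reports invalid input is not described in the text.

```python
MIN_LENGTH = 21          # minimum query length stated in the caveat
VALID = set("ACGU")      # valid RNA characters stated in the caveat


def validate_query(seq):
    """Return a list of problems; an empty list means the sequence is usable."""
    problems = []
    if len(seq) < MIN_LENGTH:
        problems.append(
            "sequence is %d nt; at least %d nt required" % (len(seq), MIN_LENGTH)
        )
    bad = sorted(set(seq) - VALID)
    if bad:
        problems.append("invalid characters: " + ", ".join(bad))
    return problems


# Toy queries, invented for illustration.
ok = validate_query("ACGU" * 6)        # 24 nt, RNA alphabet only
dna = validate_query("ACGT" * 6)       # contains 'T'
short = validate_query("ACGUACGU")     # only 8 nt
```

A DNA sequence can be made acceptable by replacing every ‘T’ with ‘U’ before submission.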

Conclusions

By using the nucleotide chemical property and nucleotide density information, we proposed a support vector machine based model to identify m6A sites in the Saccharomyces cerevisiae transcriptome. To identify the key features for m6A site identification, we compared the predictive results obtained by using different kinds of parameters (Table 1). In comparison with the accumulated nucleotide frequency, the nucleotide chemical property plays the more important role in m6A site identification. Among the three nucleotide chemical properties considered, the hydrogen bond makes the largest contribution to m6A site identification, consistent with the recent finding that the hydrogen bond is implicated in the formation of RNA secondary structure25, which decreases m6A methylation26.

In addition, we compared the predictive accuracy of SVM with that of four other commonly used classification methods for m6A site identification. We found that the predictive result of SVM is better than those of Naïve Bayes, Logistic Function and Random Forest. This is likely due to the limited number of experimentally validated m6A sites that were used to train the models. Naïve Bayes, Logistic Function and Random Forest require a large number of samples to train, whereas SVM needs fewer training data.

For the convenience of researchers in the scientific community, a web-server for the proposed model is provided. We hope that these results will provide further insights into the understanding of the distribution and function of m6A modifications. As the current method is only applicable to Saccharomyces cerevisiae, future work will expand to other species to train and improve the model.

Methods

Dataset

By using the m6A-seq technique, Schwartz et al. identified 1,307 methylated adenine (m6A) sites centered around RGAC motifs from 1,183 genes in Saccharomyces cerevisiae12. In order to obtain a high quality training dataset and avoid experimental bias, the 832 m6A sites located less than 10 bp from the detected m6A-seq peaks were selected as the positive samples of the training dataset12. The pairwise sequence similarity among the positive training samples is less than 85%. The remaining 475 (1,307 − 832 = 475) m6A sites were used to construct the independent testing dataset. The negative samples were obtained by the following steps. By searching the Saccharomyces cerevisiae genome, we obtained 33,280 adenines centered around the RGAC consensus motif that were not detected by the m6A-seq technique. These 33,280 adenines were therefore deemed non-methylated adenines. To balance the numbers of positive and negative samples in model training, we randomly picked 832 samples from the 33,280 non-methylated adenines and used them as negative samples. Following these procedures, we obtained a benchmark dataset including 832 sequences containing m6A sites and 832 sequences containing non-m6A sites. To examine whether the predictive accuracy is sensitive to the selection of negative data, we repeated the random sampling procedure ten times and obtained ten random samples of negative datasets for downstream training and prediction. We also randomly fetched 4,750 negative samples from the ten negative datasets and merged them with the above mentioned 475 samples in the testing dataset. By doing so, an independent testing dataset with a positive-to-negative ratio of 1:10 (475:4,750) was obtained.
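The repeated balanced sampling of negative samples can be sketched as below. The function name, the fixed seed and the use of integer identifiers in place of real adenine sites are assumptions made for illustration; the random number generator used in the study is not stated.

```python
import random


def sample_negative_sets(negatives, n_positive, n_repeats=10, seed=0):
    """Draw n_repeats random negative subsets, each the size of the positive set.
    Sampling is without replacement within a subset; subsets are drawn independently."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return [rng.sample(negatives, n_positive) for _ in range(n_repeats)]


# Integer IDs stand in for the 33,280 non-methylated adenines.
candidate_ids = list(range(33280))
negative_sets = sample_negative_sets(candidate_ids, n_positive=832)
```

Each of the ten subsets would then be paired with the same 832 positive samples to train and evaluate a separate model.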

It was observed via preliminary trials that when the length of the sequences in the benchmark dataset is 21 bp with the m6A in the center, the corresponding predictive results were most promising. Accordingly, all the sequences in the training and testing dataset are 21 bp long and are available at http://lin.uestc.edu.cn/server/m6Apred.php.

Sequence encoding

One of the keys in developing a model for identifying genomic attributes is to encode the biological samples with effective expressions. In the present study, nucleotide chemical properties and density information of each nucleotide in RNA sequences were considered.

Chemical property of each nucleotide

Four different kinds of nucleotides are found in RNA, i.e., adenine (A), guanine (G), cytosine (C) and uracil (U). Each nucleotide has a different chemical structure and different chemical bonding. As shown in Fig. 4, adenine and guanine have two rings, while cytosine and uracil have only one ring. Although RNA is generally single stranded, its biological functions are correlated with its secondary structure. When forming secondary structures, guanine and cytosine form strong hydrogen bonds, whereas adenine and uracil form weak hydrogen bonds. Additionally, in terms of chemical functionality, adenine and cytosine can be classified into the amino group, while guanine and uracil belong to the keto group. Therefore, the four kinds of nucleotides can be classified into three different groups in terms of these chemical properties (Table 4).

Table 4 Chemical property of nucleotide in RNA sequence.
Figure 4

Chemical structure of each nucleotide.

In order to include these chemical properties in RNA encoding, we define three coordinates (x, y, z) to represent the three chemical groups and assign each a value of 1 or 0. Hence, each nucleotide si = (xi, yi, zi) in the sequence can be encoded by the following formula27:

$$
x_i = \begin{cases} 1 & \text{if } s_i \in \{A, G\} \\ 0 & \text{if } s_i \in \{C, U\} \end{cases}, \qquad
y_i = \begin{cases} 1 & \text{if } s_i \in \{A, C\} \\ 0 & \text{if } s_i \in \{G, U\} \end{cases}, \qquad
z_i = \begin{cases} 1 & \text{if } s_i \in \{A, U\} \\ 0 & \text{if } s_i \in \{C, G\} \end{cases}
$$

where the coordinate values of each nucleotide are determined by its chemical properties as shown in Table 4. Thus, based on chemical properties, A is represented by the coordinates (1, 1, 1), C by (0, 1, 0), G by (1, 0, 0) and U by (0, 0, 1).
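The coordinates listed above translate directly into a lookup table. The following Python sketch shows the mapping; the names `CHEMICAL` and `encode_chemical` are hypothetical and chosen only for illustration.

```python
# (ring structure, functional group, hydrogen bond) coordinates,
# copied from the values given in the text for each nucleotide.
CHEMICAL = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}


def encode_chemical(seq):
    """Map an RNA sequence to a list of (x, y, z) chemical-property tuples."""
    try:
        return [CHEMICAL[nt] for nt in seq]
    except KeyError as err:
        raise ValueError("unsupported nucleotide: %s" % err.args[0])


encoded = encode_chemical("GGACU")
```

Using only one of the three coordinates per nucleotide would correspond to the single-property models compared in Table 1.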

Accumulated nucleotide frequency

In order to include the nucleotide frequency information and the distribution of each nucleotide in the RNA sequence, we define the density di of the nucleotide si at position i in the RNA sequence by the following formula26:

$$
d_i = \frac{1}{|S_i|} \sum_{j=1}^{i} f(s_j), \qquad
f(s_j) = \begin{cases} 1 & \text{if } s_j = q \\ 0 & \text{otherwise} \end{cases}, \qquad i = 1, 2, \ldots, l
$$

where l is the sequence length, |Si| is the length of the i-th prefix string {s1, s2, …, si} in the sequence, and q ∈ {A, C, G, U} is the nucleotide found at position i. As an example, consider the sequence “UCGUUCAUGG”. The density of ‘U’ is 1 (1/1), 0.5 (2/4), 0.6 (3/5) and 0.5 (4/8) at positions 1, 4, 5 and 8, respectively. The density of ‘C’ is 0.5 (1/2) and 0.33 (2/6) at positions 2 and 6, respectively. The density of ‘G’ is 0.33 (1/3), 0.22 (2/9) and 0.3 (3/10) at positions 3, 9 and 10, respectively. The density of ‘A’ is 0.14 (1/7) at position 7.

By integrating both the nucleotide chemical property and accumulated nucleotide information, the sample sequence “UCGUUCAUGG” can be represented by {(0, 0, 1, 1), (0, 1, 0, 0.5), (1, 0, 0, 0.33), (0, 0, 1, 0.5), (0, 0, 1, 0.6), (0, 1, 0, 0.33), (1, 1, 1, 0.14), (0, 0, 1, 0.5), (1, 0, 0, 0.22), (1, 0, 0, 0.3)}. By doing so, not only the chemical property was considered, but also the long range sequence order information was incorporated. Therefore, the samples in the benchmark dataset were encoded in terms of both nucleotide chemical property and nucleotide densities.
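The combined encoding can be reproduced with a running count per nucleotide. This Python sketch follows the worked example for “UCGUUCAUGG”; the function names are hypothetical, and the chemical-property table is repeated so that the block is self-contained.

```python
CHEMICAL = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}


def nucleotide_density(seq):
    """Density at position i: occurrences of seq[i] in the prefix of length i,
    divided by i (positions are 1-based)."""
    counts = {}
    densities = []
    for i, nt in enumerate(seq, start=1):
        counts[nt] = counts.get(nt, 0) + 1
        densities.append(counts[nt] / i)
    return densities


def encode_sequence(seq):
    """Flatten (x, y, z, density) for every position into one feature vector."""
    vector = []
    for nt, density in zip(seq, nucleotide_density(seq)):
        vector.extend(CHEMICAL[nt])
        vector.append(density)
    return vector


densities = [round(d, 2) for d in nucleotide_density("UCGUUCAUGG")]
features = encode_sequence("UCGUUCAUGG")
```

A 21-nt sample yields 4 × 21 = 84 values, which matches the dimensionality of the SVM input described in the Results.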

Support vector machine

The SVM classification algorithm has been widely used in the realm of bioinformatics28,29,30. Its basic principle is to transform the input vector into a high-dimensional Hilbert space and seek a separating hyperplane with the maximal margin in this space.

In this study, the libsvm-3.18 package was used as an implementation of SVM, which can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. Because of its effectiveness and speed in nonlinear classification process, the radial basis kernel function (RBF) was selected to perform the prediction. A grid search method was used to optimize the regularization parameter C and kernel parameter γ. The probability score obtained from SVM was used to make predictions.
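The radial basis kernel and the grid search over C and γ can be illustrated without any SVM library. In the sketch below, the exponent ranges for the grid are placeholders (the ranges used in the study are not reported), and `evaluate` is a hypothetical callback that would, in practice, train an SVM with the given parameters and return its cross-validated accuracy.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def grid_search(evaluate, log2_c=range(-5, 16, 2), log2_gamma=range(-15, 4, 2)):
    """Try every (C, gamma) pair on a log2 grid and keep the best score.
    The default exponent ranges are assumptions, not values from the study."""
    best = None
    for exp_c, exp_g in product(log2_c, log2_gamma):
        c, gamma = 2.0 ** exp_c, 2.0 ** exp_g
        score = evaluate(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best


# Toy objective with a known optimum at C = 2^3, gamma = 2^-5 (illustration only).
def toy_score(c, gamma):
    return -abs(math.log2(c) - 3) - abs(math.log2(gamma) + 5)


best_score, best_c, best_gamma = grid_search(toy_score)
```

An exponential grid is a common choice because SVM performance typically varies smoothly on a logarithmic scale of both parameters.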

Performance evaluations

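The performance of the model was evaluated using the following metrics: sensitivity (Sn), also named recall, specificity (Sp), precision and accuracy (Acc), which can be expressed as

$$
Sn = \frac{TP}{TP + FN}, \qquad
Sp = \frac{TN}{TN + FP}, \qquad
Precision = \frac{TP}{TP + FP}, \qquad
Acc = \frac{TP + TN}{TP + TN + FP + FN}
$$

where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively.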
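The four metrics follow directly from the confusion-matrix counts under their standard definitions. The Python sketch below uses a hypothetical helper, `confusion_metrics`, and toy counts that are not results from the study.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, precision and accuracy from confusion counts."""
    return {
        "Sn": tp / (tp + fn),                      # recall on positives
        "Sp": tn / (tn + fp),                      # recall on negatives
        "Precision": tp / (tp + fp),               # correctness of positive calls
        "Acc": (tp + tn) / (tp + tn + fp + fn),    # overall correctness
    }


# Toy counts chosen for easy arithmetic.
metrics = confusion_metrics(tp=8, tn=6, fp=4, fn=2)
```

On a balanced dataset Acc is the mean of Sn and Sp, whereas on an imbalanced dataset it is dominated by the larger class, which is why precision is also reported for the 1:10 testing dataset.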

In addition, to illustrate graphically how the performance of the model changes as its discrimination threshold varies, the ROC (receiver operating characteristic) curve was created, in which the vertical coordinate is the true positive rate and the horizontal coordinate is the false positive rate. The best possible prediction method would yield a point with the coordinate (0, 1), representing a 100% true positive rate and a false positive rate of 0 (i.e., 100% specificity). Therefore, the (0, 1) point is also considered a perfect classification. A completely random guess would give a point along the diagonal from the point (0, 0) to (1, 1). The AUROC (area under the ROC curve) is often used to indicate the performance quality of a binary classifier: an AUROC of 0.5 is equivalent to random prediction, while an AUROC of 1 represents a perfect prediction.
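An ROC curve and its area can be computed by sweeping the threshold over the sorted scores and applying the trapezoidal rule. The sketch below is a generic illustration with hypothetical function names and toy scores; it is not the plotting procedure used for Fig. 2.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) obtained by lowering the threshold step by step.
    labels are 1 for positive and 0 for negative; tied scores move together."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy scores: one negative is ranked above one positive.
toy_auc = auroc(roc_points([0.9, 0.7, 0.6, 0.3], [1, 0, 1, 0]))
```

The same value can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one.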

Additional Information

How to cite this article: Chen, W. et al. Identification and analysis of the N6-methyladenosine in the Saccharomyces cerevisiae transcriptome. Sci. Rep. 5, 13859; doi: 10.1038/srep13859 (2015).