Accurate detection of m6A RNA modifications in native RNA sequences

The epitranscriptomics field has undergone an enormous expansion in the last few years; however, a major limitation is the lack of generic methods to map RNA modifications transcriptome-wide. Here, we show that using direct RNA sequencing, N6-methyladenosine (m6A) RNA modifications can be detected with high accuracy, in the form of systematic errors and decreased base-calling qualities. Specifically, we find that our algorithm, trained with m6A-modified and unmodified synthetic sequences, can predict m6A RNA modifications with ~90% accuracy. We then extend our findings to yeast data sets, finding that our method can identify m6A RNA modifications in vivo with an accuracy of 87%. Moreover, we further validate our method by showing that these ‘errors’ are typically not observed in yeast ime4-knockout strains, which lack m6A modifications. Our results open avenues to investigate the biological roles of RNA modifications in their native RNA context.


BACKGROUND
In the last few years, our ability to map RNA modifications transcriptome-wide has revolutionized our understanding of how these chemical entities shape cellular processes, modulate cancer risk, and govern cellular fate [1][2][3][4]. Systematic efforts to characterize this regulatory layer have revealed that RNA modifications are far more widespread than previously thought, can be subjected to dynamic regulation, and can profoundly impact RNA processing stability and translation [5][6][7][8][9][10]. A fundamental challenge in the field, however, is the lack of a generic approach for mapping and quantifying RNA modifications, as well as the lack of single molecule resolution [11].
Current technologies to map the epitranscriptome rely on next-generation sequencing and, as such, they are typically blind to nucleotide modifications. Consequently, indirect methods are required to identify RNA modifications transcriptome-wide, which has been mainly approached using two different strategies: (i) antibody immunoprecipitation, which specifically recognizes the modified ribonucleotide [5,6,[12][13][14]; and (ii) chemical-based detection, using chemical compounds that selectively react with the modified ribonucleotide of interest, followed by reverse-transcription of the RNA fragment, which leads to accumulation of reads that have the same identical ends [8,9,15]. Although these methods have provided highly valuable information, they are limited by the available repertoire of commercial antibodies and the lack of selective chemical reactivities towards a particular RNA modification [16], often lack single nucleotide resolution [5][6][7] or require complex protocols to achieve it [17], cannot provide quantitative estimates of the stoichiometry of the modification at a given site, and are often unable to identify the underlying RNA molecule that is modified.
To overcome these limitations, third-generation sequencing technologies, such as the platforms provided by Oxford Nanopore Technologies (ONT) [18] and Pacific Biosciences (PacBio) ( [3], have been proposed as a new means to detect RNA modifications in native RNA sequences. These technologies can detect RNA modifications by measuring the kinetics of the reverse transcriptase as it encounters a modified RNA -in the case of PacBio-, or by directly sequencing the RNA in its native form by pulling a native RNA through the nanopore -in the case of ONT-. Although ONT direct RNA sequencing is already a reality [19,20], extracting RNA modification information from ONT reads is an unsolved challenge. RNA modifications are known to cause disruptions in the pore current that can be detected upon comparison of raw current intensities -also known as 'squiggles'- [18,19]. However, current efforts have not yet yielded an efficient and accurate RNA modification detection algorithm, largely due to the challenges in the alignment and re-squiggling of RNA current intensities. As an alternative strategy, we hypothesized that the current intensity changes caused by the presence of RNA modifications may lead to increased 'errors' and decreased qualities from the output of base-calling algorithms that do not model base modifications (Figure 1A). Indeed, here we find that base-calling 'errors' can accurately identify N6-methyladenosine (m6A) RNA modifications in native RNA sequences, and propose a novel algorithm, EpiNano (github.com/enovoa/EpiNano), which can be used to identify m6A RNA modifications from native RNA reads with an overall accuracy of ~90%. Our results provide a proof of concept for the use of base-called features to identify RNA modifications using direct RNA sequencing, and open new avenues to explore additional RNA modifications in the future.
Therefore, to systematically identify the current intensity changes caused by the presence of a given RNA modification, perturbations of the current intensity signals should be measured and analyzed for each possible 5mer (n=1024). To this end, we designed a set of synthetic sequences that comprised all possible 5-mers (median occurrence of each 5-mer=10), while minimizing the RNA secondary structure (see Methods and File S1). We then compared the direct RNA sequencing reads of in vitro-transcribed constructs that incorporated N6methyladenosine ('m6A') to those with unmodified ribonucleotides ('unm') ( Figure 1A). Comparison of the two datasets revealed that base-called m6A-modified reads are significantly enriched in mismatches compared to their unmodified counterparts ( Figure 1B and 1C), and that these 'errors' are mainly, but not exclusively, located in adenine positions. We observed that, in addition to mismatch frequency, other metrics including perbase quality, insertion frequency, deletion frequency and current intensity, were significantly altered ( Figure 1C and Figure S1). Moreover, these 'errors' were reproducible in independent biological replicates with respect to mismatch frequency, deletion frequency, per-base quality and current intensity (Figures 1D-G). By contrast, insertion frequencies were not reproducible across biological replicates (Figure S1), suggesting that this feature is likely unrelated to the presence of RNA modifications, and thus was not further considered in downstream analyses.
We then examined whether these observed differences would be sufficient to accurately classify a given site into "modified" or "unmodified". For this aim, we first focused our analysis on 5-mers that matched the known m6A motif RRACH, as these would be the most relevant in which to detect m6A modifications. To reveal whether the features from m6A-modified RRACH k-mers were distinct from unmodified RRACH k-mers, we compiled the base-called features (base quality, mismatch frequency and deletion frequency) for each position of the kmer (-2, -1, 0, +1, +2) (Figure 2A, see also Figure S2), and performed Principal Component Analysis (PCA) of the features, finding that the two populations (m6A-modified and unmodified RRACH k-mers) were largely non-overlapping ( Figure 2B). As a control, we performed the same analysis in k-mers with identical sequence context, but centered in C, G, or U (instead of A), finding that no differences could be observed between these populations ( Figure 2C), suggesting that the observed differences are m6A-specific and not dataset-specific.
To statistically determine whether these features could be used to accurately classify a given site into 'm6Amodified' or 'unmodified', we trained multiple Support Vector Machines (SVM) using as input the base-called features from m6A-containing RRACH k-mers and unmodified RRACH k-mers (see Methods). We first tested whether each individual feature at position 0 (the modified site) was able to classify a given RRACH k-mer into m6A-modified or unmodified. Our results show that base quality, deletion frequency and mismatch frequency alone were able to accurately predict the modification status with reasonable accuracy (70-86% accuracy, depending on the feature used) (Figure 2D, see also Table S1 and Methods). By contrast, we find that the current mean intensity values and current intensity standard deviation were poor predictors of the modification status of the k-mer (43-65% accuracy). As a control, we used the same set of features in control k-mers (i.e., those with the same sequence context, but centered in C, G or U), finding that the features did not distinguish between m6A-modified and m6A-unmodified datasets ( Figure 2E, see also Figure S3). To improve the performance of the algorithm, we then examined whether a combination of the features might improve the prediction accuracy, finding that the combination of the 3 features (base quality, mismatch and deletion frequency) increased the accuracy of the model (88-91%) ( Figure 2F, see also Table S1). Finally, we tested whether the inclusion of all the features from the neighbouring positions (-2, -1, +1, +2) would improve the performance of the algorithm. We find that the inclusion of neighbouring features slightly improves the performance of the algorithm (accuracy = 97-99%), however, this was at the expense of increasing the detection of false positives that in the control k-mer set -which do not contain the modification-( Figure 2G, see also  Overall, our results provide a proof of principle for the use of base-calling 'errors' as an accurate and computationally simple solution to identify m6A modifications with an overall accuracy of ~90%. Moreover, our method does not require the manipulation of raw current intensities or squiggle alignments. We envision that similar approaches can be developed to detect other RNA modifications in the future, opening new avenues to explore the role of RNA modifications for which we currently lack transcriptome-wide maps. In addition, our m6A-modified and unmodified datasets covering all possible 5-mers, which are publicly available, can be employed to train different machine learning algorithms than the ones tested here (e.g., signal-based machine learning, base-caller training, etc.), and thus obtain improved methods to detect RNA modifications in the future.

CONCLUSIONS
The human epitranscriptome is still largely uncharted. Only a handful of the 170 different RNA modifications that are known to exist have been mapped. Importantly, several of these modifications have been found to be involved in central biological processes, such as sex determination or cell fate transition, and their dysregulation has been linked to multiple human diseases, including neurological disorders and cancers. Yet, our understanding of this regulatory layer is restricted to a few RNA modifications, largely due to the lack of a generic methodology to map them in a transcriptome-wide fashion. This work provides a novel strategy to identify RNA modifications from base-called features, without the need of squiggle realignments, opening new avenues to study the epitranscriptome with unprecedented resolution. The establishment of the ONT platform as a tool to map virtually any given modification will allow us to query the epitranscriptome in ways that, until now, had not been possible. Future work can expand to other modifications like 5-methylcytosine (m5C), as well as provide additional thresholds for controlling specificity and sensitivity.

Supplemental materials and Methods
File S1. Fasta sequences of the in vitro constructs used in this work to train and test the RNA modification algorithm.

File S2.
Step-by-step protocol to produce m6A-modified and unmodified RNA sequences, to be sequenced using direct RNA sequencing, and used to train/test the RNA modification detection algorithm.    Table S1. Accuracy of prediction of m6A-modified sites in m6A motifs, relative to control k-mers, using different feature combinations, and for both replicates. Table S2. Metrics of quality and yield of in vitro transcription of both modified and unmodified RNAs, which were used as input for the direct RNA sequencing runs. Table S3. Metrics of the number of sequenced reads, base-called reads and mapped reads using direct RNA sequencing, for the 4 runs included in this study                antim6Amotif_noAs.rep2