Accurate identification of RNA editing sites from primitive sequence with deep neural networks

RNA editing is a post-transcriptional RNA sequence alteration. Current methods have identified editing sites and facilitated research but require sufficient genomic annotations and prior-knowledge-based filtering steps, resulting in a cumbersome, time-consuming identification process. Moreover, these methods have limited generalizability and applicability in species with insufficient genomic annotations or in conditions of limited prior knowledge. We developed DeepRed, a deep learning-based method that identifies RNA editing from primitive RNA sequences without prior-knowledge-based filtering steps or genomic annotations. DeepRed achieved 98.1% and 97.9% area under the curve (AUC) in training and test sets, respectively. We further validated DeepRed using experimentally verified U87 cell RNA-seq data, achieving 97.9% positive predictive value (PPV). We demonstrated that DeepRed offers better prediction accuracy and computational efficiency than current methods with large-scale, mass RNA-seq data. We used DeepRed to assess the impact of multiple factors on editing identification with RNA-seq data from the Association of Biomolecular Resource Facilities and Sequencing Quality Control projects. We explored developmental RNA editing pattern changes during human early embryogenesis and evolutionary patterns in Drosophila species and the primate lineage using DeepRed. Our work illustrates DeepRed’s state-of-the-art performance; it may decipher the hidden principles behind RNA editing, making editing detection convenient and effective.

We introduced three popular deep learning-based parallel ensemble methods, namely classic bagging, modified bagging, and bagging-style bootstrapped resampling 1 , to develop a flexible single-cell module of DeepRed that can dynamically handle the class-imbalance problem and be easily extended to various cells/tissues and even other species. Although partitioning the majority class is a typical approach for class-imbalanced data, it is a static method and cannot accommodate an imbalance degree that changes dynamically; we therefore excluded this method. To select the most suitable of the three bagging-style methods for DeepRed, we tested and evaluated each independently on class-imbalanced data of various imbalance degrees (ratio of negative to positive samples) in two cell types (Fig. S4). For a fair comparison, the number and structure of the individual DNN classifiers were kept the same (10 individual DNNs with identical structure). Three performance indicators reported as suitable for assessing imbalanced data sets, namely sensitivity, specificity and GM (the geometric mean of sensitivity and specificity), were measured using modified 5-fold cross-validation. We observed that for both the classic bagging and modified bagging methods, sensitivity and GM decreased rapidly and specificity increased slightly with increasing imbalance degree (Fig. S4A-D).
These results suggest that the two methods tend to be biased towards the majority class and cannot alleviate the class-imbalance problem. In contrast, for the bagging-style bootstrapped resampling method, all three performance indicators remained unbiased across different imbalance degrees, suggesting that this method can effectively eliminate the influence of class imbalance. Thus, we used the bagging-style bootstrapped resampling method in the single-cell module of DeepRed to process dynamically class-imbalanced data.
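A minimal sketch of the bagging-style bootstrapped resampling idea, under our reading of the method (not the authors' implementation): each base learner receives a bootstrap sample of the minority class plus an equal-sized random draw from the majority class, so every learner trains on balanced data regardless of the overall imbalance degree, and GM is used to score the result.

```python
import random

def balanced_bootstrap(minority, majority, rng):
    """Draw one balanced training set: bootstrap the minority class
    (sample with replacement to its own size) and undersample the
    majority class to the same size."""
    n = len(minority)
    boot_minority = [rng.choice(minority) for _ in range(n)]
    boot_majority = rng.sample(majority, n)  # without replacement
    return boot_minority + boot_majority

def geometric_mean(sensitivity, specificity):
    """GM = sqrt(sensitivity * specificity), robust to class imbalance."""
    return (sensitivity * specificity) ** 0.5

# Toy example: imbalance degree of 5 (negatives/positives = 5)
rng = random.Random(0)
pos = [("pos", i) for i in range(100)]
neg = [("neg", i) for i in range(500)]
training_sets = [balanced_bootstrap(pos, neg, rng) for _ in range(10)]
# Each of the 10 base DNNs would now train on 100 + 100 balanced examples.
```

Because each learner's training set is rebuilt per draw, the same procedure applies unchanged when the imbalance degree of the incoming data shifts.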

Simple average as combination method of ensemble learning
For the combination method of ensemble learning, we evaluated three approaches: simple averaging, weighted averaging, and stacking. In the simple averaging method, the predicted probabilities from each individual DNN were averaged to produce a single estimate. We set the weights of the weighted average according to the AUC or accuracy of the single-cell module and found that the difference between weighted and simple averaging was negligible. Although the stacking method improved model performance to some extent, the boost was extremely limited (less than 1% in AUC): the weighted averaging and stacking methods achieved only 0.0045% and 0.7482% higher AUC than simple averaging, respectively. In addition, stacking doubled the computing time and computing resources and had limited scalability. Importantly, simple averaging can be extended with additional classifiers based on the original training, whereas stacking must be re-trained to integrate an additional classifier. Considering computing time, computing resources and, more importantly, expandability, we chose simple averaging as the final combination method for ensemble learning.
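The simple averaging combination can be sketched in a few lines; the probabilities below are hypothetical, for illustration only:

```python
def simple_average(probabilities):
    """Combine per-classifier predicted probabilities by an unweighted mean.
    `probabilities` is a list of per-DNN probability vectors over the sites."""
    k = len(probabilities)
    return [sum(col) / k for col in zip(*probabilities)]

# Hypothetical "editing site" probabilities from three DNNs on four sites:
per_dnn = [
    [0.90, 0.20, 0.70, 0.45],
    [0.80, 0.10, 0.60, 0.55],
    [0.70, 0.30, 0.80, 0.50],
]
ensemble = simple_average(per_dnn)       # approx. [0.8, 0.2, 0.7, 0.5]
calls = [p >= 0.5 for p in ensemble]     # final editing-site calls
```

Note that appending a fourth probability vector to `per_dnn` extends the ensemble with no retraining, which is the expandability advantage over stacking described above.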

SNV calling method
The reference genome, dbSNP and gene model used in this study are listed in Table S18. We used STAR 2 (Version: 2.5.2b) to align RNA-seq reads to a reference genome and used the MarkDuplicates tool from Picard (https://broadinstitute.github.io/picard/) to remove identical reads (PCR duplicates) that mapped to the same location. Reads with a mapping quality < 20 were removed by SAMtools (Version: 1.3.1) 3 . For human and mouse, SNVs were called using the GATK 4 (Version: 3.5.0) HaplotypeCaller tool with options stand_call_conf set at 20 and stand_emit_conf set at 0. For Drosophila, SNVs were called using the SAMtools pileup program with option "-Q 15" due to the insufficiency of SNP information in Drosophila species.
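The human/mouse branch of this pipeline can be sketched as command lines assembled in Python; the file names, jar paths and some flag spellings are illustrative and version-dependent (Picard and GATK 3.x are invoked via `java -jar` here), so treat this as a sketch rather than the exact commands used:

```python
def snv_calling_commands(fastq, genome_dir, ref_fasta, out_prefix):
    """Sketch of the SNV calling pipeline as shell command strings.
    Paths and jar names are placeholders; flags follow the versions
    cited in the text but may differ in other releases."""
    return [
        # 1) align RNA-seq reads to the reference genome with STAR
        f"STAR --genomeDir {genome_dir} --readFilesIn {fastq} "
        f"--outFileNamePrefix {out_prefix}. --outSAMtype BAM SortedByCoordinate",
        # 2) remove PCR duplicates with Picard MarkDuplicates
        f"java -jar picard.jar MarkDuplicates "
        f"I={out_prefix}.Aligned.sortedByCoord.out.bam "
        f"O={out_prefix}.dedup.bam M={out_prefix}.dup_metrics.txt "
        f"REMOVE_DUPLICATES=true",
        # 3) drop reads with mapping quality < 20 using SAMtools
        f"samtools view -b -q 20 -o {out_prefix}.filtered.bam {out_prefix}.dedup.bam",
        # 4) call SNVs with GATK 3.x HaplotypeCaller (human/mouse)
        f"java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R {ref_fasta} "
        f"-I {out_prefix}.filtered.bam -stand_call_conf 20 -stand_emit_conf 0 "
        f"-o {out_prefix}.vcf",
    ]

cmds = snv_calling_commands("sample.fastq", "star_index/", "hg19.fa", "sample")
```

For Drosophila, step 4 would instead be a SAMtools pileup call with "-Q 15", as described above.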

Identifying RNA editing sites with existing methods
In the separate samples method, SNVs are called from separate RNA-seq alignments of each sample, and recurring variants are retrieved 5 . In the pooled samples method, SNVs are called from pooled RNA-seq alignments of all samples 5 . The SNVs were filtered in five steps: 1) all known SNPs present in dbSNP (except SNPs of molecular type "cDNA") were removed; 2) mismatches in the first six bases of each read were discarded to avoid artificial mismatches derived from random-hexamer priming; 3) intronic sites in non-Alu regions were removed if they were located within 4 base pairs of a known splice junction, and sites in homopolymer runs of 5 base pairs and in simple repeats were removed; 4) sites in regions highly similar to other parts of the genome, as determined with the BLAST-like alignment tool (BLAT), were removed; and 5) the editing type of each site was inferred from the strand of overlapping annotated genes.
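Several of these per-site filters reduce to small predicates; the following is a simplified illustration of steps 2 and 3 (not the published pipeline, and the repeat annotations that the real filters consult are omitted):

```python
def in_homopolymer(seq, pos, run_len=5):
    """True if position `pos` of `seq` lies inside a homopolymer run of
    at least `run_len` identical bases (filter step 3)."""
    base = seq[pos]
    left = right = pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1 >= run_len

def near_splice_junction(pos, junctions, window=4):
    """True if a site is within `window` bp of a known splice junction."""
    return any(abs(pos - j) <= window for j in junctions)

def in_read_head(read_pos, head=6):
    """True if a mismatch falls in the first `head` bases of a read,
    where random-hexamer priming artifacts concentrate (filter step 2)."""
    return read_pos < head
```

A candidate site would be kept only if all such predicates return False for it.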
In the GIREMI method 6 , SNVs were filtered in five steps: 1) mismatches in the first six bases of each read were discarded to avoid artificial mismatches derived from random-hexamer priming; 2) sites located in simple repeat regions or homopolymer runs of 5 nt were discarded; 3) variant sites with total read coverage < 5 or supporting reads < 3 were discarded; 4) sites with extreme variant allele frequencies (>95% or <10%) were discarded; and 5) sites located within 4 nt of a known splice junction were removed. The remaining SNVs were annotated in the required format for GIREMI using GENCODE (V25 lift37). Then, we ran GIREMI to calculate the mutual information (MI) associated with SNPs or RNA editing sites and to identify RNA editing sites in RNA-seq reads. GIREMI was downloaded from https://github.com/zhqingit/giremi.
In the Prediction method 7 , SNVs were retained as RNA editing sites if they met five criteria: 1) mismatch sites with Hits Per Billion mapped bases (HPB) > 5; 2) mismatch sites with a mismatch ratio between 5% and 40% or between 60% and 95%; 3) mismatch sites with effective signal > 95%; 4) at least two individual reads supporting the same type of nucleotide conversion; and 5) mismatch sites absent from gSNPs in the common SNP database (build 137).
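The five retention criteria combine into a single predicate; this is our reading of the method (in particular, HPB is taken as supporting reads scaled to a billion mapped bases), so the sketch is illustrative rather than the authors' code:

```python
def is_candidate_editing_site(variant_reads, total_reads, total_mapped_bases,
                              effective_signal, known_gsnps, site):
    """Apply the five Prediction-method criteria to one mismatch site.
    HPB (hits per billion mapped bases) is computed from the number of
    supporting reads and the library's total mapped bases (our assumption)."""
    hpb = variant_reads / total_mapped_bases * 1e9      # 1) HPB > 5
    ratio = variant_reads / total_reads                 # 2) mismatch ratio
    return (hpb > 5
            and (0.05 <= ratio <= 0.40 or 0.60 <= ratio <= 0.95)
            and effective_signal > 0.95                 # 3) effective signal
            and variant_reads >= 2                      # 4) >= 2 supporting reads
            and site not in known_gsnps)                # 5) not a known gSNP

# Toy usage: one known gSNP, one candidate site with 10/50 variant reads
gsnps = {("chr1", 100)}
keep = is_candidate_editing_site(10, 50, 1e9, 0.99, gsnps, ("chr1", 200))
```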
RNAEditor is an easy-to-use tool 8 . We used the script CallingEditingSites.py from the RNAEditor package to call RNA editing sites.

Assessing the impact of multiple factors on identifying RNA editing sites
We examined a series of factors, including library preparation method, RNA degradation method, laboratory, sequencing depth, and read mapping and variant calling method, and their impact on the identification of RNA editing by applying DeepRed to RNA-seq data from the ABRF 9 and SEQC 10 projects.
We assessed the impact of library preparation and RNA degradation methods by analysing 34 samples from the ABRF project 9 (Table S10): 8 intact RNA samples prepared by poly-A enrichment, 8 intact RNA samples prepared by ribo-depletion, and 18 degraded RNA samples prepared by ribo-depletion. The degraded samples were degraded by one of three methods, namely heat, sonication or RNase-A.
Widespread adoption of RNA-seq has produced plentiful data from multiple sites; however, there has been no systematic examination of the impact of lab-specific bias on detecting RNA editing sites. We explored the effect of laboratory by analysing 94 samples collected from 7 laboratories: AGR (Australian Genome Research Facility), BGI (Beijing Genomics Institute), CNL (Weill Cornell Medical College), COH (City of Hope), MAY (Mayo Clinic), NVS (Novartis) and NYG (New York Genome Center) (Table S11).
To examine the impact of sequencing depth, we pooled human brain RNA-seq samples from 16 individuals in the SEQC project 10 (Table S9) and downsampled the pooled alignment file to explore the relationship between the detection of RNA editing sites and sequencing depth. For each depth, we sampled the corresponding RNA-seq reads and calculated the A-to-I ratio and the number of A-to-I editing sites. We repeated this downsampling analysis 10 times for each depth and calculated the average values.
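The repeated downsampling-and-averaging loop can be sketched as follows; `detect_a_to_i` is a placeholder standing in for the full DeepRed detection pipeline, and the toy read labels are purely illustrative:

```python
import random

def detect_a_to_i(subsample):
    """Placeholder detector: counts reads carrying editing evidence.
    In the real analysis this is the full editing-site identification step."""
    return sum(1 for r in subsample if r == "AG")

def downsample_analysis(reads, depths, n_repeats=10, seed=0):
    """For each target depth, repeatedly subsample the pooled reads
    (without replacement) and average the detection counts over repeats."""
    rng = random.Random(seed)
    averages = {}
    for depth in depths:
        counts = [detect_a_to_i(rng.sample(reads, depth))
                  for _ in range(n_repeats)]
        averages[depth] = sum(counts) / n_repeats
    return averages

# Toy pooled alignment: 50 reads with editing evidence out of 1000
reads = ["AG"] * 50 + ["ref"] * 950
result = downsample_analysis(reads, depths=[100, 500, 1000])
```

Averaging over the 10 repeats per depth smooths out sampling noise before the depth-versus-detection curve is drawn.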
RNA-seq read mapping and variant calling are important steps in the detection of RNA editing sites. Various read mapping and variant calling strategies can be adopted, such as BWA 11 , STAR 2 , TopHat 12 , GATK 4 and SAMtools 3 , but the best practice for detecting RNA editing sites has not been fully defined. To this end, we used the same pooled human brain RNA-seq alignment (Table S9)

SUPPLEMENTARY TABLE LEGENDS

SUPPLEMENTARY FIGURE LEGENDS
Supplementary Figure: … 3) fine-tune the weights W of the DNN in terms of cross-entropy using mini-batch gradient descent in a supervised manner. The cost function is

$$J(W) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} y_{j}^{(i)} \log \hat{y}_{j}^{(i)}$$

where k is the number of classes, m is the number of input samples, $\hat{y}_{j}^{(i)}$ is the predicted probability of class j for sample i, and each input sample is an n-dimensional feature vector.
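The cross-entropy cost described above can be computed numerically as a short sketch (our illustration, with hypothetical probabilities; the real model minimizes this via mini-batch gradient descent):

```python
import math

def cross_entropy_cost(y_true, y_pred):
    """Cross-entropy cost: -(1/m) * sum_i sum_j y_ij * log(yhat_ij),
    with m samples, k classes, one-hot labels y and predicted
    probabilities yhat."""
    m = len(y_true)
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        for y, p in zip(truth, pred):
            if y:                       # only the true class contributes
                total += y * math.log(p)
    return -total / m

# Two samples, k = 2 classes (editing site vs. non-editing site)
y_true = [[1, 0], [0, 1]]
y_pred = [[0.9, 0.1], [0.2, 0.8]]
cost = cross_entropy_cost(y_true, y_pred)   # -(log 0.9 + log 0.8) / 2
```

The cost shrinks toward zero as the predicted probability mass concentrates on the true class.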