An improved method for analyzing deep RNA sequencing data provides a comprehensive view of RNA editing sites.
Recent advances in high-throughput sequencing technology have made it possible to study RNA editing on a genome-wide scale. But realizing the potential of this approach requires stringent data analysis methods that control for genomic variation, sequencing errors and biases introduced by read-mapping procedures. In this issue, Peng et al.1 introduce such methods and apply them to conduct a careful, large-scale study of RNA editing in the transcriptome of a Han Chinese individual. These data provide the first reliable map of RNA edits in a person.
RNA editing is a post-transcriptional process that alters the sequence of primary RNA transcripts. In humans, the most common type of editing is believed to be conversion of adenosine to inosine (A→I) in double-stranded RNA. When edited RNA is reverse-transcribed to cDNA, inosine behaves like guanine, resulting in an A→G change in the cDNA. A→I editing has been studied extensively as it can affect post-transcriptional regulation and translation and play a role in developmental processes and disease. Early studies of RNA editing showed that it takes place in many tissues and organisms. In humans, it is thought to occur predominantly in the brain and may be a key regulator of neural development. RNA editing has also been directly implicated in several diseases2 and for these reasons is a phenomenon of great interest to both molecular biologists and biomedical scientists.
In principle, the identification of edited sites is straightforward: comparing sequenced RNA (in the form of cDNA) and the DNA from which it is derived should reveal recurrent differences caused by editing. The locations of such sites, called RNA-DNA differences3, were first systematically explored when expressed sequence tag technology became available in the early 1990s. However, high error rates and the lack of reference genomes precluded large-scale cataloging of RNA-DNA differences until the early 2000s, when the throughput of sequencing instruments increased enormously and the NCBI trace archive was established4. Once a reference human genome had been assembled, researchers investigated the use of expressed sequence tags for finding A→I edits and discovered 12,723 such sites with a reported accuracy of 95% (ref. 5). The most complete study of edits based on the trace archive identified numerous A→I and C→U (cytosine deaminated to uracil) edits by restricting analyses to clusters of mismatches6. The report also highlighted the presence of systematic errors and the difficulty they cause in identifying RNA-DNA differences.
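The basic comparison described above can be sketched in a few lines. This is only an illustration of the naive approach, not the method of any study cited here: the function name, the input layout (a homozygous DNA base and a string of observed cDNA bases per position) and the thresholds are all assumptions made for the example.

```python
from collections import Counter


def call_rdd_candidates(dna_base, rna_pileup, min_depth=10, min_fraction=0.1):
    """Naively list candidate RNA-DNA differences.

    dna_base:   dict mapping position -> genomic base (homozygous sites only; assumed input)
    rna_pileup: dict mapping position -> string of cDNA bases observed at that position
    Thresholds are illustrative placeholders, not values from any published pipeline.
    """
    candidates = []
    for pos in sorted(rna_pileup):
        bases = rna_pileup[pos]
        ref = dna_base.get(pos)
        if ref is None or len(bases) < min_depth:
            continue  # no genotype, or too few RNA reads to say anything
        counts = Counter(bases)
        for alt in sorted(counts):
            if alt == ref:
                continue
            fraction = counts[alt] / len(bases)
            if fraction >= min_fraction:
                # e.g. "A>G" is how an A-to-I edit appears in cDNA
                candidates.append((pos, f"{ref}>{alt}", fraction))
    return candidates


if __name__ == "__main__":
    dna = {100: "A", 101: "C", 102: "A"}
    rna = {100: "AAAAAGGGGG", 101: "CCCCCCCCCC", 102: "AAG"}
    print(call_rdd_candidates(dna, rna))
```

A caller of this kind reports every recurrent mismatch, which is exactly why it is vulnerable to the sequencing and mapping errors discussed in the rest of this article.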
Until last year, RNA editing studies focused on RNA-DNA differences that could be explained by known biological mechanisms (A→I and C→U) on the assumption that other types of RNA-DNA differences were probably the result of sequencing errors. Then, in a landmark 2011 study that took advantage of advances in high-throughput sequencing to detect RNA-DNA differences in human B cells from 27 individuals, Li et al.3 found large numbers of noncanonical RNA-DNA differences. Specifically, they identified 28,766 apparent editing events distributed among 10,210 sites, and 20,848 of these events (73%) did not correspond to canonical A→I or C→U edits.
The scale of this study was unprecedented, and the striking suggestion that noncanonical edits constitute the majority of edits led to immediate questions about their functional significance and underlying molecular mechanisms. But the initial excitement gave way to skepticism because of the complexity of the bioinformatics analysis and the realization that many of the identified RNA-DNA differences could be explained by errors in sequencing or by mapping errors in the assignment of RNA-Seq reads to the reference transcriptome.
In a follow-up study examining the same data and using more stringent bioinformatics filters, another group could confirm only 1,809 of the 10,210 sites (18%), with >50% of the 1,809 verified sites corresponding to A→I edits7. A second follow-up study examined A→I edits using RNA-Seq data from a human glioblastoma cell line and found 9,936 sites8, of which only 73 coincided with the 10,210 sites (0.7%) found by Li et al.3. The low concordance might be due to the different cell lines used in the two studies. However, it is interesting to note that the later study8 found very few noncanonical RNA-DNA differences, most of which were already cataloged in RNA-editing databases.
These studies leave wide open fundamental questions about RNA editing: how prevalent is it, and what is the frequency of noncanonical RNA-DNA differences? They also underscore the problems associated with designing computational algorithms for identifying true RNA-DNA differences in the presence of sequencing errors and ambiguities arising from the short read lengths of current sequencing technologies (Fig. 1). For example, systematic sequencing errors9 can affect both DNA and RNA sequence, masking true edit sites or creating mirages; and errors that result in the incorrect assignment of RNA-Seq reads to paralogous genes can lead to sequence mismatches that look like RNA-DNA differences.
Peng et al.1 have addressed these challenges in a landmark study of their own. They generated one of the most comprehensive RNA-Seq data sets ever produced for a single individual by sequencing polyadenylated, nonpolyadenylated and small RNAs isolated from a lymphoblastoid cell line to an unprecedented depth: 767 million 75- to 100-bp reads. The genome of the individual, a male Han Chinese designated 'YH', had been sequenced previously by the same group, allowing comparison of the genome and the transcriptome.
To identify RNA-DNA differences, Peng et al.1 developed a computational pipeline consisting of 11 filters for discarding RNA-DNA differences that arise from sequencing errors or bioinformatics problems. The filters check for strand bias to remove RNA-DNA differences that result from previously described9 systematic errors, eliminate reads that align to multiple locations and apply a sophisticated sequence of checks to ensure that single-nucleotide polymorphisms, copy-number variants and random errors in reads do not confound results (Fig. 1). Many of these issues were not addressed by Li et al.3. By filtering for sites that are associated with single-nucleotide polymorphisms and copy-number variants, Peng et al.1 go beyond previous work8 that focused on mapping errors.
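To make the idea of stacked filters concrete, the sketch below applies three of the kinds of checks mentioned above (read support, unique mapping and known polymorphisms, plus a crude strand-balance test) to a candidate site. It is a hypothetical illustration only: the `Candidate` fields, the `passes_filters` function and every threshold are assumptions for this example and do not reproduce the 11 filters of Peng et al.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    position: int
    change: str            # e.g. "A>G"
    variant_fwd: int       # variant-supporting reads on the forward strand
    variant_rev: int       # variant-supporting reads on the reverse strand
    uniquely_mapped: bool  # True if supporting reads align to a single locus


def passes_filters(c, known_snps, min_reads=4, max_strand_fraction=0.9):
    """Return True if a candidate survives all illustrative filters.

    known_snps is a set of positions with known DNA polymorphisms (assumed input).
    The thresholds are placeholders chosen for the example.
    """
    support = c.variant_fwd + c.variant_rev
    if support < min_reads:
        return False  # too few reads: could be random sequencing error
    if not c.uniquely_mapped:
        return False  # ambiguous alignment: could come from a paralog
    if c.position in known_snps:
        return False  # genomic variation, not an RNA-level change
    fwd_fraction = c.variant_fwd / support
    if fwd_fraction > max_strand_fraction or fwd_fraction < 1 - max_strand_fraction:
        return False  # support almost entirely on one strand: strand bias
    return True


if __name__ == "__main__":
    snps = {2000}
    sites = [
        Candidate(1000, "A>G", 5, 5, True),
        Candidate(2000, "A>G", 6, 6, True),    # known SNP
        Candidate(3000, "G>T", 10, 0, True),   # strand-biased
        Candidate(4000, "A>G", 5, 5, False),   # multi-mapped
    ]
    print([s.position for s in sites if passes_filters(s, snps)])
```

Ordering cheap checks first, as here, is a design convenience rather than a requirement; the outcome is the same in any order because a site must pass every filter.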
Applying the pipeline to the RNA-Seq data yielded 22,688 sites with RNA-DNA differences. This number is similar to that found by Li et al.3, but unlike in that work, 93% of the sites found by Peng et al.1 were A→I (or A→G changes in cDNA). Among the 1,575 noncanonical sites, two-thirds were T→C, G→A and C→T changes. Both canonical and noncanonical sites could be validated by Sanger sequencing of DNA and cDNA from the same batch of cells, although the noncanonical sites were validated at a false discovery rate of ∼49% compared with <10% for canonical sites. Significantly, these results confirm that A→I and T→C changes are the predominant form of RNA editing, and although the high false-discovery rate of noncanonical RNA-DNA differences means that individual sites cannot be determined with confidence, such sites do appear to constitute a nonnegligible portion of the differences that remain to be explained.
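The proportions quoted above can be checked with simple arithmetic. The snippet below uses only the counts given in the text; extrapolating the Sanger validation rates to all sites, and treating "<10%" as exactly 10%, are assumptions made purely for illustration.

```python
# Counts taken from the text above.
total_sites = 22_688
noncanonical_sites = 1_575

# Sites that are not noncanonical are taken to be the A-to-I (A>G in cDNA) sites.
canonical_sites = total_sites - noncanonical_sites
canonical_fraction = canonical_sites / total_sites

# Assumption: the validation false discovery rates apply to every site in each class.
fdr_noncanonical = 0.49
fdr_canonical_upper = 0.10  # the text reports "<10%", so this is an upper bound

expected_true_noncanonical = noncanonical_sites * (1 - fdr_noncanonical)
expected_true_canonical_min = canonical_sites * (1 - fdr_canonical_upper)

if __name__ == "__main__":
    print(f"canonical sites: {canonical_sites} ({canonical_fraction:.1%})")
    print(f"expected true noncanonical sites: about {expected_true_noncanonical:.0f}")
    print(f"expected true canonical sites: at least {expected_true_canonical_min:.0f}")
```

Under these assumptions, several hundred noncanonical sites would still be expected to be genuine, which is consistent with the statement that they form a nonnegligible portion of the differences.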
The work of Peng et al.1 also provides a comprehensive reference Han Chinese transcriptome to accompany the Han Chinese genome. Notably, by mining this valuable resource for edits in small RNAs, the authors discovered 44 editing sites in microRNAs, some of which fall in seed regions that may affect the microRNAs' target specificity. They also provide ample evidence that their pipeline is robust at varying sequence depths in both the genome and transcriptome and show that it can be applied to RNA-Seq data obtained in typical experiments. Taken together, these tools and results pave the way for comprehensive cataloging of editing events in multiple tissue types in different populations.
Resolving the exact extent of noncanonical RNA-DNA differences will require novel statistical approaches for the joint analysis of genome and transcriptome sequencing reads. Such approaches will have to take into account transcript abundances while incorporating models not only for sequence errors but also for the assignment of ambiguously mapped reads among transcripts10. Functional genomics questions related to RNA editing will follow, with many more connections between transcriptional regulation and RNA editing likely to emerge in the coming years.
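One standard way to assign ambiguously mapped reads among transcripts is an expectation-maximization (EM) procedure that alternates between splitting each read across its compatible transcripts and re-estimating abundances. The sketch below is a deliberately minimal version that ignores transcript lengths and sequencing errors; the function name and input format are assumptions for the example, and it is not the model used in any of the cited work.

```python
def em_abundances(read_compat, n_iter=50):
    """Estimate relative transcript abundances from ambiguous read assignments.

    read_compat: list of sets, each holding the transcripts a read is compatible with.
    Returns a dict transcript -> estimated fraction of reads (fractions sum to 1).
    Simplification: transcript length and error models are ignored.
    """
    transcripts = sorted({t for compat in read_compat for t in compat})
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        # E-step: split each read among compatible transcripts in proportion to theta.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            if total == 0:
                continue
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: abundances are the normalized expected read counts.
        n = sum(expected.values())
        theta = {t: expected[t] / n for t in transcripts}
    return theta


if __name__ == "__main__":
    # Two reads map only to transcript A; one read is ambiguous between A and B.
    reads = [{"A"}, {"A"}, {"A", "B"}]
    print(em_abundances(reads))
```

In this toy input the ambiguous read is progressively attributed to transcript A, because the uniquely mapped reads provide evidence only for A. A joint model for RNA-DNA differences would need to extend this idea with genotype and error terms.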
1. Peng, Z. et al. Nat. Biotechnol. 30, 253–260 (2012).
2. Maas, S. et al. RNA Biol. 3, 1–9 (2006).
3. Li, M. et al. Science 333, 53–58 (2011).
4. Wulff, B.-E. et al. Nat. Rev. Genet. 12, 81–85 (2011).
5. Levanon, E.Y. et al. Nat. Biotechnol. 22, 1001–1005 (2004).
6. Zaranek, A.W. et al. PLoS Genet. 6, e1000954 (2010).
7. Schrider, D. et al. PLoS ONE 6, e25842 (2011).
8. Bahn, J.H. et al. Genome Res. 22, 142–150 (2012).
9. Meacham, F. et al. BMC Bioinformatics 12, 451 (2011).
10. Trapnell, C. et al. Nat. Biotechnol. 28, 511–515 (2010).
The author declares no competing financial interests.