Efforts to catalog eukaryotic transcripts have uncovered many small RNAs (sRNAs) derived from gene termini and splice sites. Their biogenesis pathways are largely unknown, but a mechanism based on backtracking of RNA polymerase II (RNAPII) has been suggested. By sequencing transcripts 12–100 nucleotides in length from cells depleted of major RNA degradation enzymes and RNAs associated with Argonaute (AGO1/2) effector proteins, we provide mechanistic models for sRNA production. We suggest that neither splice site–associated (SSa) nor transcription start site–associated (TSSa) RNAs arise from RNAPII backtracking. Instead, SSa RNAs are largely degradation products of splicing intermediates, whereas TSSa RNAs probably derive from nascent RNAs protected by stalled RNAPII against nucleolysis. We also reveal new AGO1/2-associated RNAs derived from 3′ ends of introns and from mRNA 3′ UTRs that appear to draw from noncanonical microRNA biogenesis pathways.
New technologies have revealed a wealth of eukaryotic noncoding RNAs (ncRNAs) and added new members to existing families1,2,3,4,5,6. Of the more established classes, microRNAs (miRNAs) are regulators of gene expression ~21–24 nucleotides (nt) in length that control many cellular processes7,8. Canonical miRNAs originate from RNAPII transcripts that form extended hairpins. These are processed by the sequential action of the RNase III family enzymes Drosha and Dicer and are incorporated into complexes containing effector proteins of the AGO family9,10,11.
Not so well established is a diverse set of sRNAs less than 200 nucleotides (nt) in length recently reported to be derived from genes in higher eukaryotes. The best documented are sRNAs from the promoter regions of protein-coding loci. In humans, these include promoter-associated small RNAs (PASRs)12,13 as well as the uncapped TSSa14 and transcription initiation RNAs (tiRNAs)15,16. The latter two species have been proposed to be byproducts of RNAPII arrest followed by backtracking and sRNA liberation by transcription factor IIS (TFIIS)-assisted cleavage of the nascent transcript3,4,15. Uncapped sRNAs whose 3′ termini map precisely to the 3′ end of exons have been suggested to derive from a similar mechanism and are termed splice-site RNAs (spliRNAs)16. Finally, sRNAs are also found at the 3′ end of genes12,17. These termini-associated RNAs (TASRs) can be both sense and antisense (aTASRs) with respect to the gene. The aTASRs carry a non–genomically encoded 5′ poly(U) tail and have been proposed to be generated by an as-yet-unidentified RNA-dependent RNA polymerase17.
To gain a better understanding of the biogenesis and potential utility of some of these sRNA species, we conducted deep sequencing analyses of RNAs of different size ranges and origins. Our study provides mechanistic models for the sources of several reported sRNAs and reveals previously unknown types of AGO1/2-associated molecules.
General mapping patterns of sRNAs from genic regions
To obtain an overview of sRNA species originating from protein-coding genes, we prepared and sequenced libraries of HeLa sRNAs carrying 5′ end monophosphates and size-selected them into pools of 18- to 30-nt sRNAs and 30- to 100-nt sRNAs (HeLa18–30 and HeLa30–100, respectively). The impact of RNA degradation machineries was investigated by depleting the hRRP40 core subunit of the 3′–5′ exo- and endonucleolytic RNA exosome (HeLa18–30(RRP40) and HeLa30–100(RRP40)) or both of the 5′–3′ exonucleases XRN1 and XRN2 (HeLa18–30(XRN1/2) and HeLa30–100(XRN1/2)). Moreover, HeLa nuclear RNAs in two different size ranges (12–20 nt and 18–30 nt) were sequenced (HeLa12–20(N) and HeLa18–30(N)). Finally, to reveal candidates related to the RNA-mediated interference (RNAi) pathway, we also prepared libraries from sRNAs immunoprecipitated by AGO1/2 proteins in either the presence (HeLa18–30(AGO1/2)) or the absence (HeLa18–30(AGO1/2/RRP40)) of hRRP40. All RNA libraries are listed in Supplementary Table 1. Appropriate factor depletion-, immunoprecipitation- or nucleocytoplasmic fractionation efficiencies were verified (Supplementary Fig. 1 and ref. 18). Sequencing reactions yielded from ~1.5 million to ~11 million single-site mappable sRNAs, of which 77% in the AGO1/2 libraries and 50–60% from the other libraries of 18- to 30-nt sRNAs originated from annotated miRNAs. However, a few highly expressed miRNAs dominated, and only 2–8% of all unique RNAs from the non-AGO1/2 18- to 30-nt libraries derived from annotated miRNAs (Supplementary Fig. 2). Thus, many other distinct sRNAs were present.
To focus on genic regions, densities of uniquely mapped sRNAs were plotted using a set of gene-associated reference points: the transcription start site, the 5′ splice site, the 3′ splice site and the 3′ end cleavage-polyadenylation site (Fig. 1a and Supplementary Fig. 2). To avoid biases due to high-read outliers, we removed RNAs overlapping repeat regions as well as known sRNAs (for example, miRNAs, tRNAs and snoRNAs). Moreover, as we were interested in general sRNA patterns, RNA counts were 'collapsed' so that a unique read was only counted once, regardless of how many times it was sequenced. We generally observed similar overall distributions using noncollapsed RNA counts, although high-read outliers (possibly unannotated miRNAs) occasionally confounded results. A schematic overview of the mapping pattern obtained is shown in Figure 1a. It shows that small RNAs are enriched in exons, upstream and downstream of TSSs, at exon-intron borders and near polyadenylation sites.
sRNAs as degradation intermediates of introns and exons
For all sRNA sizes interrogated, mapping densities within exons are ~10- to 25-fold higher than within introns (Fig. 1b and Supplementary Figs. 2 and 3a). Moreover, when mapping sRNAs to cDNA, tag density is largely uniform over exon-exon junctions (Fig. 2a, middle, and Supplementary Fig. 3b,c), arguing that the overall majority of exonic sRNAs originate from spliced mRNA and that the diverse sizes probably arise from different degradation intermediates. Consistently, a larger fraction of sRNAs from HeLa18–30(XRN1/2), HeLa30–100(XRN1/2), HeLa18–30(RRP40) and HeLa30–100(RRP40) libraries map to exons, compared to the HeLa18–30 and HeLa30–100 libraries, in line with a contribution of both 5′–3′ and 3′–5′ exonucleolysis to mRNA decay (Fig. 2a, middle; Supplementary Fig. 3). We speculate that the decrease in sRNA 5′ ends at the −1 and +1 positions has a technical origin and may be due to an adaptor ligation bias toward these nucleotides, as previously reported19.
One specific exonic feature, which stood out from this general 'background layer' of sRNA, was the presence of 5′ splice site–associated (5′SSa) RNAs whose 3′ ends aligned with the exon-intron border (Fig. 2a, left and middle, and Supplementary Fig. 4a,b), akin to the previously reported spliRNAs16. However, exonic 5′SSa RNAs were detectable independent of RNA size even when considering the RNAs that were 12- to 20-nt and 30- to 100-nt long and thus collectively show a staggered appearance of their 5′ ends (Fig. 2a, left and middle, and Supplementary Fig. 4a,b). The distinct RNA 3′ end positions and the fact that these sRNAs have lengths ranging from 12 to >36 nt, preferentially >15 nt, are difficult to reconcile with the idea that RNAPII backtracking followed by nascent transcript cleavage is the source of their production16. This is because TFIIS-induced relief of backtracked RNAPII typically liberates RNA products up to 9 nt in length and almost never exceeding 14 nt in length20,21,22 (see also Discussion). Instead, the observed mapping pattern is more consistent with 5′–3′ exonucleolytic trimming of exons liberated from their downstream introns, possibly as the result of a failure to undergo the second step of splicing (Fig. 2b, left). Thus, 5′SSa RNAs may be signature molecules of a rate-limiting step in the degradation process, for example, protection by leftover components of the splicing machinery, which could result in the enrichment of certain sRNAs; namely, 18 nt in the HeLa18–30 library and 16–17 nt in the HeLa12–20(N) library (Fig. 2a). Unexpectedly, the 5′SSa 18-nt species is more abundant in the total RNA preparation (HeLa18–30) (Fig. 2a) than in the nuclear fraction (HeLa18–30(N)) (Supplementary Fig. 3b), possibly indicating the export of a fraction of these species into the cytoplasm.
The sRNAs whose termini align to exon-intron borders are also enriched at intronic 5′- and 3′ ends (Fig. 2a, left and right; Supplementary Figs. 3b and 4). Those sRNAs mapping to the 5′ end of introns show 3′ end staggering (Fig. 2a, left, and Supplementary Fig. 4a,b). Conversely, RNAs at intron 3′ ends have staggered 5′ ends (Fig. 2a, right, and Supplementary Fig. 4c,d). Again, these profiles are most compatible with a scenario where introns are degraded exo- and/or endonucleolytically and where the final removal of intron termini provides a rate-limiting step (Fig. 2b, right). Tag densities of these RNAs are increased at both intron ends in the HeLa18–30(XRN1/2) and HeLa30–100(XRN1/2) libraries (Fig. 2a, right and middle, and Supplementary Fig. 4). Thus, with some redundancy provided by 3′–5′ exonucleolysis, intronic SSa RNAs and their precursors appear to be primarily removed by 5′–3′ degradation.
As observed in several previous studies14,15,16, HeLa cells also accumulate TSSa sRNAs that map in both sense and antisense directions with respect to the gene (Fig. 3a). As many promoters have an array of different start sites ('broad promoters') as opposed to a single, predominant TSS ('sharp promoters')23, we subdivided promoters into these two categories according to their TSS distributions as defined by cap-selected RNA 5′ ends (CAGE) data24, and we plotted 3′ end reads for the 18- to 30-nt–sized libraries (Fig. 3a and Supplementary Fig. 5). Broad promoters show a wider distribution of sense TSSa RNAs with the peak of sRNA 3′ ends located between 30 nt and 40 nt downstream of the TSS. Moreover, these promoters are associated with antisense TSSa RNAs whose 3′ ends show an even wider distribution, mapping 150–200 nt upstream of the TSS. Conversely, sharp promoters create a fairly narrow average sense TSSa RNA 3′ end peak at +38–39 nt relative to the TSS, although some 3′ end staggering is also evident (Fig. 3a). Notably, relative levels of antisense TSSa RNA signals from the −250 to −50 region were decreased by a factor of ~1.32 for sharp compared to broad promoters, whereas they increased by a factor of ~1.36 in the sense direction from the +1 to +100 region (Fig. 3a). Both changes are statistically significant (P << 0.001, exact binomial test). Thus, assuming that these RNAs are indicative of RNAPII transcription mechanisms, it appears that sharp promoters provide more accurate directionality, as also previously suggested24.
Like exonic 5′ SSa RNAs, TSSa sRNAs have been proposed to be byproducts of RNAPII backtracking3,4,15,16. When we plotted the size distribution of TSSa sRNAs from HeLa18–30 libraries, we found that all sizes could be detected, albeit with a demonstrated preference for RNAs that were <22-nt long and, most prominently, 20-nt long (Fig. 3b). However, when we considered HeLa12–20(N) library reads that map perfectly to either a single or to multiple locations of the genome, sense TSSa RNAs >16 nt in length were clearly enriched immediately downstream of the TSS, whereas smaller RNAs showed a low and uniform density over the region, without any strand preference (Fig. 3c). This lack of enrichment of RNAs <17 nt in length was not due to mapping ambiguities caused by the short length of the sequence tags, as we reliably detected aggregations of similarly sized RNAs around splice sites (Fig. 2a). Thus, the size of TSSa RNAs is restricted to a length of ≥17 nt. As for exonic 5′SSa RNAs, this size range conflicts with the notion that these molecules are liberated as a result of RNAPII backtracking20,21,22, and we therefore considered alternative mechanisms for their origin.
In both Drosophila melanogaster and human cells, a substantial number of genes harbor stalled RNAPII immediately downstream of their TSSs22,25,26,27,28,29. To analyze the relationship between emission of TSSa RNAs and RNAPII positioning, we focused on genes that have one or more sRNA 3′ ends positioned exactly at +38 downstream of 'sharp' CAGE-defined promoters (the peak in Fig. 3a). For all four HeLa18–30 libraries tested, a markedly tight overlap between TSSa RNA 3′ end position and the position of the center of RNAPII as determined by chromatin immunoprecipitation (ChIP)30 was obtained (Fig. 4a and Supplementary Fig. 6a). Moreover, RNAPII levels appear to increase with the number of sRNAs. Taken together, these results strongly suggest that RNAPII and TSSa RNA 3′ ends are positioned as illustrated in Figure 4b. As the RNA residing inside the RNAPII complex is ~17–20 nt in length31, much like the average size of TSSa RNAs, an appealing model is that these transcripts are remnants of the decay of RNAs partly protected by stalled RNAPII complexes failing to resume transcription elongation. If such degradation is caused by 5′–3′ exonucleolysis, it is expected that XRN1/2 depletion would result in a higher proportion of intact RNAs >30-nt long, whose 5′ ends would map to the TSS, and fewer RNAs < 30-nt long ending approximately at position +38. Indeed, this is what we observed when plotting the density of RNA 3′ ends from the HeLa18–30 and HeLa18–30(XRN1/2) libraries and RNA 5′ ends from the HeLa30–100 and HeLa30–100(XRN1/2) libraries within TSS regions of genes having at least one HeLa18–30 sRNA 3′ end between positions +36 and +40 (Fig. 4c and Supplementary Fig. 6b,c). Although not conclusive, this combined pattern suggests that 5′–3′ exonucleolysis is a mechanism for the creation of TSSa RNAs (Fig. 4b).
Identification of human tailed mirtrons
Although exonic sRNAs associated with TSSs and 5′SSs are generally not bound by AGO1/2 proteins, we observed one peak enriched by the AGO1/2 immunoprecipitate whose sRNA 5′ ends align with the 5′ ends of introns (Fig. 5a, left) and two additional peaks close to intron 3′SSs (Fig. 5a, right). All these sRNAs were enriched for molecules 20–24 nt in length, with 22 nt, the average size of miRNAs9, being the most prominent (Fig. 5b). The sharp 3′SS proximal sRNA peak was positioned such that the 3′ ends of its reads precisely coincided with the intron-exon junction (Fig. 5c and Supplementary Fig. 7). In cases where the read extended across the 3′SS, the additional nucleotides were most often nontemplated additions (Supplementary Fig. 8). The 5′ ends of the upstream and broader peak centered ~60 nt from the 3′SSs, reflecting a distance typical of the size of a precursor miRNA (pre-miRNA).
These observed patterns of putative miRNA 5′ or 3′ ends aligning with exon-intron or intron-exon junctions are reminiscent of a mirtron biogenesis pathway, best known from D. melanogaster32,33, where the pre-miRNA is generated by pre-mRNA splicing instead of processing by Drosha. We therefore searched our data for candidate human mirtron genes. To this end, we subjected 289 mirtron candidates to a set of criteria (see Methods); most importantly, AGO1/2 protein-association as defined by their presence in the HeLa18–30(AGO1/2) or HeLa18–30(AGO1/2/RRP40) libraries, the propensity of the predicted pre-miRNA to fold into an RNA hairpin, and alignment of sRNA termini with at least one of the SSs. This analysis resulted in 37 newly revealed, confidently annotated putative mirtrons (Supplementary Table 2). Unexpectedly, only one of these, located in the ZYX gene, had both hairpin ends at the intron junctions. Inspection of the remaining candidates (see Fig. 5c,d and Supplementary Fig. 7 for examples), revealed a read signature characteristic of so-called tailed mirtrons, where only one pre-miRNA end is defined by splicing, and the other is processed by removal of the flanking tail as recently reported in D. melanogaster34. However, unlike in flies, where all known tailed pre-mirtrons bear 3′ extensions, we only identified two such examples (Fig. 5d and Supplementary Table 2). The remaining ones carried tails at their 5′ ends (Fig. 5c and Supplementary Table 2). The sRNA read pattern from a 5′ tailed Mus musculus mirtron35 (mmu-miR-1982) is very similar to the patterns found here, arguing that their biogenesis is also similar. Importantly, mmu-miR-1982 sRNA reads are depleted in Dicer, but not Drosha, knockout cells, demonstrating Dicer-dependent, Drosha-independent biogenesis35. We used a splinted ligation technique36 to validate expression of a representative 5′ tailed mirtron candidate in the EEF1G gene (Fig. 5e). 
Furthermore, we reclassified seven previously annotated human miRNAs as five 5′-tailed and two 3′-tailed mirtrons. Finally, a substantial fraction of the identified 5′-tailed mirtron reads carry 3′ non-templated A and/or U additions (Supplementary Fig. 8 and data not shown), a feature that is also frequently found on canonical microRNAs37.
The biogenesis pathway of the mature AGO1/2-bound miRNAs from these loci is unlikely to follow the exosome-dependent route described in flies34. Perhaps reflecting the requirement that 5′ exonucleases with limited processivity remove their tails, human introns harboring 5′-tailed mirtrons are markedly shorter than the overall average (1,069 nt versus 6,150 nt; Supplementary Table 2).
Argonaute-associated sRNAs derived from mRNA termini
We also found an enrichment of AGO1/2-associated sRNAs in 3′ untranslated regions (UTRs) compared to upstream protein-coding exons (Fig. 6a). Within 3′ UTRs, AGO1/2-associated sRNAs 22–24 nt in length (Fig. 6b) particularly cluster close to the mRNA 3′ end (Fig. 6c, compare non-AGO1/2 immunoprecipitate libraries (HeLa18–30) to HeLa18–30(AGO1/2) libraries). Visual inspection of selected loci confirmed the presence of miRNA-sized sRNAs whose 3′ ends aligned with the annotated polyadenylation sites, suggesting either the involvement of the canonical pre-mRNA 3′ end cleavage machinery in their biogenesis, or 3′ nucleolytic trimming of polyadenylated transcripts. We refer to these sRNAs as transcription termination site–associated (TTSa) RNAs. The most prominent example was found at the end of the RPL5 gene (Fig. 6d). Additional examples are shown in Supplementary Figure 9. Again, splinted ligation36 was used to validate the presence of TTSa RNAs of the expected size originating from the RPL5 locus. Importantly, the signal was enriched in the AGO1/2 immunoprecipitate material (Fig. 6e). The regions surrounding the AGO1/2-associated TTSa RNAs have poor potential to form secondary structures (data not shown), and we did not find evidence for molecules corresponding to the respective passenger strand10. Therefore, these sRNAs are probably not generated by the canonical miRNA biogenesis pathway.
In recent years a multitude of previously unknown eukaryotic ncRNAs have been exposed. As most of these discoveries are not directed by genetic analyses, there is a growing need to sort these molecules by their modes of biogenesis and putative function. Here, we have focused on RNA species <100 nt in size originating from within and in close proximity to protein-coding genes. To get a complete view of the general origin of these sRNAs, we mapped library reads to both genomic and cDNA (mature mRNA) sequence information after filtering the data against highly expressed RNA species that would otherwise obscure any generic features. In all libraries investigated, sequence reads were derived more frequently from exonic than intronic regions (Fig. 1b), and read density was typically constant over exon-exon boundaries (Fig. 2a), indicating that these sRNAs derived from the degradation of mature mRNA. Moreover, we suggest that the prominent sRNA peaks not associated with AGO1/2, near the 3′ end of exons and at both intron termini, also originate from RNA decay. This is because one end of these reads generally aligns with exon-intron or intron-exon junctions, whereas the other appears staggered, creating multiple sRNA lengths of 12–100 nt that are consistent with exonucleolysis. Such a pattern most likely stems from degradation intermediates of 5′ exons that failed to complete splicing as well as the 3′–5′ and/or 5′–3′ removal of excised and debranched introns (Fig. 2b). The preferred size of exonic 5′SSa RNAs from the HeLa18–30 library is 18 nt, which was previously suggested to be a conserved feature of these molecules16. However, in the HeLa12–20(N) library, similarly positioned sRNAs of 12–17 nt in length with their 3′ ends aligned to the exon-intron border are detected at high density compared to flanking regions. 
We suggest that this represents a constraint to RNA decay, either as a result of the intrinsic properties of the responsible degradation enzyme(s) or by obstructing RNA binding proteins, possibly by splicing factors that remain associated with splicing intermediates. Similarly, we propose that the sRNA peaks positioned at the 5′ and 3′ ends of introns result from the same kind of rate-limiting steps of complete intron removal (Fig. 2b).
Because of the presence of 5′-monophosphate and 3′-hydroxyl groups, TSSa sRNAs have been proposed to arise from endocleavage of nascent RNA 3′ ends extruding from the RNAPII exit channel following backtracking away from impediments in transcription3,4,15,16. According to this model, realignment of the RNA 3′ end with the RNAPII active site would require TFIIS, which triggers an RNA cleavage activity intrinsic to RNAPII. Backtracking has been studied in vitro20,21 and in D. melanogaster S2 cells in vivo22, and in both systems the majority of TFIIS-dependent liberated RNA fragments were found to be in the size range of 4–14 nt. This appears to be incompatible with observations in this study that TSSa RNAs are predominantly ≥17 nt long (Fig. 3c). Rather, the minimal TSSa RNA length of 17 nt fits very well with the size of the nascent RNA residing inside, and presumably protected by, the RNAPII complex31 (Fig. 4b). In line with this idea, we find a strong correlation between the position of TSSa RNA 3′ ends and the center of RNAPII as defined by its ChIP sequencing (ChIP-Seq) peak (Fig. 4a), suggesting that this molecular arrangement indeed takes place in vivo. Data from D. melanogaster S2 cells have shown that the RNAPII ChIP-Seq peak corresponds to the position of RNAPII on the DNA template after backtracking22,26,38, making it further unlikely that TSSa production results from TFIIS-induced endocleavage, as TSSa RNA 3′ ends and the catalytic center of RNAPII would then have to be offset by ~20 nt relative to each other. We instead suggest that TSSa RNAs arise as a result of unsuccessful transcription elongation events, after RNAPII stalling at, for example, the +1 nucleosome. Notably, this does not rule out that backtracking-mediated TFIIS cleavage also occurs, generating sRNAs too short for our libraries to capture. Moreover, our data suggest that 5′–3′ exonucleolysis may contribute to TSSa RNA production (Fig. 4b,c). 
One intriguing possibility, therefore, is that an early transcriptional 'checkpoint' is associated with RNAPII stalling to discard transcription complexes erroneously engaged in the elongation of uncapped RNAs, a phenomenon previously reported in Saccharomyces cerevisiae39.
mRNA processing generates diverse miRNA-class small RNAs
First discovered in D. melanogaster and the nematode Caenorhabditis elegans, mirtrons are a class of short introns that can be spliced and debranched to form pre-miRNA mimics, thereby bypassing the need for Drosha and proceeding directly to Dicer cleavage and incorporation into silencing complexes32,33. Computational methods and high-throughput sequencing later suggested the presence of mirtrons in vertebrates ranging from Gallus gallus (chicken) to humans35,40,41,42,43. In contrast to mirtrons where the ends of the hairpin coincide precisely with both splice sites, only the 5′ end of the D. melanogaster locus mir-1017 coincides with the 5′SS. To allow Dicer cleavage, the tail separating the pre-miRNA hairpin from the 3′SS needs to first be trimmed by the exosome34.
Here, we identify 36 tailed human mirtrons (Supplementary Table 2). Because mirtrons have more lenient secondary structure requirements compared to classical miRNAs, often tolerating an extended stem or an increased size of the terminal loop44, this number is probably underestimated. It thus appears that tailed mirtrons constitute an underappreciated subgroup of human miRNAs. Despite the fact that a few identified candidates have already been annotated in miRBase45, none of the affiliated papers classify them as tailed mirtrons. Notably, in 34 out of 36 cases, the 3′ end of the proposed pre-miRNA coincides with the 3′SS, whereas the 5′ end is separated from the 5′SS by a tail of variable length. Recent studies in murine35, avian41 and bovine42 cells have identified a total of nine tailed mirtrons, all of which are 5′ tailed, suggesting that the 5′ tail preference is conserved in vertebrates. How this 5′ tail is removed before Dicer processing remains unknown. Possible mechanisms include 5′–3′ exonucleolysis or endonucleolysis. The putative mirtrons are evolutionarily poorly conserved, even among mammalian genomes. Lack of conservation of mirtrons has also been observed between closely related Drosophila species46, confirming that mirtrons evolve more rapidly than canonical miRNAs.
Ultimately, the function of these newly discovered species of AGO1/2-associated RNAs remains enigmatic. They might operate like canonical miRNAs by regulating gene expression in trans. Alternatively, their location, in particular the overlap of the precursor with the branch point and the polypyrimidine tract—two important splicing elements—positions them to putatively influence splicing of their host introns in cis. Indeed, intronic sequences with secondary structures reminiscent of mirtrons have been shown to be involved in alternative splicing47. Conversely, splicing regulators might affect the efficiency of mirtron production. Notably, the majority of the candidate 5′ tailed mirtrons fold into hairpins, such that the splicing branch-point consensus sequences ('YUNAY')48 of the host intron fall into the loop region between the two arms (Fig. 5c and Supplementary Fig. 7), which is often targeted by regulators of miRNA biogenesis, including splicing factors such as K(H)SRP, ASF/SF2 and hnRNPA7.
We also identified another rare class of sRNAs (TTSa RNAs) that was enriched in the AGO1/2 immunoprecipitate. Notably, the 3′ ends of many of these ~23-nt long species align with the polyadenylation tail addition site of annotated genes (Fig. 6d and Supplementary Fig. 9), suggesting that mRNA 3′ end processing is part of their biogenesis pathway. The absence of any hairpin potential also strongly suggests that TTSa RNAs are generated by a mechanism distinct from known microRNA maturation. It is noteworthy that overlap analysis of TTSa RNAs from our AGO1/2 libraries and the recently discovered aTASRs, which run antisense to the very 3′ end of annotated transcripts17, shows that 73% of mRNA 3′ ends with at least one TTSa RNA read overlap with an aTASR (Fig. 6d, Supplementary Fig. 9 and data not shown). Thus, a TTSa RNA biogenesis pathway involving RNA–RNA pairing between an aTASR and the mRNA 3′ end is one possibility. We also found several selected examples of 3′ UTR-derived AGO1/2-associated sRNAs that do not map to polyadenylation sites (data not shown). These may be similar to the Schizosaccharomyces pombe49 primal RNAs (priRNAs) and/or PIWI RNAs (piRNAs) found in flies and mammals50,51,52,53,54, because it is suggested that both these classes are processed from single-stranded host mRNA molecules. Our finding therefore points to these genic regions as conserved sources of sRNAs capable of interacting with a broad spectrum of AGO family proteins.
New sequencing technologies are identifying RNAs at a fast pace, creating an increasing gap between sRNA identification and characterization of their function and biogenesis. We used AGO1/2 association to suggest the function of sRNAs originating from human genic regions and found that such molecules are derived from introns and 3′ UTRs. Future analyses will reveal whether they operate as bona fide miRNAs and/or interrelate with the processing reactions from which they may derive. Although TSSa and SSa RNAs are not bound by AGO1/2 proteins, a putative function cannot be readily dismissed. However, we note for now that these sRNAs are probably signature molecules that reveal mechanistic features of eukaryotic transcription and splicing.
Cell culture, RNAi, RNA preparation and western blot analysis.
HeLa cells were grown in DMEM GlutaMAX medium (Invitrogen) supplemented with 10% (v/v) fetal bovine serum. Transfections were done with 20 nM siRNA for 3 d and repeated for another 3 d, each time using Lipofectamine 2000 as transfecting agent according to the manufacturer's instructions (Invitrogen). XRN1, XRN2, RRP40 and control (eGFP) siRNA sequences were as previously described55. Total RNA was extracted using TRIzol reagent (Invitrogen) according to the manufacturer's instructions. RNA was subjected to DNase I treatment, repurified by phenol-chloroform extraction and reprecipitated. Western blot analyses were carried out according to standard procedures and developed by enhanced chemiluminescence (ECL Plus; GE Healthcare). Polyclonal anti-XRN2 antibodies (A301-102A-1) were purchased from Bethyl Laboratories. Polyclonal anti-XRN1 and anti-RRP40 antibodies were gifts from J. Lykke-Andersen and G. J. Pruijn, respectively. Monoclonal anti-hnRNPc and anti-ADAR1 antibodies were gifts from D. L. Black and K. Nishikura, respectively.
Monoclonal antibodies for human AGO1 (4B8)56 and AGO2 (11A9)57 were coupled to protein G–Sepharose beads overnight at 4 °C. Beads were washed once with PBS buffer and twice with lysis buffer (25 mM Tris-HCl, pH 7.4, 150 mM KCl, 0.5% (v/v) NP-40, 2 mM EDTA, 1 mM NaF). HeLa cell pellets (300 mg each) were lysed in ten volumes of lysis buffer for 20 min on ice. Lysates were cleared by centrifugation at 17,000g for 30 min before adding them to the beads. After 4 h of rotation at 4 °C, the beads were washed three times with wash buffer (300 mM NaCl, 50 mM Tris-HCl, pH 7.4, 1 mM MgCl2 and 0.1% (v/v) NP-40) and once with PBS. RNA was recovered by proteinase K digestion, followed by acidic phenol extraction and ethanol precipitation. Immunoprecipitation efficiency was assessed by carrying out western blot analysis on a fraction of the protein recovered on the beads (Supplementary Fig. 1c).
Library construction and sequencing.
Library construction and sequencing were performed as a paid service by the Beijing Genomics Institute (BGI). In brief, RNA of the desired size was isolated from polyacrylamide gels and ligated to 3′ and 5′ adaptor oligonucleotides. Ligation products were purified, reverse transcribed and PCR amplified. Sequences of the adaptors and primers are as published (Illumina). Samples were sequenced on an Illumina 2G Genome Analyzer.
Small RNA detection by splinted ligation was carried out essentially as described36. The following oligonucleotides were used: ligation oligonucleotide: 5′-CGCTTATGACATT-3ddC-3′ (where '3ddC' denotes a 2′,3′-dideoxycytidine residue); bridging oligonucleotides: RPL5: 5′-GAATGTCATAAGCGGCTGTTCATAAGTTTATTATCTAT-3′; EEF1G 3p: 5′-GAATGTCATAAGCGGCTGGTGCAGAGGAAGGCAGGAAA-3′; EEF1G 5p: 5′-GAATGTCATAAGCGTGTTCTGCCTCTTTCCACACCCCT-3′.
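The design of the bridging oligonucleotides can be checked with a short script. The sketch below assumes, which the text does not state explicitly, that each bridge is the reverse complement of the ligated product (small RNA followed by the ligation oligonucleotide, with the terminal dideoxycytidine treated as a plain C). The function names are illustrative, and the inferred targets are written in the DNA alphabet.

```python
# Sketch: split each bridging oligonucleotide into the part pairing with the
# ligation oligonucleotide and the part pairing with the small RNA.
# Assumption (not stated in the Methods): bridge = reverse complement of
# (small RNA + ligation oligonucleotide).

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Ligation oligonucleotide as DNA; the 3' ddC is represented by "C".
LIGATION_OLIGO = "CGCTTATGACATT" + "C"

BRIDGES = {
    "RPL5": "GAATGTCATAAGCGGCTGTTCATAAGTTTATTATCTAT",
    "EEF1G 3p": "GAATGTCATAAGCGGCTGGTGCAGAGGAAGGCAGGAAA",
    "EEF1G 5p": "GAATGTCATAAGCGTGTTCTGCCTCTTTCCACACCCCT",
}


def inferred_target(bridge: str, ligation_oligo: str = LIGATION_OLIGO) -> str:
    """Return the sequence expected to pair with the 3' part of the bridge."""
    anchor = reverse_complement(ligation_oligo)
    if not bridge.startswith(anchor):
        raise ValueError("bridge does not begin with the ligation-oligo complement")
    return reverse_complement(bridge[len(anchor):])


if __name__ == "__main__":
    for name, bridge in BRIDGES.items():
        target = inferred_target(bridge)
        print(name, len(target), target)
```

Under this assumption, all three bridges share the same 14-nt 5′ segment complementary to the ligation oligonucleotide, and the remaining 24 nt define the interrogated small RNA 3′ end.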
We used Bowtie58 with default settings to map all >18-nt libraries to hg18. If a read had more than ten hits, only ten were retained and the rest were discarded. Reads were normalized for multiple alignments (read count divided by the number of alignments), and unless otherwise specified (for example, single mapping), we used these normalized scores. For the 12-nt to 20-nt library, we required zero mismatches and used a threshold of 50 hits.
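The normalization for multiple alignments can be illustrated as follows. This is a minimal sketch, not the pipeline actually used: it assumes that at most ten hits per read are retained, in the order reported by the aligner, and that each read contributes a total weight of 1 divided equally among its retained hits. The input structure and function name are hypothetical.

```python
# Sketch: fractional weighting of multi-mapping reads.
# Assumptions: hits beyond MAX_HITS are dropped; each read's total weight is 1.

MAX_HITS = 10  # threshold described for the 18- to 30-nt and larger libraries


def alignment_weights(hits_per_read, max_hits=MAX_HITS):
    """Map (read_id, locus) -> fractional weight.

    hits_per_read: dict of read_id -> list of loci, in aligner order.
    """
    weights = {}
    for read_id, loci in hits_per_read.items():
        kept = loci[:max_hits]
        if not kept:
            continue  # unmapped read, nothing to weight
        weight = 1.0 / len(kept)
        for locus in kept:
            weights[(read_id, locus)] = weight
    return weights


if __name__ == "__main__":
    example = {
        "read_a": [("chr1", 100)],
        "read_b": [("chr1", 200), ("chr5", 900)],
    }
    for key, weight in alignment_weights(example).items():
        print(key, weight)
```

For the 12- to 20-nt library, the same function could be called with `max_hits=50`.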
Filtering and collapsing of reads.
Unless otherwise specified, we used the following conventions: we filtered out reads overlapping repeats, RNA genes (wgRna, rnaGenes and rmsk from UCSC hg18) or the genes targeted in the knockdown. We collapsed sRNA reads so that identical reads contributed a count of 1, regardless of how many times they were sequenced.
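A minimal Python sketch of these two conventions follows. It assumes that reads are represented as (chrom, start, end, strand, sequence) tuples with half-open coordinates and that masked regions (repeats, RNA genes, knockdown targets) are supplied as intervals per chromosome. The data layout and function names are illustrative only.

```python
def collapse_reads(reads):
    """Collapse identical reads so that each contributes a count of 1.

    reads: iterable of (chrom, start, end, strand, sequence) tuples.
    """
    return sorted(set(reads))


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one position."""
    return a_start < b_end and b_start < a_end


def filter_reads(reads, masked):
    """Drop reads overlapping any masked interval on the same chromosome.

    masked: {chrom: [(start, end), ...]} built from, for example, the
    repeat and RNA-gene annotation tracks named in the text.
    """
    kept = []
    for read in reads:
        chrom, start, end = read[0], read[1], read[2]
        if any(overlaps(start, end, m_start, m_end)
               for m_start, m_end in masked.get(chrom, [])):
            continue
        kept.append(read)
    return kept
```

A linear scan over masked intervals is used here for clarity; an interval tree or sorted sweep would be preferable at genome scale.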
RNA distributions around reference locations.
We used FANTOM3 human CAGE-tag clusters24 with >30 tags for defining TSSs unless otherwise specified. We selected the largest peak within each tag cluster as our reference location. We used GM-distance clustering59 to divide the tag clusters into single and broad peaks, as done previously24. Collapsed single-mapping sRNAs were summed for each position around each TSS. For 12-nt to 20-nt long reads, we used both single- and multimapping reads.
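The per-position summation around reference locations can be sketched in Python as follows. The sketch assumes that collapsed read ends and reference sites are given as (chrom, pos, strand) tuples, counts only reads on the same strand as the site, and uses a hypothetical window of 100 nt on either side; none of these choices is specified in the text.

```python
from collections import Counter


def profile_around(sites, read_ends, flank=100):
    """Sum collapsed read ends at each position relative to reference sites.

    sites: iterable of (chrom, pos, strand) reference locations.
    read_ends: set of (chrom, pos, strand) collapsed read ends.
    Returns Counter {relative_position: count}; positive values lie
    downstream of the site with respect to its strand.
    """
    profile = Counter()
    for chrom, site_pos, strand in sites:
        for r_chrom, r_pos, r_strand in read_ends:
            if r_chrom != chrom or r_strand != strand:
                continue
            rel = r_pos - site_pos if strand == "+" else site_pos - r_pos
            if -flank <= rel <= flank:
                profile[rel] += 1
    return profile
```

The nested loop keeps the logic explicit; indexing reads by chromosome and strand would be needed for realistic data volumes.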
University of California, Santa Cruz (UCSC) gene annotations from assembly hg18 were used for TSSs, splice sites and 3′ ends. We counted each unique location only once, even if it was shared by multiple isoforms. For each set, we counted the sum of unique single-mapping sRNAs as above. For RNAs 12–20 nt in length, we also used multimapping sRNAs. For assessing spliced mRNAs, all known genes in UCSC were spliced together, and we mapped sRNAs within the 3,000 nt upstream of the TSS to 1,000 nt downstream of the TTS of the spliced fragments. Only internal exons and perfect sRNA matches were used in the analysis. We normalized mapped counts by the total number of alignments. When assessing the distribution of sRNAs over introns or 3′ UTRs, we normalized individual regions by dividing by the total length of the region. For 3′ UTRs, this was done on spliced mRNAs.
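Two of the conventions above, counting each annotated location once across isoforms and normalizing counts by region length, can be written as a short Python sketch. The function names and the tuple representation of sites are assumptions made for illustration.

```python
def unique_sites(isoform_sites):
    """Keep each annotated location once, however many isoforms share it.

    isoform_sites: iterable of (chrom, pos, strand) tuples gathered
    from all isoforms.
    """
    return sorted(set(isoform_sites))


def length_normalized_density(read_count, region_length):
    """Divide the sRNA count of a region by the region length in nt."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return read_count / region_length
```

For example, five sRNAs in a 1,000-nt intron would give a density of 0.005 sRNAs per nucleotide.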
TTS overlap analysis.
We mapped collapsed reads to within 10 nt of unique TTS and overlapped these with aTASR reads17. The P value was calculated with a Fisher's exact test using the number of TTS with aTASRs, TTSa RNAs, none or both in the contingency table.
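The contingency-table test can be illustrated with a self-contained Python sketch. It assumes a two-sided Fisher's exact test (the text does not state the sidedness) and arranges the 2 × 2 table as TTSs with both RNA classes, with aTASRs only, with TTSa RNAs only and with neither. The implementation sums hypergeometric probabilities directly and is not the software used in the original analysis.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    a: TTSs with both aTASRs and TTSa RNAs
    b: TTSs with aTASRs only
    c: TTSs with TTSa RNAs only
    d: TTSs with neither
    Returns the summed probability of all tables (with the same margins)
    that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))  # tolerance for float ties
    return min(1.0, total)
```

Any counts passed to the function here would be placeholders; the actual numbers of TTSs in each cell are not given in this section.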
Densities of sRNA within specific gene features.
For a given feature, we counted the number of sRNA reads (total or unique) whose 5′ or 3′ ends were located within that feature (Supplementary Fig. 2) on the same strand. The exonic category did not include the overlaps with other exon-derived categories, and it took precedence over intronic in cases of multiple annotations. Reads were filtered as described above in the 'Filtering and collapsing of reads' section, except for the miRNA category.
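The precedence rule for overlapping annotations can be sketched in Python as below. The category names and their full ordering are hypothetical; the text states only that other exon-derived categories are excluded from the exonic category and that exonic takes precedence over intronic. Features are assumed to be half-open intervals, and only same-strand features are considered, as described above.

```python
from collections import Counter

# Hypothetical ordering: exon-derived categories first, then exonic, then intronic.
PRECEDENCE = ("five_prime_utr", "three_prime_utr", "exonic", "intronic")


def assign_category(chrom, pos, strand, features, precedence=PRECEDENCE):
    """Return the highest-precedence category covering a read end, or None.

    features: iterable of (category, chrom, start, end, strand) tuples
    with half-open coordinates.
    """
    hit = {cat for cat, f_chrom, start, end, f_strand in features
           if f_chrom == chrom and f_strand == strand and start <= pos < end}
    for category in precedence:
        if category in hit:
            return category
    return None


def count_by_category(read_ends, features):
    """Count read ends (chrom, pos, strand) per assigned category."""
    counts = Counter()
    for chrom, pos, strand in read_ends:
        category = assign_category(chrom, pos, strand, features)
        if category is not None:
            counts[category] += 1
    return counts
```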
RNAPII ChIP sequencing.
XRN1/2-dependent production of TSSa RNA.
We analyzed CAGE tag clusters as above. To account for relative abundance, we counted up to ten identical sRNAs instead of collapsing unique reads (Supplementary Fig. 6). We used promoters with 3′ ends of HeLa18–30 sRNAs between positions +36 and +40.
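The capped counting scheme and the promoter selection can be expressed as a brief Python sketch. It assumes that reads are given as hashable identifiers (for example, sequence plus mapped position) and that 3′ end positions are already expressed relative to the TSS; the function names are illustrative.

```python
from collections import Counter


def capped_counts(reads, cap=10):
    """Count identical reads, letting each contribute at most `cap` times."""
    return {read: min(n, cap) for read, n in Counter(reads).items()}


def has_3p_end_in_window(relative_3p_ends, lo=36, hi=40):
    """True if any sRNA 3' end lies between positions +lo and +hi (inclusive)."""
    return any(lo <= pos <= hi for pos in relative_3p_ends)
```

Capping at ten retains some information on relative abundance while limiting the influence of very highly sequenced species.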
Annotation of newly identified mirtrons.
The following set of criteria was applied for confident annotation of tailed mirtrons. (i) Multiple sequence reads are detected. (ii) Sequence reads are mapped to both arms of a predicted stem loop structure. (iii) Both the hairpin and one of the sequenced arms precisely flank a splice site. (iv) There are no multiple reads covering the expected Dicer cleavage site. (v) At least one of the arms is detected by deep sequencing of RNA immunoprecipitated with anti-AGO1/2 antibodies. (vi) Lastly, there is no annotation suggesting non-miRNA biogenesis.
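Criteria (i) to (vi) can be combined into a single predicate, sketched here in Python. The field names are hypothetical, "multiple" is interpreted as two or more reads, and each field is assumed to have been computed upstream (for example, from hairpin predictions and read alignments); this is an illustration of the logic rather than the annotation code used in the study.

```python
from dataclasses import dataclass


@dataclass
class MirtronCandidate:
    total_reads: int                  # all reads mapped to the locus
    reads_5p_arm: int                 # reads on the 5' arm of the stem loop
    reads_3p_arm: int                 # reads on the 3' arm of the stem loop
    hairpin_flanks_splice_site: bool  # hairpin end coincides with a splice site
    arm_flanks_splice_site: bool      # one sequenced arm coincides with it too
    reads_spanning_dicer_site: int    # reads covering the expected Dicer cut
    ago_ip_reads_5p_arm: int          # AGO1/2 immunoprecipitation reads, 5' arm
    ago_ip_reads_3p_arm: int          # AGO1/2 immunoprecipitation reads, 3' arm
    conflicting_annotation: bool      # annotation suggesting non-miRNA origin


def is_confident_tailed_mirtron(c):
    """Apply criteria (i)-(vi); all must hold."""
    return (
        c.total_reads >= 2                                             # (i)
        and c.reads_5p_arm >= 1 and c.reads_3p_arm >= 1                # (ii)
        and c.hairpin_flanks_splice_site and c.arm_flanks_splice_site  # (iii)
        and c.reads_spanning_dicer_site < 2                            # (iv)
        and (c.ago_ip_reads_5p_arm >= 1 or c.ago_ip_reads_3p_arm >= 1) # (v)
        and not c.conflicting_annotation                               # (vi)
    )
```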
All RNA sequence data have been deposited in the NCBI Gene Expression Omnibus (GEO) database under accession number GSE29116.
We thank A. Jacquier, A.H. Lund, K. Adelman and members of the T.H.J. and A.S. laboratories for stimulating discussions. The following colleagues are acknowledged for sharing antibodies: J. Lykke-Andersen (Division of Biology, University of California, San Diego), D.L. Black (Howard Hughes Medical Institute, Department of Microbiology, Immunology and Molecular Genetics, University of California, Los Angeles), G.J. Pruijn (Department of Biomolecular Chemistry, Nijmegen Center for Molecular Life Sciences, Institute for Molecules and Materials, Radboud University) and K. Nishikura (The Wistar Institute). This work was supported by the Danish National Research Foundation, the Danish Cancer Society and the Lundbeck Foundation (to T.H.J.) and the EU 7th Framework Programme (FP7/2007–2013)/ERC grant agreement 204135, the Novo Nordisk Foundation, the Danish Cancer Society and the Lundbeck Foundation (to A.S.). E.V. was supported by the Danish Council for Independent Research. P.P. was the recipient of a research grant from the Lundbeck Foundation during part of this work. Work in the laboratory of G.M. was supported by the Bayerisches Staatsministerium für Wissenschaft, Forschung und Kunst (BayGene), the European Union (ERC grant 'sRNAs') and the Deutsche Forschungsgemeinschaft (DFG, Me 2064/2-2 and FOR855). Sequencing was carried out at the Beijing Genome Institute (BGI) in Shenzhen, China.
Supplementary Figures 1–9 and Supplementary Tables 1 and 2
BMC Genomics (2016)