To date, analyses of individual targets have provided evidence of a miRNA targetome that extends beyond the boundaries of messenger RNAs (mRNAs) and can involve non-Watson-Crick base pairing in the miRNA seed region. Here we report our findings from analyzing 34 Argonaute HITS-CLIP datasets from several human and mouse cell types. Investigation of the architectural (i.e. bulge vs. contiguous pairs) and sequence (Watson-Crick vs. G:U pairs) preferences for human and mouse miRNAs revealed that many heteroduplexes are “non-canonical” i.e. their seed region comprises G:U and bulge combinations. The genomic distribution of miRNA targets differed distinctly across cell types but remained congruent across biological replicates of the same cell type. For some cell types intergenic and intronic targets were more frequent whereas in other cell types mRNA targets prevailed. The findings suggest an expanded model of miRNA targeting that is more frequent than the standard model currently in use. Lastly, our analyses of data from different cell types and laboratories revealed consistent Ago-loaded miRNA profiles across replicates whereas, unexpectedly, the Ago-loaded targets exhibited a much more dynamic behavior across biological replicates.
MiRNAs are short non-coding RNAs (ncRNAs) that regulate their target mRNAs in a sequence-dependent manner thereby regulating the expression of the corresponding protein-coding gene1. MiRNAs are the best-studied group of ncRNAs and have been shown to be critical for many biological processes2,3,4,5 and cancers6,7, while exhibiting tissue and cell-state dependent expression profiles8. Ever since the first reported animal heteroduplex2,3, lin-4:lin-14, it has been clear that a portion of the 5′ region of a miRNA plays a central role in the recognition of the miRNA's target. This portion typically spans positions 2–7 from the miRNA's 5′ end and is known as the ‘seed.’ The presence of the seed sequence's reverse complement (i.e. of contiguous Watson-Crick base pairing in the seed region), the localization in the 3′ untranslated region (3′UTR) of a messenger RNA (mRNA) and, occasionally, the conservation of a candidate sequence across genomes have been typical criteria for determining mRNA targets1,2,4,9. In addition to contiguous Watson-Crick base pairing in the seed region, non-standard interactions where the base pairing was interrupted by bulges have also been reported2,4,5,10,11,12,13,14,15,16,17,18,19. Analogously, other reports showed instances of “seed-less” interactions5,15,16,20,21,22,23,24,25,26,27,28,29, targets located outside the 3′UTR5,22,27,28,30,31,32,33, and targets that were not conserved amongst various species5,15,33,34. However, the prevalence of such non-standard interactions as compared to those that are anticipated by the standard model remains unclear.
The advent of CLIP-seq (cross-linked immunoprecipitation followed by next generation sequencing) techniques such as HITS-CLIP35, PAR-CLIP36, and iCLIP37 has helped make great strides towards solving the problem of identifying miRNA targets with higher confidence. Rigorously speaking, CLIP-seq can identify miRNAs and targets that are part of the Ago silencing complex but does not directly establish which miRNA forms a heteroduplex with which target; the recently published CLASH38 is a first attempt towards solving this problem biochemically. Nonetheless, determining the specifics of the heteroduplexes captured in AGO CLIP-seq experiments is possible through additional analysis. Indeed, several such methods have already been developed by others36,39,40,41,42,43,44,45,46,47,48 as well as by us14.
Continuing our earlier work with non-standard heteroduplexes5,15,16,17,26,49 we expanded on our previously reported CLIP-seq analysis method14 and used it to investigate the sequence (i.e. possible presence of one or more G:U pairs) and architectural (i.e. possible presence of a bulge on either the miRNA or the target side) preferences that are present in the seed region of miRNA:target heteroduplexes. The result is a very large collection of computationally predicted interactions across the genome that are derived from seven different cell sources and two organisms. Our analyses included public datasets and CLIP-seq datasets generated in our laboratory from the hTERT-HPNE and MIA PaCa-2 cell lines.
We analyzed a total of 34 Ago CLIP-seq datasets (four human and 30 mouse – Supp. Table 1). As has been pointed out previously50, the HITS-CLIP and PAR-CLIP methodologies generate essentially the same results, an observation we were also able to recapitulate using public samples for which both types of data were available (see Supp. Table 2). In light of this and to ensure uniformity across the processed samples, we limited our analysis to public Argonaute HITS-CLIP (CLIP-seq) datasets only. We followed our previously published approach14 for analyzing CLIP-seq datasets (CLIPSim-MC), which is summarized in Figure 1 (see also Materials and Methods).
The analyzed biological replicates show congruence in the Ago-loaded miRNAs but not in the Ago-loaded targets
In this sub-section we address two questions. First: are the profiles of the top-expressed, Ago-loaded miRNAs concordant across biological replicates from the same cell type/tissue? Second: are the profiles of the statistically significant Ago-loaded targets concordant across biological replicates from the same cell type/tissue? In other words, we inspect the Ago-bound RNA across biological replicates from the same tissue/cell type to determine the extent to which the miRNA:target heteroduplexes remain unchanged. These two questions are of immediate relevance in light of recent data38,51 that suggest the possibility of a dynamic miRNA targetome. In what follows, we use the term “MRE cluster” to refer to genomic segments that do not correspond to any annotated miRNA locus and are delineated by a collection of overlapping reads (see also Methods for a detailed definition of the terms MRE, MRE motif, and MRE cluster). Each MRE cluster comprises at least one ‘miRNA response element’ (MRE) and typically encompasses a multitude of distinct and potentially overlapping miRNA binding sites for different miRNAs.
With regard to the first of these two questions, we find that the most abundant endogenous miRNAs that are loaded on Ago show a high degree of overlap across the biological replicates of a given cell type. Figure 2a shows the Spearman correlations amongst top-expressed miRNA for the analyzed datasets for which biological replicates were available. For all represented cell types, there is a high degree of correlation among the replicates for the top-expressed Ago-loaded miRNAs, indicating a very high consistency in the profile of top-expressed, Ago-loaded miRNAs within these datasets. This concordance is particularly striking in the pairwise comparisons of the replicates from the mouse CD4+ T-cell samples and for all 12 wild type (WT) and 12 miR-155 knockout (KO) samples.
With regard to the second of these two questions, we find that the concordance of the Ago-loaded miRNAs among the replicates does not extend to the MRE clusters. Our results indicate that the Ago-loaded MREs with statistically significant coverage have little overlap across biological replicates (Figure 2b). Additionally, and for each available tissue type in turn, we calculated the positional overlap among all expressed MRE clusters across the biological replicates (Figure 2c). We also calculated this overlap by restricting ourselves to only the significantly expressed MRE clusters (Figure 2d). The point of this exercise was to evaluate the extent of overlap exhibited by the MRE clusters in the biological replicates. In the ideal scenario, the same exact MRE clusters should arise in each biological replicate; however, as Figures 2c and 2d show, this is not the case. We report our calculations in terms of “the number of unique genomic positions that are captured by those MRE clusters and are present in at least n of the biological replicates available for the tissue or cell type at hand.” In Figure 2d we restrict the calculation to using statistically significant MRE clusters only. Clearly, the value of ‘n’ ranges from 1 to the total number of available replicates. Our results show a ten-fold decrease in the number of bases covered by at least two replicates compared to the number of bases covered by at least one replicate. Moreover, we note that the imposition of the statistical significance constraint alone reduces the breadth of genomic coverage by ~10 fold. As we increase the minimum number of replicates in which an MRE cluster is required to occur, the number of bases spanned by the surviving MRE clusters decreases exponentially underlining a dynamic nature in the targeted MREs.
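The "present in at least n replicates" calculation described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function name `bases_in_at_least_n`, the tuple layout `(chrom, strand, start, end)`, and the half-open coordinate convention are all assumptions made for illustration.

```python
from collections import Counter


def bases_in_at_least_n(replicate_clusters):
    """Count unique genomic positions covered by MRE clusters in >= n replicates.

    replicate_clusters: one list per biological replicate, each holding
    (chrom, strand, start, end) tuples with half-open coordinates
    (hypothetical layout, chosen for this sketch).
    Returns a dict mapping n -> number of positions seen in at least n replicates.
    """
    support = Counter()
    for clusters in replicate_clusters:
        covered = set()  # each position counts at most once per replicate
        for chrom, strand, start, end in clusters:
            for pos in range(start, end):
                covered.add((chrom, strand, pos))
        support.update(covered)
    total = len(replicate_clusters)
    return {
        n: sum(1 for hits in support.values() if hits >= n)
        for n in range(1, total + 1)
    }


if __name__ == "__main__":
    replicates = [
        [("chr1", "+", 0, 10)],
        [("chr1", "+", 5, 15)],
        [("chr1", "+", 8, 9), ("chr2", "-", 0, 3)],
    ]
    print(bases_in_at_least_n(replicates))
```

Enumerating individual positions keeps the sketch short; for genome-scale data an interval-based approach would be preferable.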
The high correlation that we observed with the miRNA component of the Ago-loaded miRNA:MRE heteroduplexes indicates that the lack of correlation among Ago-loaded MREs does not reflect a technical issue but rather suggests the existence of a miRNA target repertoire that is highly dynamic and transient in nature, an observation recently reported by others as well38,51. Our finding is further supported by the fact that the replicates show limited Ago-target footprint overlap even when no statistical filtering is applied. An alternative explanation could be a possible dependence of the targeted transcript populations on cell cycle. In light of this observation and given the diversity in the breadth and depth of coverage across replicates, we chose to analyze and apply statistical significance filtering separately to each replicate: had we required that an MRE be present in two or more of the replicates we would have restricted our focus to an artificially small number of bases (evidenced by Figures 2c and 2d) neglecting the information that results from the apparently dynamic nature of the miRNA targetome.
The MRE clusters are spread across all genomic regions
Our analyses reveal that, for most of the analyzed samples, a considerable portion of the statistically significant (p-value ≤ 0.05) MRE clusters are located beyond the exonic space. Indeed, the intergenic portion of the statistically significant MRE clusters ranges between 10 and 25%. The HEK293 samples are an exception with ~45% of the MREs being intergenic (Figure 3). Looking at the data across samples, we find several of the intergenic MRE clusters in lncRNAs (human: 611; mouse: 4,107) and pseudogenes (human: 199; mouse: 2,284) – see Supp. Figures 3 and 4.
The analyzed datasets exhibit a wider variation in their portions of intronic and exonic MREs. In the mouse embryonic stem cell (mESC) samples, a mere 5% of the statistically significant MRE clusters are found in exonic space whereas the majority (~75%) arise from intronic loci. The mouse CD4+ T-cell samples exhibit the opposite behavior: here, the majority (~80%) of the statistically significant MRE clusters derive from exonic space; an additional ~18.5% derive from intergenic space and the remaining ~1.5% from intronic loci. Lastly, the mouse brain replicates, similarly to the mESC ones, exhibit a notable abundance (70%) of intronic MRE clusters: the remaining MREs are evenly divided between exonic and intergenic space. Figure 3 also makes evident that across samples, the majority of exonic MREs arise from 3′UTRs, with coding sequences (CDS) contributing the second highest number of MREs. Lastly, it is important to note that despite the diversity of the MRE clusters among biological replicates (Figures 2b, 2c, and 2d), the replicates exhibit far greater similarity with regard to the subset of the genomic space (i.e. intergenic, intronic, exonic-5′UTR, exonic-CDS, exonic-3′UTR) where the MREs are found (Figure 3).
Most MRE loci can be unambiguously associated with a single miRNA
For each dataset, we considered further only enriched (FDR ≤ 0.05) MRE-motifs (and associated miRNA informed heteroduplex architectures). As described in Methods, we only kept those of the enriched miRNA:MRE-motif pairs, derived from CLIPSim-MC, for which the associated heteroduplex exhibits bonded base pairs beyond the seed region and the RNA folding matches the prescribed architecture from which the MRE-motif was originally derived (Figure 1). As shown in Figure 4, and across all studied datasets, we can unambiguously identify the miRNA participating in a miRNA:MRE heteroduplex for ~70% of all heteroduplexes: i.e. in these cases, the MRE locus is paired with a single targeting miRNA. For an additional ~20% of the formed miRNA:target heteroduplexes the MRE locus is paired with exactly 2 targeting miRNAs. We manually examined the instances where an MRE is paired-up with two or more miRNAs and invariably found that in such cases the targeting miRNAs are paralogous members of miRNA families with known and extensive sequence similarity (e.g. let-7a/b/c/…, miR-29a/b/c, miR-103/107, etc.). This ambiguity is inherent and anticipated given the high sequence similarity among the paralogous members of these miRNA families. In order to generate conservative estimates, for the remainder of our analysis, we will work with only the miRNA:MRE pairs for which the MRE locus is paired with a single endogenous miRNA.
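As a rough illustration of the filtering step described above, the following Python sketch groups enriched miRNA:MRE pairs by MRE locus and keeps only the loci paired with a single miRNA. The function names and the input layout (an iterable of `(mirna, mre_locus)` pairs) are hypothetical and are not taken from CLIPSim-MC.

```python
from collections import Counter, defaultdict


def mirnas_by_mre(pairs):
    """Group targeting miRNAs by MRE locus.

    pairs: iterable of (mirna, mre_locus) tuples (hypothetical layout).
    """
    grouped = defaultdict(set)
    for mirna, mre_locus in pairs:
        grouped[mre_locus].add(mirna)
    return grouped


def multiplicity_histogram(pairs):
    """Map 'number of targeting miRNAs' -> 'number of MRE loci'."""
    return Counter(len(mirnas) for mirnas in mirnas_by_mre(pairs).values())


def unambiguous_pairs(pairs):
    """Keep only the pairs whose MRE locus is paired with exactly one miRNA."""
    return sorted(
        (next(iter(mirnas)), mre_locus)
        for mre_locus, mirnas in mirnas_by_mre(pairs).items()
        if len(mirnas) == 1
    )


if __name__ == "__main__":
    demo = [
        ("miR-124-3p", "locus_1"),
        ("let-7a", "locus_2"),
        ("let-7b", "locus_2"),  # paralogous family members share a locus
        ("miR-155-5p", "locus_3"),
    ]
    print(multiplicity_histogram(demo))
    print(unambiguous_pairs(demo))
```

Dropping the ambiguous loci outright, as done here, mirrors the conservative choice described in the text; it undercounts targets of miRNA families with near-identical members.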
Both standard- and expanded-model miRNA:target heteroduplexes are frequent
Figure 5 shows, for each analyzed dataset, the breakdown of the enriched miRNA-specific sequence/architecture arrangements (FDR ≤ 0.05) for the heteroduplexes in which an MRE is targeted by a single miRNA. In particular, Figure 5a shows for each sample what fraction of the endogenous miRNAs form heteroduplexes with the shown sequence/architecture arrangement in the seed region. Analogously, Figure 5b shows for each sample what fraction of the MRE loci form heteroduplexes with the shown sequence/architecture arrangement in the seed region. Taken together, these two plots highlight the following observation for the final set of heteroduplexes for a given sample: although a specific sequence/architecture choice for the seed region for a miRNA may be proportionally small, this particular choice may be used in forming heteroduplexes with a proportionally larger pool of MREs within the sample. Proportionally, and across all analyzed datasets, the standard model (non-bulged contiguous Watson-Crick base-pairing in the seed region) represents the least abundant category (~3–12%). The most abundant category, in all datasets, corresponds to G:U wobbles in the seed region together with a single miRNA bulge within the seed (30–50%). The second most abundant category comprises G:U wobbles in the seed region and a single bulge on the MRE side of the seed region (20–40%). Non-bulged heteroduplexes with at least one G:U wobble in the seed region account for an additional 15–25% of the cases. Examination of admissible formations that contained a bulge on either the miRNA or the MRE side of the seed region revealed that individual miRNAs have distinct bulge-positioning preferences. However, when we considered all of the heteroduplexes containing a bulge within the seed region of the heteroduplex that survived our analyses we found that bulges were equally likely at all seed region positions of either the miRNA or the MRE.
It is evident that despite the abundance of heteroduplexes containing a bulge within the MRE and at least one G:U wobble, these heteroduplexes correspond to proportionally fewer MREs when compared to the other heteroduplex architectures that we considered.
In concordance with previous reports the 3′UTRs harbor a large portion of all the exonic loci in almost all datasets. In analogy with Figure 5, we examined the distribution of elucidated architectures for 3′UTRs (as well as 5′UTRs and CDSs) and found a prevalence of expanded-model heteroduplexes – see Supp. Figures 5 and 6.
Overlap with other available predictions
Two recent publications studied in detail two mouse miRNAs, miR-124 (ref. 52) and miR-155 (ref. 53), and reported findings on the miRNAs' targeting preferences. These two miRNAs are good test cases as their preferences represent instances of non-standard interactions. In the case of miR-124, only exonic MRE clusters were considered in the original report52; we therefore repeated our CLIPSim-MC simulations for the mouse brain samples by considering only the exonic subset of all MRE clusters instead of genome-wide MRE clusters. As can be seen from Table 1A, the guanine bulge site at seed position 6 on the MRE side that was reported52 is correctly captured through the enriched TGGCCTT MRE-motif. The MRE-motif corresponding to the standard model of targeting, i.e. the one whose seed region sequence is the reverse complement of miR-124-3p's seed, is also among the statistically significant ones, as are several additional MRE-motifs capturing expanded-model interactions (Table 1A). The complete list of exonic preferences (sequence and architecture) for miR-124-3p and the corresponding MREs is available on-line at: https://cm.jefferson.edu/tools_and_downloads/clip_2014/output_exonic/mmu_miR_124_3p.output_exonic_bymiR.txt.
For the mouse miRNA miR-155-5p, all of the enriched MRE-motifs and corresponding formations that result from our analysis are presented in Table 1B. The entries of Table 1B show that in addition to the MRE-motif (GCATTA) that corresponds to the standard model and is enriched across many replicates, we find additional enriched expanded-model formations including several of those previously reported53.
We also performed functional enrichment of miR-155 targets identified by our analysis in order to determine which gene ontology (GO) biological process terms are enriched among the identified mRNA targets of miR-155. The analyses were carried out using DAVID54,55 and only those biological processes with an FDR ≤ 0.05 were considered further (Supp. Table 3). Our results indicate that miR-155 targets mRNAs significantly involved in transcriptional regulation, cell fate and differentiation as well as several other immune related processes. Our results are consistent with previous CLIP-seq based findings53, and with the relevant T-cell literature56,57.
Of the available public repositories of CLIP-seq analyzed data47,58, Starbase47 makes its predictions available in a manner that permits direct comparisons. Starbase contains 601,189 human and 111,809 mouse target predictions with ~93% of these predictions being located in 3′UTR space. We note here that several of the target-site prediction algorithms that Starbase uses report only standard-model targets, which leads to an over-representation of such formations in the Starbase pool of data. Consequently, for this comparison, we focused on the 3′UTR and canonical subset of our predictions. We find that 76.5% of our human standard-model predictions (2,355) and 42.3% of our mouse standard-model predictions (12,047) are identical to those reported in Starbase (Supp. Table 4). The difference is due to the fact that Starbase reports many more human targets, and we report many more mouse targets due to the specifics of the samples analyzed: note that for the mouse genome, we report nearly 5 times as many statistically significant heteroduplexes as we do for our human predictions.
MiRNAs can have many distinct targets in a given cell type
The analyzed datasets were obtained from diverse sources and allow us to shed some light on the question of how many mRNAs are targeted by a miRNA. We re-emphasize that in what follows, we focus only on MREs for which we can identify a single targeting miRNA within the corresponding dataset and as such our estimates are conservative and represent lower bounds of the true number of targets that a miRNA can have.
After processing each of the datasets separately, we formed the union of these miRNA:MRE interactions across the replicates of each cell type. From the pooled set of data, we find that a notable portion of the top-expressing miRNAs have hundreds of distinct targets each. As shown in Supp. Fig. 1 the magnitude of the miRNA targetome differs across the five tissues that we analyzed. In mouse brain, ~40% of the 106 top-expressing miRNAs have at least 55 distinct targets each; in mESC, ~40% of the 165 top-expressing miRNAs have at least 140 distinct targets each; in wild-type mouse CD4+ T-cells, ~40% of the 177 top-expressing miRNAs have at least 415 distinct targets each; in mmu-miR-155 KO CD4+ T-cell samples, ~40% of the 164 top-expressing miRNAs have at least 200 distinct targets each; in the hTERT-HPNE and MIA PaCa-2 cell lines, ~40% of the 160 top-expressing miRNAs have at least 15 distinct targets each; and, in HEK293 cells, ~40% of the 67 top-expressing miRNAs have more than 5 distinct targets each. These data, taken together with the results of Figure 3, suggest that each miRNA has a large repertoire of cell-type specific targets. The findings also indicate that a given endogenous miRNA can have a rather distinct targetome within a given tissue, with a given endogenous miRNA targeting many MREs in one tissue type and fewer in another. As an example let us consider miR-18a-3p, a member of the miR-17/92 oncogenic cluster that is conserved across vertebrates59,60,61. MiR-18a-3p is associated with 917 unique MREs across all mESC samples, 253 unique MREs across all mouse CD4+ T-cells, 33 unique MREs across all miR-155 KO mouse CD4+ T-cells, and 25 MREs in the hTERT-HPNE/MIA PaCa-2 samples. On the other hand it is not associated with any targets within the mouse brain samples or in the HEK293 cell line samples.
To appreciate how many distinct MREs may be targeted by a single miRNA across different tissues of the same organism we formed the union of miRNA:MRE interactions we obtained from the three analyzed mouse cell types (30 datasets) and the two human cell types (4 samples corresponding to three cell lines) respectively. We only considered MREs with an unambiguously determined targeting miRNA across all mouse samples and find 228,688 unique MREs that are targeted by 294 unique miRNAs through 233,364 unique interactions. For the human samples, we find 7,851 unique MREs targeted by 197 unique miRNAs through 7,866 unique interactions. In Figure 6a, we consider only MREs that have been associated with a single miRNA in our analysis of human samples to derive the count of miRNAs (primary Y-axis) that are associated with a given number of predicted targets (X-axis). All analyzed human samples are considered for this purpose. The secondary Y-axis shows the cumulative distribution of the expressed human miRNAs that are associated with a given number of predicted targets. This histogram is meant to provide estimates for the number of distinct targets that an endogenous miRNA can have across tissues/cell types. Figure 6b shows the same histogram for the analyzed mouse tissues/cells. For the mouse datasets, more than 20% of the 294 analyzed miRNAs have more than 1,000 distinct targets each. These findings demonstrate that numerous discrete MRE loci are unambiguously associated with a putative targeting miRNA. In the Supplement, we also address and present results for a related question namely how many distinct MREs can a given miRNA target in an mRNA.
On-line exploration of the data
The complete data (in both miRNA-centric and genome-centric views) for each analyzed human and mouse miRNA for all 34 datasets are available for interactive exploration on-line at https://cm.jefferson.edu/clip_2014/. The data have been compiled in two different ways. First, we provide a miRNA-centric view: for each miRNA, we present the sequence of the targeted MRE-motif, and the corresponding p-value and FDR for each analyzed sample (one per sample). Also stated is the resulting MRE-formation, e.g. G:U wobbles, MRE-side bulge, miRNA-side bulge, etc. The second view is genome-centric and is meant to acknowledge the increasing realization that miRNAs target numerous transcripts, protein-coding as well as non-coding RNA. In this case, we list the genome identifier, chromosomal location of the strand where the MRE is found, cell type in which the interaction is encountered, identity of the replicate supporting the target, p-value and FDR for the MRE motif, identity of the targeting miRNA, and the Gibbs free energy of the associated heteroduplex. For those miRNAs that are not among the 34 analyzed CLIP-seq datasets, we make available version 2.0 of the rna22 method17 at https://cm.jefferson.edu/rna22v2/.
Through our analysis of 34 independent CLIP-seq samples, we identified computationally predicted, high confidence, statistically enriched seed-region formations and full-length heteroduplexes. With regard to the location of the miRNA targets our analysis shows that many statistically significant MREs are present in exonic space, which is expected, with the rest of them located in intergenic and intronic regions. The portion of exonic MREs was consistent across biological replicates while it ranged from sample to sample: from 20–40% in HEK293 cells and mouse brain to ~75% in mouse CD4+ T-cells. The three mESC datasets represented an exception to these findings in that nearly two thirds of the statistically significant MREs were located in introns. Among the exonic MREs, approximately half were located in the 3′UTRs.
Additionally, we examined the specifics of the architecture (presence or absence of bulges) and sequence (presence or absence of G:U wobbles) preferences for the statistically significant heteroduplexes. In concordance with earlier findings, our analysis of these heteroduplexes revealed a biologically diverse miRNA targetome comprising MREs that participate in both standard and expanded seed-region formations with the targeting miRNA. The expanded formations include various combinations of G:U wobbles and single nucleotide bulges within the seed-region of the heteroduplex and outnumber the standard formations. Moreover, we found that many of the endogenous top-expressed miRNAs of a given sample exhibited concrete non-standard targeting preferences that were cell-type specific. Looking across all samples, approximately one third of the statistically significant MREs participated in standard seed-region interactions (contiguous Watson-Crick base pairing). Formations that involved a single bulge on the miRNA side of the heteroduplex as well as the presence of at least one G:U wobble represented another abundant category.
We also considered the profiles of Ago-loaded miRNAs and targets. With regard to the top-abundant endogenous miRNAs, we found them to be consistently present across the replicates of a given sample. Somewhat surprisingly, the profiles of the MREs exhibited a more dynamic behavior across the replicates of the same cell type. Interestingly, the breakdown of MRE locations across intergenic-intronic-5′UTR-CDS-3′UTR space was preserved across the replicates even though the exact target locations were not. These observations held true for all considered cell/tissue types and for both human and mouse, suggesting that a complex and dynamic process is at play. These results add to the growing evidence that a set of highly expressed miRNAs regulate a dynamic pool of MREs transcribed from across the genome38,42,47,62,63,64.
Our findings also shed some light on the number of distinct transcripts that can be targeted by a miRNA. Indeed, we found evidence for a rich target repertoire for many miRNAs, a repertoire that can comprise hundreds of distinct targets for a given miRNA in the same cellular context. The number of distinct targets for a given miRNA increases further when one considers the miRNA's targetome across cell types.
We conclude by commenting on one more ramification of our results from the standpoint of miRNA-effected regulation. The apparent abundance of non-protein-coding miRNA targets in conjunction with the finding that several miRNAs can have many targets and that an mRNA can be targeted by many miRNAs simultaneously provides additional support to the concept of miRNA sequestration65,66 and competing endogenous RNAs (ceRNAs)67. The diversity of involved genomic transcripts and the large number of promiscuous miRNAs encountered in each of the five cell types indicate that a large number of ways exist in which sequestering of miRNAs by sponges and ceRNAs through target decoying can regulate protein-coding transcripts.
Cell culture, Ago HITS-CLIP and RNA-sequencing
The hTERT-HPNE and MIA PaCa-2 cell lines were obtained from the American Type Culture Collection (Manassas, VA) and from Dr. Jonathan Brody, and propagated in Dulbecco's Modified Eagle Medium supplemented with 10% fetal bovine serum and 1% penicillin/streptomycin (Cellgro, Manassas, VA). Ago HITS-CLIP was performed as described previously with modifications to increase stringency68. Briefly, cells were grown to 70% confluency, washed once with PBS and UV irradiated at 254 nm for a total energy dispersion of 600 mJ/cm² (Spectroline, Westbury, NY). RNA digestion was carried out as per Hafner et al.36. Cell lysates were treated initially with RNase T1 at a concentration of 1 U/μl for 15 minutes at room temperature in PXL buffer prior to co-immunoprecipitation of RNA-protein complexes on protein A Dynabeads (Life Technologies) using the pan-Ago antibody 2A8 for 4 hours at 4°C (Millipore, Billerica, MA). Beads were then washed twice with PXL buffer and subjected to a secondary, complete RNA digestion with 100 U/μl of RNase T1 for 15 minutes at room temperature. Following complete digestion, CLIP-RNAs were liberated from their on-bead protein complexes by treatment with 4 mg/ml proteinase K and subsequent phenol/chloroform extraction as described earlier35. CLIP-RNA libraries were constructed using the small RNA library preparation protocol as described above. All libraries were sequenced on Applied Biosystems 5500XL sequencers (Life Technologies).
Definitions of “MRE” and “MRE motif”
The term miRNA response element or MRE was originally coined to capture the full span of a miRNA target69 and not just the target's six-nucleotide-long seed region. Since then, the term has been overloaded and is also used to refer to the target's nucleotide stretch that interacts with the seed region of the targeting miRNA. In what follows, we use the more specific term “MRE-motif” to refer to the portion of the target opposite the miRNA's seed region (positions 2–7 inclusive) and use “MRE” to refer to the full-length miRNA target. Also, we will use the term “formation” to refer to an arrangement of the base pairs in the seed region that comprises any combination of sequence (Watson-Crick pairs or G:U wobbles) and architecture (bulge or no bulge). Finally, we use the term “heteroduplex” to refer to miRNA:target interactions that span the full length of the targeting miRNA (as opposed to only the seed region). In all of our analyses, we use the string of the MRE as a reference string; as such we need to introduce notation that will allow us to indicate the presence and location of bulges on either the MRE or the miRNA side, and of G:U wobbles. To this end, we use a ‘.’ to denote a seed-region bulge on the side of the MRE-motif (target). For example, TG.CCTT indicates that the nucleotide of the MRE-motif that occupies the ‘.’ position, e.g. G in 5p → TGGCCTT → 3p, will be unpaired. Analogously, we use a ‘~’ to denote a seed-region bulge on the side of the miRNA. For example, TG~CTT indicates that the nucleotide of the miRNA that occupies the position across the ‘~’ symbol, e.g. G in 3p ← ACGGAA ← 5p, will be unpaired. To denote the potential of G:U wobbles forming we use bracketed expressions: e.g. the last four positions of 5p → TGG[CT][CT][CT][CT] → 3p.
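To make the notation above concrete, the following Python sketch expands a bracketed MRE-motif pattern into the sequences it stands for and reports which markers a pattern contains. The function names `expand_motif` and `describe_formation` are hypothetical; only the meaning of ‘.’, ‘~’ and the brackets is taken from the definitions above.

```python
import itertools
import re

# One token is a bracketed set such as [CT], a single nucleotide,
# or one of the two bulge markers ('.' = MRE-side, '~' = miRNA-side).
TOKEN = re.compile(r"\[[ACGT]+\]|[ACGT.~]")


def expand_motif(pattern):
    """Expand bracketed alternatives; bulge markers are kept verbatim."""
    tokens = TOKEN.findall(pattern)
    if "".join(tokens) != pattern:
        raise ValueError(f"unrecognized characters in pattern: {pattern!r}")
    choices = [t[1:-1] if t.startswith("[") else t for t in tokens]
    return ["".join(combo) for combo in itertools.product(*choices)]


def describe_formation(pattern):
    """Report which notation elements appear in a pattern."""
    return {
        "mre_side_bulge": "." in pattern,
        "mirna_side_bulge": "~" in pattern,
        "possible_wobble": "[" in pattern,
    }


if __name__ == "__main__":
    expanded = expand_motif("TGG[CT][CT][CT][CT]")
    print(len(expanded), expanded[:3])
    print(describe_formation("TG.CCTT"))
    print(describe_formation("TG~CTT"))
```

Each bracketed position doubles the number of concrete sequences, so the four bracketed positions in the example pattern yield sixteen sequences, one of which is TGGCCTT.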
Preprocessing of raw reads and sequence mapping
In addition to our in-house samples, we also analyzed 32 publicly available CLIP-seq samples that were precipitated using monoclonal antibodies against Argonaute 2 from four distinct studies that represent four cellular phenotypes35,50,52,70. Following adapter sequence removal and quality trimming with the help of cutadapt71, reads were mapped to their respective reference genome (human-hg19, mouse-NCBIM37) using SHRIMP272. Only reads that could be placed unambiguously on the genome by allowing up to 4% mismatches (replacements only – no insertions or deletions were permitted) were considered in the subsequent analyses (Supp. Tab. 1).
Selecting miRNAs and MREs
We used the reads that mapped to the mature miRNA sequences (human and mouse) listed in Rel. 20 of miRBase73 to generate endogenous miRNA profiles for each analyzed CLIP-seq sample. We identified the top-expressed miRNAs on a per-sample basis by keeping only those miRNAs whose abundance was within 10 PCR cycles (a ratio of 1:1024) of the sample's most abundant miRNA. Unambiguously mapped reads that did not map to miRNA loci were taken to pinpoint MREs and were merged into “MRE clusters.” MRE clusters are thus defined by overlapping reads that do not map to any annotated miRNA locus and may contain multiple target sites for a variety of miRNAs. We required that each MRE cluster comprise a minimum number of overlapping reads before it could be selected for subsequent analysis: this minimum was determined by adapting a previously reported method74 and was computed in a sample-specific manner that takes into account the depth of sequencing. Only statistically significant MREs (p-value ≤ 0.05) were kept for further processing. Considering the reported time-dependence of miRNA targeting among biological replicates38,42,62, and in order to be comprehensive in our characterization of the analyzed samples, we identified and analyzed MRE clusters separately for each sample (see also Results, Figure 2, and Supp. Tab. 1). For the three mouse brain samples35, MRE clusters were formed from the 130 kDa sample set only. The remaining CLIP-seq datasets included three biological replicates from mouse embryonic stem cells (mESCs)70, 12 wild-type and 12 miR-155 knockout (KO) replicates from mouse CD4+ T-cell samples53, two biological replicates from human embryonic kidney (HEK293) cells50, and two CLIP-seq datasets that we generated from the hTERT-HPNE and MIA PaCa-2 cell lines (SRP034075).
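The two selection steps above can be sketched as follows. This is an illustration only: the function names, the tuple layout of a read, the strand-aware merging, and the fixed `min_reads` placeholder are all assumptions. In particular, the actual minimum read count is derived per sample with the adapted statistical method74, which is not reproduced here.

```python
def top_expressed(counts, max_cycles=10):
    """Keep miRNAs whose read count is within 2**max_cycles of the most abundant one."""
    if not counts:
        return {}
    cutoff = max(counts.values()) / 2 ** max_cycles  # 10 cycles -> ratio of 1:1024
    return {name: n for name, n in counts.items() if n >= cutoff}


def merge_into_clusters(reads, min_reads=1):
    """Merge overlapping reads into clusters.

    `reads` holds (chrom, strand, start, end) tuples with half-open coordinates.
    Returns (chrom, strand, start, end, n_reads) tuples with n_reads >= min_reads.
    """
    clusters = []
    for chrom, strand, start, end in sorted(reads):
        last = clusters[-1] if clusters else None
        if last and last[0] == chrom and last[1] == strand and start < last[3]:
            last[3] = max(last[3], end)  # extend the open cluster
            last[4] += 1
        else:
            clusters.append([chrom, strand, start, end, 1])
    return [tuple(c) for c in clusters if c[4] >= min_reads]


counts = {"miR-a": 1024000, "miR-b": 1000, "miR-c": 999}
print(sorted(top_expressed(counts)))

reads = [("chr1", "+", 100, 130), ("chr1", "+", 120, 150), ("chr1", "+", 200, 230)]
print(merge_into_clusters(reads, min_reads=2))
```

The sketch treats the 1:1024 cutoff as inclusive, which is an assumption; the text does not state how ties at the boundary are handled.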
Enumerating standard- and expanded-model seed-region formations
For each endogenous miRNA expressed in a given sample, we enumerated the following putative MRE-motif variants: a) the exact reverse complement of the miRNA's 6-nt seed region (this is the standard-model MRE-motif); b) all possible variants of the reverse complement that would necessitate that one or more G:U wobble base pairings, but no bulge, be formed if the corresponding heteroduplex were realized; c) all possible variants of the reverse complement that would require a single bulge on the miRNA side, but no G:U wobbles, if the corresponding heteroduplex were realized; d) all possible variants of the reverse complement that would require a single bulge on the MRE side, but no G:U wobbles, if the corresponding heteroduplex were realized; e) all possible variants of the reverse complement that would facilitate a single bulge on the MRE side of the potential heteroduplex with at least one G:U wobble within the seed; and f) all possible variants of the reverse complement that would facilitate a single bulge on the miRNA side of the putative heteroduplex in combination with at least one G:U wobble base pair within the seed region (Figure 1). In the presence of a single-nucleotide bulge, the MRE-motif will span five nucleotides (if the bulge is on the miRNA side) or seven nucleotides (if the bulge is on the target side). Because of this enumeration, these candidate formations include both standard-model and expanded-model arrangements; also, because of the way our method arrives at these candidates we obviate any biases that could have been introduced by the use of target prediction tools to generate miRNA:target candidates from CLIP-seq data35,42,43,44,46,47,48,50,75,76. We refer to heteroduplexes that fall in cases b) through f) inclusive as instances of an “expanded model” of miRNA targeting.
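The enumeration of cases a) through f) can be sketched as below. This is a simplified illustration and not the code used in the study. It assumes that a bulge may only occur at internal positions of the seed region, that a target-side bulge is written with ‘.’ as a placeholder for any nucleotide, and that G:U wobbles correspond to the target-side substitutions C → T (opposite a miRNA G) and A → G (opposite a miRNA U). The function names are hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}
# Target-side substitutions that turn a Watson-Crick pair into a G:U wobble.
WOBBLE = {"C": "T", "A": "G"}


def canonical_site(seed):
    """Reverse complement of the miRNA seed, written 5p -> 3p in the DNA alphabet."""
    return "".join(COMPLEMENT[b] for b in reversed(seed.upper()))


def wobble_variants(site):
    """All variants of `site` that carry at least one G:U wobble substitution."""
    options = [(b, WOBBLE[b]) if b in WOBBLE else (b,) for b in site]
    return {"".join(p) for p in product(*options)} - {site}


def enumerate_formations(seed):
    """Return the MRE-motif variants for cases a) to f), keyed by case letter."""
    site = canonical_site(seed)
    inner = range(1, len(site) - 1)  # miRNA nucleotides allowed to bulge (assumption)
    gaps = range(1, len(site))       # gaps between adjacent paired target positions
    out = {"a": {site}, "b": wobble_variants(site)}
    out["c"] = {site[:i] + "~" + site[i + 1:] for i in inner}
    out["d"] = {site[:i] + "." + site[i:] for i in gaps}
    out["e"] = {w[:i] + "." + w[i:] for w in wobble_variants(site) for i in gaps}
    out["f"] = set()
    for i in inner:
        rest = site[:i] + site[i + 1:]  # target bases that remain paired
        out["f"] |= {w[:i] + "~" + w[i:] for w in wobble_variants(rest)}
    return out


formations = enumerate_formations("AAGGCA")
for case in "abcdef":
    print(case, sorted(formations[case]))
```

For the seed AAGGCA, the standard-model MRE-motif is TGCCTT, and the enumeration recovers the two example formations used in the notation above, TG~CTT (case c) and TG.CCTT (case d). In case f) the wobbles are enumerated only over the target bases that remain paired, so that a substitution opposite the bulged miRNA nucleotide is not counted as a wobble.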
Statistical enrichment of seed-region formations (CLIPSim-MC)
The observed counts for each observed seed-region formation were calculated by finding the number of instances of the variant within the pool of MRE clusters. The expected count distribution for each observed seed-region formation was determined by carrying out a Monte-Carlo simulation in which each observed MRE-motif is queried against a pool of representative read-pileups from the original MRE clusters. In each iteration of the simulation, a randomly generated sequence with the same read-weighted base composition, and the same length and average coverage is generated for each significantly expressed MRE cluster. This pool of simulated CLIP-seq reads is then used to generate an expected count distribution for each MRE-motif with a non-zero observed count value. The total number of expected counts for iteration i is the cumulative number of reads present within the pool of simulated reads that harbor the MRE sequence of the seed-region formation (expected count ci for miRNA j). The process was carried out one million times for each enumerated seed-region formation in turn in order to build a distribution of expected occurrences for the MRE-motif. The p-values for the enumerated seed-region formation were then calculated by fitting the expected count distribution of the variant with a negative binomial distribution (Supp. Fig. 2). Multiple test correction was performed using the Benjamini-Hochberg procedure and only those MRE-motifs with an FDR ≤ 0.05 were deemed to be significant and kept for further analysis. To enable a direct comparison between our work and those earlier efforts in which only exonic MRE clusters were considered and analyzed from the standpoint of miRNA targeting formations for miR-12452, we repeated our Monte-Carlo simulations for the mouse brain samples considering only the set of exonic MRE clusters (instead of the full genome-wide set of MRE clusters). 
To this end, we first identified the MREs that are located within exonic regions and recomputed MRE significance. Then, we sub-selected and processed only statistically significant MREs (p-value ≤ 0.05). During the shuffling phase of the Monte Carlo simulation, the sub-selected exonic MREs could only be repositioned (shuffled) to other exonic regions within the mouse genome.
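The overall logic of the enrichment test can be sketched as follows, with two deliberate simplifications that depart from the procedure described above: the sketch counts motif occurrences in one simulated sequence per cluster rather than in a coverage-weighted pool of simulated reads, and it estimates the p-value empirically from the simulated counts instead of fitting a negative binomial distribution. All function names and the cluster representation are hypothetical.

```python
import random


def count_occurrences(motif, sequence):
    """Count (possibly overlapping) occurrences of `motif` in `sequence`."""
    count, start = 0, sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)
    return count


def simulate_expected_counts(motif, clusters, n_iter=1000, seed=0):
    """Simulate motif counts; `clusters` is a list of (length, {base: weight})."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_iter):
        total = 0
        for length, composition in clusters:
            bases, weights = zip(*sorted(composition.items()))
            simulated = "".join(rng.choices(bases, weights=weights, k=length))
            total += count_occurrences(motif, simulated)
        totals.append(total)
    return totals


def empirical_p_value(observed, expected_counts):
    """Fraction of simulations with a count at least as large as the observed one."""
    hits = sum(1 for c in expected_counts if c >= observed)
    return (hits + 1) / (len(expected_counts) + 1)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted, running_min = [0.0] * m, 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        adjusted[k] = running_min
    return adjusted


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
expected = simulate_expected_counts("TGCCTT", [(40, uniform)] * 50, n_iter=200)
print(empirical_p_value(3, expected))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

MRE-motifs whose adjusted value is at most 0.05 would then be retained. For the exonic-only analysis, the same scheme would apply with the simulated pool restricted to exonic MRE clusters.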
Selecting among ‘competing’ seed-region formations
Since our analysis transcends the standard model, on occasion we may find multiple seed-region formations competing for the same MRE. For example: miRNA X may match a given segment of an MRE using a variant containing multiple G:U wobbles in the seed region whereas miRNA Y may match the exact same segment of the same MRE using a variant that incorporates a bulge in the seed region. We resolve such conflicts with a multi-tiered approach. First, we filter the candidate seed-region formations using their associated False Discovery Rate (FDR): only variants with FDR ≤ 0.05 are considered significant. Second, we take into account the part beyond the seed of the miRNA that competes for a given MRE and examine how well and how extensively the full-length candidate miRNA base-pairs with the region that is adjacent and immediately upstream of the MRE at hand. To this end, we form full-length miRNA:target heteroduplexes using the sequence of each miRNA and a 25-nt stretch of the genome whose 3′ end extends one nucleotide past the 6-nt segment of the MRE-motif at hand using the Vienna package77. On the output of the co-folding we impose two additional constraints: first, we discard heteroduplexes whose Vienna-derived seed-region interactions do not match the sequence composition and architecture that are expected by the seed-region formation being considered; and, second, we discard heteroduplexes that contain instances of self-hybridization or comprise fewer than 12 base pairs. Results obtained from biological replicates of the same cellular phenotype were pooled together and duplicate entries removed.
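The second of the two post-folding constraints can be sketched as a filter on a co-folded structure. The sketch assumes that the structure is available as a dot-bracket string in which ‘&’ separates the two strands, as co-folding tools commonly report it; it does not call the Vienna package itself, and it does not implement the first constraint (agreement between the folded seed region and the expected formation). The function names are hypothetical.

```python
def parse_pairs(structure):
    """Return (i, j) index pairs for matching brackets in a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs


def keep_heteroduplex(structure, min_pairs=12):
    """Keep a duplex only if it has no intra-strand pairs and enough inter-strand pairs."""
    cut = structure.index("&")  # boundary between the two strands
    pairs = parse_pairs(structure)
    inter = [p for p in pairs if p[0] < cut < p[1]]
    no_self_hybridization = len(inter) == len(pairs)
    return no_self_hybridization and len(inter) >= min_pairs


duplex = "(" * 13 + "..." + "&" + "..." + ")" * 13
hairpin = "((...))" + "(" * 13 + "&" + ")" * 13
print(keep_heteroduplex(duplex), keep_heteroduplex(hairpin))
```

A pair counts as inter-strand when its two indices lie on opposite sides of the ‘&’ separator; any other pair is treated as self-hybridization and causes the heteroduplex to be discarded.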
Data Availability
The sequence data for the hTERT-HPNE and MIA PaCa-2 HITS-CLIP are available on GEO under accession # SRP034075.
The authors wish to thank Eleftheria Hatzimichael for helpful feedback and stimulating discussions over the course of this project. This research was supported in part by the William M. Keck Foundation (IR), the Hirshberg Foundation for Pancreatic Cancer Research (IR and JB), NIH-NIAID (2U19AI056363-06/2030984 to IR), by institutional funds, and in part by a grant to IR from the Pennsylvania Department of Health, which specifically disclaims responsibility for any analyses, interpretations or conclusions. The research was also supported by the National Cancer Institute of the National Institutes of Health under Award Number P30CA056036. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder in order to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/