Mapping RNA–RNA interactome and RNA structure in vivo by MARIO

The pervasive transcription of our genome presents a possibility of revealing new genomic functions by investigating RNA interactions. Current methods for mapping RNA–RNA interactions have to rely on an ‘anchor' protein or RNA and often require molecular perturbations. Here we present the MARIO (Mapping RNA interactome in vivo) technology to massively reveal RNA–RNA interactions from unperturbed cells. We mapped tens of thousands of endogenous RNA–RNA interactions from mouse embryonic stem cells and brain. We validated seven interactions by RNA antisense purification and one interaction using single-molecule RNA–FISH. The experimentally derived RNA interactome is a scale-free network, which is not expected from currently perceived promiscuity in RNA–RNA interactions. Base pairing is observed at the interacting regions between long RNAs, including transposon transcripts, suggesting a class of regulatory sequences acting in trans. In addition, MARIO data reveal thousands of intra-molecule interactions, providing in vivo data on high-order RNA structures.

. Each RNA1-Linker-RNA2 type of chimeric RNA was plotted with the RNA1 and the RNA2 fragments mapped to the respective genomic regions, connected by an oblique line representing the linker. Red and blue blocks represent the "peaks" of overlapping MARIO reads, which were candidate RNA interaction sites. A semi-transparent polygon connecting two RNA interaction sites (red or blue blocks) represents a strong interaction. (C) A global view of the RNA-RNA interactions. The read densities of the RNA1 and the RNA2 fragments were shown in purple and blue tracks, respectively, inside chromatin cytoband ideogram. Each identified RNA-RNA interaction was shown as a curve connecting the genomic loci of the two RNAs, and colored by the types of the interacting RNAs. The FPKM values were obtained by mapping raw reads from mouse ENCODE dataset ENCSR000CWC (paired-end RNA-Seq from E14 mouse ES cells) with bowtie2-2.2.4 against mm9, followed by processing with cufflink 2.2.1. All the genes with unique Ensembl IDs that were found in both ENCSR000CWC data and our mouse ES cell data are included in panels (C) and (D). Supplementary Figure 20. Comparison of the conservation levels. Conservation levels were quantified by the average PhyloP score per nucleotide of the interaction sites (y axis). To adjust for the difference of conservation of exons, introns, and UTRs, the interaction sites (red) in annotated exons, introns, and UTRs (dubbed genomic features) were compared to 200,000 randomly sampled genomic sequences from the same genomic feature (blue). The sizes of the randomly sampled genomic sequences shared the same mean and variation as the sizes of interaction sites. P-values were calculated from one-sided two-sample t-test. **: p-value <10 -12 ; *: p-value < 10 -6 .
Supplementary Figure 21. Correlation of RNase I digestion density and single-stranded regions. The frequency of digestion measured by the number of read fragments ending or starting at each position (y axis) was compared to known secondary structure (fRNAdb database v3.4) (x axis). Brackets on the x axis represent double-stranded regions. The total counts of read fragments ending or starting at each position in single-stranded (ss) and double-stranded (ds) are summarized on the right panels.
Supplementary Figure 22. Intramolecular ligations. (A) An intramolecular (self) ligation was generated by RNase I digestions of a transcript followed by a linker ligation and a proximity ligation. Therefore, the two RNA fragments on the two sides of the linker came from the same RNA molecule. These intramolecular ligation events were identified with stringent bioinformatic criteria, filtering out pair-end reads that could have been generated from a consecutive transcript. The pair-end reads that could only been generated by a cut-and-ligation process were used for RNA structure analysis.

Supplementary Note 1. Control experiments for MARIO.
The first control experiment skipped the cross-linking step in the procedure. The second control experiment skipped the protein biotinylation step. The third control experiment carried out the entire procedure on the mixed cell lysate of mouse ES cells and Drosophila S2 cells.
First, we carried out a non-cross-linking control with approximately 3×10 8 mouse ES cells. The RNAs immobilized with proteins on streptavidin beads were purified by protein digestion as previously described. The purified RNAs were subjected to quantification by Qubit RNA HS assay (Invitrogen). The RNAs were below the detection limit of the assay (250 pg/μl). Our sample volume was 20 μl (the same as previously described), which suggests that the RNA abundance was no more than 5 ng. At this point, we stopped the experiment because there was no chance to accomplish linker selection and library construction. In our previously described experiments, the purified RNAs would be in the μg range at this step.
Second, we did another control by not doing protein biotinylation (keeping cross-linking) with 3×10 8 mouse ES cells. It turned out the RNAs purified from the beads were below the detection limit of Qubit RNA HS assay.
Third, we started with 3×10 8 Drosophila S2 cells and 3×10 8 mouse ES cells (cross-species control). The cells were cross-linked and lysed. The lysate from the two cell lines were mixed before protein biotinylation and proximity ligation. The mixture was subjected to the rest of the experimental procedure to produce a sequencing library (Fly-Mm). Fly-Mm contained 27,748,688 read pairs. After removing duplicate reads, 17,330,193 read pairs remained. After splitting by the linker sequence, 3,550,225 read pairs remained, which had the RNA1-RNA2 configuration. Among them, 86,826 (2.5% of 3,550,225) had the two RNA parts mapped to different genomes (RNA1 mapped to dm6 and RNA2 to mm9 and vice versa). Thus, 2.5% of the ligation products were estimated to be generated from random ligations. Furthermore, we asked if this estimate would be affected by assembling the two genomes into a pan-genome (dm9 and mm9) before mapping. A total of 2,697,115 read pairs in the RNA1-RNA2 configuration could be unambiguously mapped to the pan-genome 1,2 , among which 184,380 (6.8%) had one RNA part uniquely mapped to the dm9 fraction and the other RNA part uniquely mapped to the mm9 fraction. We chose the more conservative estimate (from the pan-genome method), that 6.8% of the ligation products were generated by random ligations.

Data synthesis.
In order to estimate the sensitivity and specificity of MARIO, including its experimental and computational procedures, we carried out a simulation analysis. We simulated 1,000,000 pair-end reads by computationally mimicking the data generation process. The parameters used for the simulation were derived from real data. The simulated data generation process is as follows.
For each pair-end read (2 × 100 bases), we: 1. Choose a sample barcode from the four sample barcodes with equal probabilities and concatenate it with a 6nt random barcode (as in Supplementary Figure 26A Figure 26C). 3. If this read-pair was assigned to a linker-containing type, randomly choose 1 or 2 linkers with equal probability. We note that a small percentage of linker-containing read-pairs contained 2 linkers; the use of equal probability was a conservative choice for estimating worst cases. 4. Generate the sequences for the RNA1 and the RNA2 parts, according to the cDNA type determined in Step 2. For both RNA1 and RNA2, a. simulate its length from , b. choose an RNA type from ["miRNA", "mRNA", "lincRNA", "snoRNA", "snRNA", "tRNA"] based on the following probabilities: producing a synthetic cDNA sequence. 6. If the synthetic cDNA in Step 5 is 100bp or longer, take the 100 bases from the two ends of the synthetic cDNA in forward and reverse strands respectively. 7. If the synthetic cDNA in Step 5 is shorter than 100bp, assign its forward and reverse strands as the forward and the reverse reads, and concatenate P5 and P7 primer sequences to the two reads. 8. Simulate sequencing errors with a rate of 0.01 on each base 3 .
Steps 1 -5 simulated a cDNA sequence according the experimental procedure, and steps 6 -8 simulated a pair-end read based on this cDNA sequence. The simulated interacting RNA pairs, as well as the cDNA type and the length of each part (RNA1, linker, and RNA2, if applicable) were kept for comparison with the computational predictions.

Evaluation of intermediate and final results.
We used the synthetic data to evaluate the sensitivities and specificities of two intermediate analysis steps, as well as the final predictions.
First, we compared the program-identified cDNA lengths (output of Step 3 of MARIO-Tools) to the actual (synthesized) lengths (Supplementary Table 5). This step "3. Recovering the cDNAs in the sequencing library" assigns each cDNA into four types with respect to their lengths, namely Type 1 (<100 bp); Type 2 (100~200 bp); Type 3 (>200 bp); Type 4 (unknown) (Supplementary Figure 27). The algorithm achieved high sensitivity and specificity for identifying each type. Only very few (0.58%) of the cDNAs shorter than 200bp were identified as longer than 200bp. These errors were due to a small overlap (typically between 0 and 5 bps) of the forward and the reverse reads, which were not detected by the local alignment.
When the program identified length was shorter than 200 bp (Types 1 and 2), the exact length could be computed. In these cases, the program identified lengths often precisely matched the lengths of the simulated cDNAs (Supplementary Figure 28A).
Next, we compared the program identified chimeric configuration of each cDNA (output of Step 4 of RNA-HiC-Tools) with the synthesized configuration. In Step "4. Parsing the chimeric cDNAs", the algorithm assigned the cDNAs into five categories, based on the presence of the linker sequence. The algorithm reached 99.89% sensitivity and 95.82% specificity for the cDNAs in the "RNA1-linker-RNA2" form (Supplementary Table 6).
Lastly, we compared the program identified and the simulated RNA-RNA interactions. The simulated dataset contained 200,200 chimeric RNA pairs, among which 131,571 pairs of RNAs were detected (sensitivity = 65.72%, specificity = 92.57%, Supplementary Figure 28C). We also separately calculated the sensitivity and specificity for interactions of each type of RNAs (Supplementary Figure 28C). Regardless of the types of participating RNAs, the method showed few false positives (specificity ≥ 90%). Interactions that did not involve transposon RNA or snRNA exhibited fewer false negatives than those that did. This was due to the repetitive nature of transposon and snRNA sequences. The worst cases involved LINE RNAs, where sensitivities dropped to 52%. We therefore conservatively estimate that about a half of the interactions involving transposon RNAs could have been missed by this procedure. We estimate that about 2/3 to 3/4 of the interactions that do not involve transposon RNAs would have been identified.