Abstract
Structure probing coupled with high-throughput sequencing could revolutionize our understanding of the role of RNA structure in regulation of gene expression. Despite recent technological advances, intrinsic noise and high sequence coverage requirements greatly limit the applicability of these techniques. Here we describe a probabilistic modeling pipeline that accounts for biological variability and biases in the data, yielding statistically interpretable scores for the probability of nucleotide modification transcriptome wide. Using two yeast data sets, we demonstrate that our method has increased sensitivity, and thus our pipeline identifies modified regions on many more transcripts than do existing pipelines. Our method also provides confident predictions at much lower sequence coverage levels than those recommended for reliable structural probing. Our results show that statistical modeling extends the scope and potential of transcriptome-wide structure probing experiments.
Acknowledgements
We thank all the members of the Granneman and Sanguinetti labs for critically reading the manuscript. This work was supported by grants from the Wellcome Trust to S.G. (091549) and I.I. (102334), a European Research Council grant to G.S. (MLC306999) and the Wellcome Trust Centre for Cell Biology core grant (092076). A.S. is supported in part by grants from the UK Engineering and Physical Sciences Research Council, Biological Sciences Research Council, and the UK Medical Research Council (EP/F500385/1 and BB/F529254/1 to the University of Edinburgh Doctoral Training Centre in Neuroinformatics and Computational Neuroscience). Next generation sequencing was carried out by Edinburgh Genomics, The University of Edinburgh. Edinburgh Genomics is partly supported through core grants from NERC (R8/H10/56), MRC (MR/K001744/1) and BBSRC (BB/J004243/1).
Author information
Contributions
All authors contributed to planning the experiments and computational procedures. C.S., I.I., and S.G. carried out the experiments. G.S. and A.S. developed the computational analysis pipeline. A.S., C.S., S.G., and G.S. performed the bioinformatics and computational analyses of the sequencing data. All authors contributed to writing the manuscript and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 ChemModSeq library preparation design.
Chemically probed RNAs were reverse transcribed with an oligonucleotide containing a random hexamer and an Illumina-compatible sequence for PCR amplification. Subsequently, adapters containing six random nucleotides and a six-nucleotide barcode followed by another random nucleotide were ligated to the 3’ end of the cDNAs. The final random nucleotide was included to minimize the sequence representation bias introduced during the CircLigase ligation reaction. The six random nucleotides were used to eliminate potential PCR duplicates. Indexing barcodes were added to the 3’ adapter sequence by PCR. The in-read barcodes in the 5’ end of the PCR product were processed using pyBarcodeFilter.py and reads were collapsed using pyFastqDuplicateRemover.py from the pyCRAC package (ref. 1).
1. Webb, S., Hector, R.D., Kudla, G. & Granneman, S. PAR-CLIP data indicate that Nrd1-Nab3-dependent transcription termination regulates expression of hundreds of protein coding genes in yeast. Genome Biol. 15, R8 (2014).
Supplementary Figure 2 Coverage- and sequence-dependent biases were identified in the transcriptome data set.
(a, b) Presence of a coverage-dependent bias, reflected by the dependency between the LDR and the mean coverage at each nucleotide position in a pair of control replicate samples, for all such pairs, computed from the yeast transcriptome-wide data set on both strands. (c, d) Same dependency plotted as in (a, b) after applying a bias-correcting strategy to the LDRs. (e, f) Presence of a sequence-dependent bias, reflected by differing null distributions of LDRs. Each boxplot represents the null distribution (y-axis shows LDR) computed only for the nucleotide positions corresponding to a given trinucleotide pattern (indicated on the x-axis).
Supplementary Figure 3 Distributions of empirical P values for the transcriptome data set closely follow the Beta-Uniform distribution on both strands.
The histograms show the distributions of empirical P values associated with LDRs between all combinations of treatment and control samples on the transcriptome data set for both strands.
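As a rough illustration of what an empirical P value means in this setting, the sketch below compares one treatment-control LDR with a null distribution built from control-control LDRs. It assumes that LDR denotes the log ratio of drop-off rates between two samples; the function names `ldr` and `empirical_p_value` and all numbers are hypothetical and are not taken from the authors' pipeline.

```python
import math
from bisect import bisect_left


def ldr(rate_a, rate_b):
    """Log ratio of two drop-off rates (assumed definition of LDR)."""
    return math.log(rate_a / rate_b)


def empirical_p_value(observed, null_values):
    """Fraction of null LDRs that are at least as large as the observed LDR."""
    null_sorted = sorted(null_values)
    # Index of the first null value that is >= observed.
    first_ge = bisect_left(null_sorted, observed)
    return (len(null_sorted) - first_ge) / len(null_sorted)


# Hypothetical null distribution: LDRs between pairs of control replicates.
null_ldrs = [-0.4, -0.2, -0.1, 0.0, 0.05, 0.1, 0.2, 0.3, 0.5, 0.9]

# Hypothetical treatment and control drop-off rates at one nucleotide.
observed = ldr(0.06, 0.03)
p = empirical_p_value(observed, null_ldrs)
print(f"LDR = {observed:.3f}, empirical P = {p:.2f}")
```

If no nucleotide were modified, P values computed this way would be expected to be roughly uniform on [0, 1]; an excess of small P values is what the Beta component of a Beta-Uniform mixture is meant to capture.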
Supplementary Figure 4 BUM-HMM correctly identifies many flexible A’s and C’s as modified nucleotides.
Secondary structures of the 18S ribosomal RNA with bases colored according to the reactivity score or posterior probability at the corresponding nucleotide position, generated by BUM-HMM, ∆TCR, Mod-seq, and structure-seq analysis pipelines on the data set using a DMS probe.
Supplementary Figure 5 Using BUM-HMM output as constraints results in more consistent secondary structure prediction across different methods.
(a) Distribution of Hamming distances between the structures predicted for SCM4 by Fold (n=20) and by MaxExpect (n=3 with sequence, n=1 with BUM-HMM) when using only sequence (blue) and when adding the BUM-HMM output as constraints (red). (b, c) Same as in (a), for RPL37A (b) and RPL19B (c) (with Fold, n=20 structures were generated, with MaxExpect, n=1 structure).
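The comparison in this figure can be sketched as a position-wise Hamming distance between predicted structures. The example below assumes the structures are written in dot-bracket notation and are all the same length; the helper names and the toy structures are hypothetical.

```python
from itertools import combinations


def hamming_distance(struct_a, struct_b):
    """Number of positions at which two equal-length structures differ."""
    if len(struct_a) != len(struct_b):
        raise ValueError("structures must have the same length")
    return sum(a != b for a, b in zip(struct_a, struct_b))


def pairwise_distances(structures):
    """Hamming distances for every unordered pair of structures."""
    return [hamming_distance(a, b) for a, b in combinations(structures, 2)]


# Toy dot-bracket structures for one six-nucleotide RNA (made up).
predicted = ["((..))", "(....)", "......"]
distances = pairwise_distances(predicted)
print(distances)
```

A tighter distribution of these distances would indicate that the different prediction methods agree more closely with one another.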
Supplementary Figure 6 BUM-HMM retains good accuracy at 18S secondary structure reconstruction at lower coverage levels.
Agreement with the 18S crystal structure of the posterior probabilities generated by BUM-HMM on data sets with progressively lower mean coverage (shown on the x-axis), synthesized from the DMS data set for the 18S ribosomal RNA. Agreement was measured with the AUC statistic (shown on the y-axis) between the binary ‘ground truth’ matrix derived from the crystal structure and the generated probabilities for each synthetic data set. The subsets of 2 million, 1 million, 100,000, 30,000, 20,000, 10,000, and 1,000 reads (corresponding to 7 progressively reducing coverage levels) were randomly selected from the full data set 10 times for each coverage level. The error bars quantify the variability in the agreement of the BUM-HMM predictions with the crystal structure across these 10 selections for each coverage level.
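The evaluation described here can be sketched in two steps: draw a random subset of reads, then score the resulting probabilities against binary labels with the AUC. The code below is a minimal sketch, not the authors' implementation. It computes the AUC as the fraction of (positive, negative) pairs ranked correctly and assumes that the ground truth matrix has been flattened to one 0/1 label per nucleotide; `subsample_reads`, `auc`, and the toy data are hypothetical.

```python
import random


def subsample_reads(reads, n, seed):
    """Randomly select n reads without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(reads, n)


def auc(labels, scores):
    """AUC as the share of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels (1 = positive class in the ground truth) and posterior probabilities.
labels = [1, 0, 1, 0]
posteriors = [0.9, 0.8, 0.2, 0.1]
score = auc(labels, posteriors)

# Ten seeded subsets of five reads each, mimicking the repeated selections.
reads = [f"read_{i}" for i in range(100)]
subsets = [subsample_reads(reads, 5, seed) for seed in range(10)]
print(score, len(subsets))
```

In an analysis like the one in the figure, the model would be rerun on each subset and the spread of the resulting AUC values would give the error bars.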
Supplementary Figure 7 The ∆TCR algorithm produces very high numbers in regions with low coverage.
Shown is a genome browser image of a gene (YHB1) with an FPKM of 190. The red dotted box marks a low-coverage region near the 3’ end of the gene. The top two panels show the ∆TCR output, with the second panel displaying the same data scaled to a maximum ∆TCR value of 0.025. The third panel shows the BUM-HMM posterior probabilities for the same region. The last four panels show the cDNA coverage over the gene in the two control and the two NAI-treated RNA sequencing data sets.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–7 and Supplementary Tables 1 and 2. (PDF 2409 kb)
Supplementary Table 3
KEGG pathway analysis of the k-means clusters shown in Fig. 4d. (XLSX 153 kb)
About this article
Cite this article
Selega, A., Sirocchi, C., Iosub, I. et al. Robust statistical modeling improves sensitivity of high-throughput RNA structure probing experiments. Nat Methods 14, 83–89 (2017). https://doi.org/10.1038/nmeth.4068
DOI: https://doi.org/10.1038/nmeth.4068
This article is cited by
- Differential analysis of RNA structure probing experiments at nucleotide resolution: uncovering regulatory functions of RNA structure. Nature Communications (2022)
- diffBUM-HMM: a robust statistical modeling approach for detecting RNA flexibility changes in high-throughput structure probing data. Genome Biology (2021)
- Prediction and differential analysis of RNA secondary structure. Quantitative Biology (2020)
- dStruct: identifying differentially reactive regions from RNA structurome profiling data. Genome Biology (2019)
- reactIDR: evaluation of the statistical reproducibility of high-throughput structural analyses towards a robust RNA structure prediction. BMC Bioinformatics (2019)