Robust statistical modeling improves sensitivity of high-throughput RNA structure probing experiments

Abstract

Structure probing coupled with high-throughput sequencing could revolutionize our understanding of the role of RNA structure in regulation of gene expression. Despite recent technological advances, intrinsic noise and high sequence coverage requirements greatly limit the applicability of these techniques. Here we describe a probabilistic modeling pipeline that accounts for biological variability and biases in the data, yielding statistically interpretable scores for the probability of nucleotide modification transcriptome wide. Using two yeast data sets, we demonstrate that our method has increased sensitivity, and thus our pipeline identifies modified regions on many more transcripts than do existing pipelines. Our method also provides confident predictions at much lower sequence coverage levels than those recommended for reliable structural probing. Our results show that statistical modeling extends the scope and potential of transcriptome-wide structure probing experiments.

Figure 1: Overview of the BUM-HMM computational analysis pipeline.
Figure 2: BUM-HMM identifies many modified nucleotides of 18S ribosomal RNA with high accuracy and specificity.
Figure 3: Using BUM-HMM output results in more consistent secondary structure prediction.
Figure 4: BUM-HMM is highly consistent at low coverage and calls more nucleotides modified at all coverage levels.
Figure 5: Flexibility of 5′ UTR and ribosome occupancy do not show a significant positive correlation in vivo.

Accession codes

Primary accessions

Gene Expression Omnibus

Referenced accessions

Gene Expression Omnibus

Acknowledgements

We thank all the members of the Granneman and Sanguinetti labs for critically reading the manuscript. This work was supported by grants from the Wellcome Trust to S.G. (091549) and I.I. (102334), a European Research Council grant to G.S. (MLC306999) and the Wellcome Trust Centre for Cell Biology core grant (092076). A.S. is supported in part by grants from the UK Engineering and Physical Sciences Research Council, Biological Sciences Research Council, and the UK Medical Research Council (EP/F500385/1 and BB/F529254/1 to the University of Edinburgh Doctoral Training Centre in Neuroinformatics and Computational Neuroscience). Next generation sequencing was carried out by Edinburgh Genomics, The University of Edinburgh. Edinburgh Genomics is partly supported through core grants from NERC (R8/H10/56), MRC (MR/K001744/1) and BBSRC (BB/J004243/1).

Author information

Contributions

All authors contributed to planning the experiments and computational procedures. C.S., I.I., and S.G. carried out the experiments. G.S. and A.S. developed the computational analysis pipeline. A.S., C.S., S.G., and G.S. performed the bioinformatics and computational analyses of the sequencing data. All authors contributed to writing the manuscript and approved the final manuscript.

Corresponding authors

Correspondence to Sander Granneman or Guido Sanguinetti.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 ChemModSeq library preparation design.

Chemically probed RNAs were reverse transcribed with an oligonucleotide containing a random hexamer and an Illumina-compatible sequence for PCR amplification. Subsequently, adapters containing six random nucleotides and a six-nucleotide barcode followed by another random nucleotide were ligated to the 3′ end of the cDNAs. The final random nucleotide was introduced to minimize sequence-representation bias arising during the CircLigase ligation reaction. The six random nucleotides were used to eliminate potential PCR duplicates. Indexing barcodes were added to the 3′ adapter sequence by PCR. The in-read barcodes at the 5′ end of the PCR product were processed using pyBarcodeFilter.py, and reads were collapsed using pyFastqDuplicateRemover.py from the pyCRAC package (ref. 1).

1. Webb, S., Hector, R.D., Kudla, G. & Granneman, S. PAR-CLIP data indicate that Nrd1-Nab3-dependent transcription termination regulates expression of hundreds of protein coding genes in yeast. Genome Biol. 15, R8 (2014).
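The duplicate-removal idea described in this legend can be sketched in a few lines of Python. This is only an illustration of the principle, not the pyCRAC implementation: the function name `collapse_duplicates` and the representation of each read as a (random barcode, cDNA sequence) pair are assumptions made for the example. Reads that share both the random barcode and the sequence are treated as PCR duplicates, and only the first is kept.

```python
def collapse_duplicates(reads):
    """Keep one read per (random barcode, sequence) combination.

    `reads` is an iterable of (random_barcode, sequence) tuples.
    Hypothetical helper for illustration; not part of pyCRAC.
    """
    seen = set()
    unique = []
    for barcode, sequence in reads:
        key = (barcode, sequence)
        if key not in seen:
            # First time this barcode/sequence pair is observed: keep it.
            seen.add(key)
            unique.append(key)
    return unique


reads = [
    ("AAAAAA", "ACGT"),  # original molecule
    ("AAAAAA", "ACGT"),  # same barcode and sequence: likely PCR duplicate
    ("CCCCCC", "ACGT"),  # same sequence, different barcode: kept
]
collapsed = collapse_duplicates(reads)
```

Reads with identical sequence but different random barcodes are retained, because they most likely derive from distinct cDNA molecules.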

Supplementary Figure 2 Coverage- and sequence-dependent biases were identified in the transcriptome data set.

(a, b) Presence of a coverage-dependent bias, reflected by the dependency between the LDR and the mean coverage at each nucleotide position in a pair of control replicate samples, for all such pairs, computed from the yeast transcriptome-wide data set on both strands. (c, d) Same dependency plotted as in (a, b) after applying a bias-correcting strategy to the LDRs. (e, f) Presence of a sequence-dependent bias, reflected by differing null distributions of LDRs. Each boxplot represents the null distribution (y-axis shows LDR) computed only for the nucleotide positions corresponding to a given trinucleotide pattern (indicated on the x-axis).

Supplementary Figure 3 Distributions of empirical P values for the transcriptome data set closely follow the Beta-Uniform distribution on both strands.

The histograms show the distributions of empirical P values associated with LDRs between all combinations of treatment and control samples on the transcriptome data set for both strands.
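One common way to obtain empirical P values of this kind is sketched below. It assumes that the null distribution is the set of LDRs from control-control comparisons and that the P value of an observed treatment-control LDR is the fraction of null values at least as large, with a +1 pseudocount so that no P value is exactly zero. The function name and the pseudocount are assumptions made for the example and may differ from the BUM-HMM pipeline.

```python
from bisect import bisect_left


def empirical_p_values(null_ldrs, observed_ldrs):
    """Return one empirical P value per observed LDR.

    P = (number of null LDRs >= observed + 1) / (number of null LDRs + 1).
    Hypothetical helper for illustration only.
    """
    null_sorted = sorted(null_ldrs)
    n = len(null_sorted)
    p_values = []
    for x in observed_ldrs:
        # bisect_left gives the count of null values strictly below x,
        # so n minus that count is the number of null values >= x.
        n_at_least = n - bisect_left(null_sorted, x)
        p_values.append((n_at_least + 1) / (n + 1))
    return p_values


null = [0.0, 1.0, 2.0, 3.0]
p = empirical_p_values(null, [2.5, -1.0, 10.0])
```

Large observed LDRs relative to the null yield small P values, which is the signal the downstream model works with.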

Supplementary Figure 4 BUM-HMM correctly identifies many flexible A’s and C’s as modified nucleotides.

Secondary structures of the 18S ribosomal RNA with bases colored according to the reactivity score or posterior probability at the corresponding nucleotide position, generated by BUM-HMM, ∆TCR, Mod-seq, and structure-seq analysis pipelines on the data set using a DMS probe.

Supplementary Figure 5 Using BUM-HMM output as constraints results in more consistent secondary structure prediction across different methods.

(a) Distribution of Hamming distances between the structures predicted for SCM4 by Fold (n=20) and by MaxExpect (n=3 with sequence, n=1 with BUM-HMM) when using only sequence (blue) and when adding the BUM-HMM output as constraints (red). (b, c) Same as in (a), for RPL37A (b) and RPL19B (c) (with Fold, n=20 structures were generated; with MaxExpect, n=1 structure).
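The Hamming distance used in this comparison can be illustrated as follows. The sketch assumes that each predicted structure is encoded as a dot-bracket string of equal length and that structures are compared position by position; the legend does not state the encoding, so this choice and the function names are assumptions.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("structures must have the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming(structures):
    """Hamming distances for all unordered pairs of structures."""
    return [hamming(a, b) for a, b in combinations(structures, 2)]


# Toy dot-bracket structures for one short sequence (illustrative only).
structures = ["((..))", "(....)", "......"]
distances = pairwise_hamming(structures)
```

A tighter distribution of these distances indicates that the predicted structures agree more closely with one another.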

Supplementary Figure 6 BUM-HMM retains good accuracy at 18S secondary structure reconstruction at lower coverage levels.

Agreement with the 18S crystal structure of the posterior probabilities generated by BUM-HMM on data sets with progressively lower mean coverage (shown on the x-axis), synthesized from the DMS data set for the 18S ribosomal RNA. Agreement was measured with the AUC statistic (shown on the y-axis) between the binary ‘ground truth’ matrix derived from the crystal structure and the generated probabilities for each synthetic data set. The subsets of 2 million, 1 million, 100,000, 30,000, 20,000, 10,000, and 1,000 reads (corresponding to 7 progressively reducing coverage levels) were randomly selected from the full data set 10 times for each coverage level. The error bars quantify the variability in the agreement of the BUM-HMM predictions with the crystal structure across these 10 selections for each coverage level.
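The evaluation described in this legend combines random subsampling of reads with an AUC computed against binary ground-truth labels. The sketch below shows both steps using only the Python standard library. The function names, the pairwise (Mann-Whitney style) AUC computation, and the use of a fixed seed are assumptions made for illustration; they are not taken from the authors' code.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, counting ties as one half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def subsample_reads(reads, n, seed):
    """Draw n reads without replacement, reproducibly for a given seed."""
    return random.Random(seed).sample(reads, n)


perfect = auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
chance = auc([1, 0, 1, 0], [0.9, 0.9, 0.1, 0.1])
subset = subsample_reads(list(range(1000)), 100, seed=0)
```

Repeating `subsample_reads` with ten different seeds per coverage level and recomputing the AUC each time would give the spread that the error bars summarize.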

Supplementary Figure 7 The ∆TCR algorithm produces very high numbers in regions with low coverage.

Shown is a genome browser image of a gene (YHB1) with an FPKM of 190. The red-dotted box shows a region near the 3′ end of the gene where coverage is low. The top two panels show the ∆TCR output, with the second panel displaying the same data but scaled to a maximum ∆TCR value of 0.025. The third panel shows the BUM-HMM posterior probabilities for the same region. The last four panels show the cDNA coverage over the gene from the two control RNA sequencing data sets and the two NAI-treated sequencing data sets.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–7 and Supplementary Tables 1 and 2. (PDF 2409 kb)

Supplementary Table 3

KEGG pathway analysis of the k-means clusters shown in Fig. 4d. (XLSX 153 kb)

About this article

Cite this article

Selega, A., Sirocchi, C., Iosub, I. et al. Robust statistical modeling improves sensitivity of high-throughput RNA structure probing experiments. Nat Methods 14, 83–89 (2017). https://doi.org/10.1038/nmeth.4068

  • DOI: https://doi.org/10.1038/nmeth.4068
