Robust statistical modeling improves sensitivity of high-throughput RNA structure probing experiments

Selega, Alina; Sirocchi, Christel; Iosub, Ira; Granneman, Sander; Sanguinetti, Guido

doi:10.1038/nmeth.4068

Article
Published: 07 November 2016

Alina Selega¹, Christel Sirocchi², Ira Iosub², Sander Granneman² (ORCID: orcid.org/0000-0003-4387-1271) & Guido Sanguinetti¹,²

Nature Methods volume 14, pages 83–89 (2017)

Abstract

Structure probing coupled with high-throughput sequencing could revolutionize our understanding of the role of RNA structure in regulation of gene expression. Despite recent technological advances, intrinsic noise and high sequence coverage requirements greatly limit the applicability of these techniques. Here we describe a probabilistic modeling pipeline that accounts for biological variability and biases in the data, yielding statistically interpretable scores for the probability of nucleotide modification transcriptome wide. Using two yeast data sets, we demonstrate that our method has increased sensitivity, and thus our pipeline identifies modified regions on many more transcripts than do existing pipelines. Our method also provides confident predictions at much lower sequence coverage levels than those recommended for reliable structural probing. Our results show that statistical modeling extends the scope and potential of transcriptome-wide structure probing experiments.

**Figure 1: Overview of the BUM-HMM computational analysis pipeline.**

**Figure 2: BUM-HMM identifies many modified nucleotides of 18S ribosomal RNA with high accuracy and specificity.**

**Figure 3: Using BUM-HMM output results in more consistent secondary structure prediction.**

**Figure 4: BUM-HMM is highly consistent at low coverage and calls more nucleotides modified at all coverage levels.**

**Figure 5: Flexibility of 5′ UTR and ribosome occupancy do not show a significant positive correlation *in vivo*.**

Accession codes

Primary accessions

Gene Expression Omnibus

GSE78208

Referenced accessions

Gene Expression Omnibus

GSE52878

Acknowledgements

We thank all the members of the Granneman and Sanguinetti labs for critically reading the manuscript. This work was supported by grants from the Wellcome Trust to S.G. (091549) and I.I. (102334), a European Research Council grant to G.S. (MLC306999) and the Wellcome Trust Centre for Cell Biology core grant (092076). A.S. is supported in part by grants from the UK Engineering and Physical Sciences Research Council, Biological Sciences Research Council, and the UK Medical Research Council (EP/F500385/1 and BB/F529254/1 to the University of Edinburgh Doctoral Training Centre in Neuroinformatics and Computational Neuroscience). Next generation sequencing was carried out by Edinburgh Genomics, The University of Edinburgh. Edinburgh Genomics is partly supported through core grants from NERC (R8/H10/56), MRC (MR/K001744/1) and BBSRC (BB/J004243/1).

Author information

Authors and Affiliations

School of Informatics, University of Edinburgh, Edinburgh, UK
Alina Selega & Guido Sanguinetti
Centre for Synthetic and Systems Biology, University of Edinburgh, Edinburgh, UK
Christel Sirocchi, Ira Iosub, Sander Granneman & Guido Sanguinetti

Contributions

All authors contributed to planning the experiments and computational procedures. C.S., I.I., and S.G. carried out the experiments. G.S. and A.S. developed the computational analysis pipeline. A.S., C.S., S.G., and G.S. performed the bioinformatics and computational analyses of the sequencing data. All authors contributed to writing the manuscript and approved the final manuscript.

Corresponding authors

Correspondence to Sander Granneman or Guido Sanguinetti.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 ChemModSeq library preparation design.

Chemically probed RNAs were reverse transcribed with an oligonucleotide containing a random hexamer and an Illumina-compatible sequence for PCR amplification. Subsequently, adapters containing six random nucleotides and a six-nucleotide barcode followed by another random nucleotide were ligated to the 3′ end of the cDNAs. The final random nucleotide was added to minimize sequence representation bias introduced during the CircLigase ligation reaction. The six random nucleotides were used to eliminate potential PCR duplicates. Indexing barcodes were added to the 3′ adapter sequence by PCR. The in-read barcodes at the 5′ end of the PCR product were processed using pyBarcodeFilter.py, and reads were collapsed using pyFastqDuplicateRemover.py from the pyCRAC package¹.

1. Webb, S., Hector, R.D., Kudla, G. & Granneman, S. PAR-CLIP data indicate that Nrd1-Nab3-dependent transcription termination regulates expression of hundreds of protein coding genes in yeast. Genome Biol. 15, R8 (2014).
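
The role of the six random nucleotides in removing PCR duplicates can be illustrated with a minimal Python sketch. It is not the pyCRAC implementation: the function name `collapse_pcr_duplicates` and the `(random_barcode, cdna_sequence)` tuple layout are assumptions made for illustration only. Reads that share both the random barcode and the cDNA sequence are treated as copies of one ligation event and collapsed into a single read.

```python
def collapse_pcr_duplicates(reads):
    """Keep the first occurrence of each (random_barcode, cdna_sequence) pair.

    `reads` is an iterable of (random_barcode, cdna_sequence) tuples; this
    layout is a hypothetical simplification of a parsed FASTQ record.
    """
    seen = set()
    unique = []
    for random_barcode, cdna_sequence in reads:
        key = (random_barcode, cdna_sequence)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


reads = [
    ("ACGTAC", "GGCTAAC"),
    ("ACGTAC", "GGCTAAC"),  # same barcode and sequence: treated as a PCR duplicate
    ("TTGACA", "GGCTAAC"),  # same sequence, different barcode: kept as a distinct molecule
]
unique_reads = collapse_pcr_duplicates(reads)
```

Keying on the barcode as well as the sequence is what lets identical cDNA fragments from different molecules survive the collapse.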

Supplementary Figure 2 Coverage- and sequence-dependent biases were identified in the transcriptome data set.

(a, b) Presence of a coverage-dependent bias, reflected by the dependency between the LDR and the mean coverage at each nucleotide position in a pair of control replicate samples, for all such pairs, computed from the yeast transcriptome-wide data set on both strands. (c, d) Same dependency plotted as in (a, b) after applying a bias-correcting strategy to the LDRs. (e, f) Presence of a sequence-dependent bias, reflected by differing null distributions of LDRs. Each boxplot represents the null distribution (y-axis shows LDR) computed only for the nucleotide positions corresponding to a given trinucleotide pattern (indicated on the x-axis).
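
As a rough illustration of the quantity plotted in panels (a, b), the sketch below pairs an LDR with the mean coverage at each position for two control replicates. It assumes that the LDR is the natural logarithm of the ratio of drop-off rates and that a drop-off rate is the drop-off count divided by coverage; neither definition is stated in this legend. The function names, the `(drop_off_count, coverage)` input layout and the rule for skipping undefined positions are likewise hypothetical.

```python
import math


def ldr(drop_offs_a, coverage_a, drop_offs_b, coverage_b):
    """Log-ratio of drop-off rates (assumed definition); None when undefined."""
    if min(drop_offs_a, coverage_a, drop_offs_b, coverage_b) <= 0:
        return None
    rate_a = drop_offs_a / coverage_a
    rate_b = drop_offs_b / coverage_b
    return math.log(rate_a / rate_b)


def ldr_vs_mean_coverage(replicate_a, replicate_b):
    """Return (mean_coverage, LDR) pairs for positions where the LDR is defined.

    Each replicate is a list of (drop_off_count, coverage) tuples, one per
    nucleotide position, in the same order in both replicates.
    """
    points = []
    for (k_a, n_a), (k_b, n_b) in zip(replicate_a, replicate_b):
        value = ldr(k_a, n_a, k_b, n_b)
        if value is not None:
            points.append(((n_a + n_b) / 2, value))
    return points


control_1 = [(10, 100), (0, 50)]
control_2 = [(10, 200), (5, 50)]
points = ldr_vs_mean_coverage(control_1, control_2)
```

Plotting such pairs for every pair of control replicates would give a scatter of the kind described here, in which a trend of LDR with coverage indicates a coverage-dependent bias.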

Supplementary Figure 3 Distributions of empirical P values for the transcriptome data set closely follow the Beta-Uniform distribution on both strands.

The histograms show the distributions of empirical P values associated with LDRs between all combinations of treatment and control samples on the transcriptome data set for both strands.
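
To make the two ingredients of this legend concrete, the following sketch computes one-sided empirical P values against a null sample and evaluates a Beta-Uniform mixture density. It rests on assumptions not stated here: that the null LDRs come from control-replicate comparisons, that the P value is the fraction of null values at or above the observed LDR, and that the mixture is `lam * Uniform(0, 1) + (1 - lam) * Beta(a, b)`. The parameter names `lam`, `a` and `b` are illustrative, and the sketch covers only the mixture density, not the full BUM-HMM model.

```python
import bisect
import math


def empirical_p_values(null_ldrs, observed_ldrs):
    """Fraction of null LDRs greater than or equal to each observed LDR."""
    null_sorted = sorted(null_ldrs)
    n = len(null_sorted)
    p_values = []
    for x in observed_ldrs:
        # bisect_left counts the null values strictly below x
        at_or_above = n - bisect.bisect_left(null_sorted, x)
        p_values.append(at_or_above / n)
    return p_values


def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p, for 0 < p < 1."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return p ** (a - 1) * (1 - p) ** (b - 1) / norm


def bum_density(p, lam, a, b=1.0):
    """Beta-Uniform mixture density: weight lam on Uniform(0, 1)."""
    return lam * 1.0 + (1 - lam) * beta_pdf(p, a, b)


p_vals = empirical_p_values([0.0, 1.0, 2.0, 3.0], [2.5, 5.0])
density = bum_density(0.25, lam=0.5, a=2.0, b=1.0)
```

Under this reading, P values from unmodified positions contribute the flat uniform component of the histogram, while modified positions contribute the Beta component concentrated near zero.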

Supplementary Figure 4 BUM-HMM correctly identifies many flexible A’s and C’s as modified nucleotides.

Secondary structures of the 18S ribosomal RNA with bases colored according to the reactivity score or posterior probability at the corresponding nucleotide position, generated by BUM-HMM, ∆TCR, Mod-seq, and structure-seq analysis pipelines on the data set using a DMS probe.

Supplementary Figure 5 Using BUM-HMM output as constraints results in more consistent secondary structure prediction across different methods.

(a) Distribution of Hamming distances between the structures predicted for SCM4 by Fold (n=20) and by MaxExpect (n=3 with sequence, n=1 with BUM-HMM) when using only sequence (blue) and when adding the BUM-HMM output as constraints (red). (b, c) Same as in (a), for RPL37A (b) and RPL19B (c) (with Fold, n=20 structures were generated, with MaxExpect, n=1 structure).
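
A Hamming distance between two predicted structures can be sketched as follows, assuming the structures are given in dot-bracket notation and compared character by character. The encoding used for the published comparison is not specified in this legend, and the function names are hypothetical.

```python
from itertools import product


def hamming_distance(structure_a, structure_b):
    """Number of positions at which two equal-length structure strings differ."""
    if len(structure_a) != len(structure_b):
        raise ValueError("structures must have the same length")
    return sum(a != b for a, b in zip(structure_a, structure_b))


def cross_distances(structures_a, structures_b):
    """Hamming distances between every structure in one set and every structure in another."""
    return [hamming_distance(a, b) for a, b in product(structures_a, structures_b)]


# Toy dot-bracket strings standing in for the outputs of two prediction methods
method_1 = ["((..))", "......"]
method_2 = ["(....)"]
distances = cross_distances(method_1, method_2)
```

A distribution shifted toward smaller distances would then indicate closer agreement between the two prediction methods.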

Supplementary Figure 6 BUM-HMM retains good accuracy at 18S secondary structure reconstruction at lower coverage levels.

Agreement with the 18S crystal structure of the posterior probabilities generated by BUM-HMM on data sets with progressively lower mean coverage (shown on the x-axis), synthesized from the DMS data set for the 18S ribosomal RNA. Agreement was measured with the AUC statistic (shown on the y-axis) between the binary ‘ground truth’ matrix derived from the crystal structure and the generated probabilities for each synthetic data set. The subsets of 2 million, 1 million, 100,000, 30,000, 20,000, 10,000, and 1,000 reads (corresponding to 7 progressively reducing coverage levels) were randomly selected from the full data set 10 times for each coverage level. The error bars quantify the variability in the agreement of the BUM-HMM predictions with the crystal structure across these 10 selections for each coverage level.
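
The evaluation scheme described here, random subsampling followed by an AUC against a binary ground truth, can be sketched in plain Python. This is an illustration under stated assumptions rather than the code used for the figure: the AUC is computed as the Mann-Whitney rank statistic, and `subsample_reads` and its `seed` argument are hypothetical helpers.

```python
import random


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def subsample_reads(reads, n_reads, seed):
    """Draw n_reads reads without replacement, reproducibly for a given seed."""
    return random.Random(seed).sample(reads, n_reads)


# Toy ground truth (1 = flexible in the reference structure) and toy posterior probabilities
truth = [1, 1, 0, 0]
posteriors = [0.9, 0.2, 0.8, 0.1]
score = auc(truth, posteriors)
subset = subsample_reads(list(range(1000)), 100, seed=0)
```

Repeating the subsampling with ten different seeds at each read depth and recomputing the AUC each time would yield the spread that the error bars summarize.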

Supplementary Figure 7 The ∆TCR algorithm produces very high numbers in regions with low coverage.

Shown is a genome browser image of a gene (YHB1) with an FPKM of 190. The red dotted box shows a region near the 3′ end of the gene where coverage is low. The top two panels show the ∆TCR output, with the second panel displaying the same data scaled to a maximum ∆TCR value of 0.025. The third panel shows the BUM-HMM posterior probabilities for the same region. The last four panels show the cDNA coverage over the gene from the two control RNA sequencing data sets and the two NAI-treated sequencing data sets.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–7 and Supplementary Tables 1 and 2. (PDF 2409 kb)

Supplementary Table 3

KEGG pathway analysis of the k-means clusters shown in Fig. 4d. (XLSX 153 kb)

Source data

Source data to Fig. 1

Source data to Fig. 2

Source data to Fig. 3

About this article

Cite this article

Selega, A., Sirocchi, C., Iosub, I. et al. Robust statistical modeling improves sensitivity of high-throughput RNA structure probing experiments. Nat Methods 14, 83–89 (2017). https://doi.org/10.1038/nmeth.4068

Received: 25 March 2016
Accepted: 03 October 2016
Published: 07 November 2016
Issue Date: January 2017
DOI: https://doi.org/10.1038/nmeth.4068

