Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation

Love, Michael I; Hogenesch, John B; Irizarry, Rafael A

doi:10.1038/nbt.3682

Letter
Published: 26 September 2016

Nature Biotechnology volume 34, pages 1287–1291 (2016)

Abstract

We find that current computational methods for estimating transcript abundance from RNA-seq data can lead to hundreds of false-positive results. We show that these systematic errors stem largely from a failure to model fragment GC content bias. Sample-specific biases associated with fragment sequence features lead to misidentification of transcript isoforms. We introduce alpine, a method for estimating sample-specific bias-corrected transcript abundance. By incorporating fragment sequence features, alpine greatly increases the accuracy of transcript abundance estimates, enabling a fourfold reduction in the number of false positives for reported changes in expression compared with Cufflinks. Using simulated data, we also show that alpine retains the ability to discover true positives, similar to other approaches. The method is available as an R/Bioconductor package that includes data visualization tools useful for bias discovery.
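To make the idea of fragment GC content bias concrete, the following is a toy sketch and not the alpine model, which the abstract does not spell out. It assumes a single transcript, a fixed fragment length, and a uniform expectation over fragment start positions. It then compares the observed share of fragments in each GC-content bin with the share expected under uniform sampling. All function names (`gc_content`, `potential_fragments`, `gc_bias_ratios`) and the example sequence are hypothetical and chosen for illustration only.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of G or C bases in a sequence (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def potential_fragments(transcript: str, frag_len: int):
    """All (start, sequence) pairs for fragments of a fixed length."""
    last_start = len(transcript) - frag_len
    return [(i, transcript[i:i + frag_len]) for i in range(last_start + 1)]


def gc_bin(gc: float, n_bins: int) -> int:
    """Map a GC fraction in [0, 1] to a bin index in [0, n_bins - 1]."""
    return min(int(gc * n_bins), n_bins - 1)


def gc_bias_ratios(transcript, observed_starts, frag_len, n_bins=4):
    """Observed / expected fragment share per GC bin.

    The expectation assumes every start position is equally likely
    (an assumption of this sketch). A ratio above 1 means fragments in
    that GC bin are over-represented; below 1, under-represented.
    """
    frags = potential_fragments(transcript, frag_len)
    if not frags or not observed_starts:
        raise ValueError("need at least one potential and one observed fragment")
    bin_of_start = {start: gc_bin(gc_content(seq), n_bins) for start, seq in frags}
    potential = Counter(bin_of_start.values())
    observed = Counter(bin_of_start[s] for s in observed_starts)
    n_pot, n_obs = len(frags), len(observed_starts)
    return {
        b: (observed.get(b, 0) / n_obs) / (count / n_pot)
        for b, count in potential.items()
    }


# Hypothetical transcript: an AT-rich half followed by a GC-rich half.
transcript = "ATATATATGCGCGCGC"
# Observed fragments drawn only from the AT-rich half (a strongly biased sample).
ratios = gc_bias_ratios(transcript, observed_starts=[0, 1, 2, 3, 4], frag_len=4)
```

In this sketch, ratios that depart from 1 flag GC bins that are over- or under-sampled in a given library. A bias-aware quantifier would fold such sample-specific effects into its abundance estimates instead of assuming uniform coverage.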

**Figure 1: Quantification of transcript abundance from RNA-seq experiments.**

**Figure 2: Problems with current transcript abundance estimation methods.**

**Figure 3: Modeling and correcting fragment sequence bias.**

Accession codes

Accessions

Gene Expression Omnibus

BC011380

NCBI Reference Sequence

Acknowledgements

The authors are grateful for helpful suggestions from Y. Benjamini, W. Huber, N. Lahens, L. Pinello, C. Meyer, R. Patro, Z. Xu, and Y. Li. M.I.L. was supported by NIH grant 5T32CA009337-35. J.B.H. was supported by NIH R01 grant HG005220, the National Institute of Neurological Disorders and Stroke (5R01NS054794-08 to J.B.H.), and the Defense Advanced Research Projects Agency (DARPA-D12AP00025, to John Harer, Duke University). R.A.I. was supported by NIH R01 grant HG005220.

Author information

Authors and Affiliations

Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
Michael I Love & Rafael A Irizarry
Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, Massachusetts, USA
Michael I Love & Rafael A Irizarry
Department of Pharmacology, Institute for Translational Medicine and Therapeutics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA
John B Hogenesch

Contributions

M.I.L. and R.A.I. designed the method. M.I.L., J.B.H., and R.A.I. wrote the manuscript.

Corresponding author

Correspondence to Rafael A Irizarry.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Tables 1–5 and Supplementary Figures 1–25 (PDF 6827 kb)

Supplementary Note (PDF 376 kb)

Supplementary Code (ZIP 30 kb)

About this article

Cite this article

Love, M., Hogenesch, J. & Irizarry, R. Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nat Biotechnol 34, 1287–1291 (2016). https://doi.org/10.1038/nbt.3682

Received: 01 September 2015
Accepted: 22 August 2016
Published: 26 September 2016
Issue Date: December 2016
DOI: https://doi.org/10.1038/nbt.3682
