A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium

This article has been updated

Abstract

We present primary results from the Sequencing Quality Control (SEQC) project, coordinated by the US Food and Drug Administration. Examining Illumina HiSeq, Life Technologies SOLiD and Roche 454 platforms at multiple laboratory sites using reference RNA samples with built-in controls, we assess RNA sequencing (RNA-seq) performance for junction discovery and differential expression profiling and compare it to microarray and quantitative PCR (qPCR) data using complementary metrics. At all sequencing depths, we discover unannotated exon-exon junctions, with >80% validated by qPCR. We find that measurements of relative expression are accurate and reproducible across sites and platforms if specific filters are used. In contrast, RNA-seq and microarrays do not provide accurate absolute measurements, and gene-specific biases are observed for all examined platforms, including qPCR. Measurement performance depends on the platform and data analysis pipeline, and variation is large for transcript-level profiling. The complete SEQC data sets, comprising >100 billion reads (10Tb), provide unique resources for evaluating RNA-seq analyses for clinical and regulatory settings.

Figure 1: The SEQC (MAQC-III) project and experimental design.
Figure 2: Gene detection and junction discovery.
Figure 3: Sensitivity, specificity and reproducibility of differential expression calls.
Figure 4: Built-in truths for assessing RNA-seq.
Figure 5: Cross-platform agreement of expression levels.
Figure 6: Multiple performance metrics for the quantification of genes and alternative transcripts.

Accession codes

Primary accessions

Gene Expression Omnibus

Change history

  • 09 September 2014

    In the version of this article initially published online, the superscript 95 for the footnote for “these authors contributed equally to this work” was omitted for the first three authors. The error has been corrected for the print, PDF and HTML versions of this article.

Acknowledgements

All SEQC (MAQC-III) participants freely donated their time and reagents for the completion and analyses of the project. Many participants contributed to the sometimes-heated discussions on the topic of this paper during numerous e-mail exchanges, teleconferences and face-to-face project meetings. The common conclusions and recommendations reported in this paper evolved from this extended discourse. The authors gratefully acknowledge support by the National Center for Biotechnology Information (NCBI)'s Supercomputing Center, the FDA's Supercomputing Center, China's National Supercomputing Center of Tianjin, the Vienna Scientific Cluster High Performance Computing Facility (VSC), the Vienna Science and Technology Fund (WWTF), Baxter, the Austrian Institute of Technology, and the Austrian Centre of Biopharmaceutical Technology. This work was supported in part by China's Program of Global Experts. This work was supported in part by the US National Institutes of Health (NIH) grants R01CA163256, R01HG006798, R01NS076465, R44HG005297, U54CA119338, PO1HG00205, R24GM102656 and the Intramural Research Program of the NIH, National Library of Medicine, National Institute of Environmental Health Sciences (NIEHS) Z01 ES102345-04, Shriners Research Grant 85500, an Australia National Health and Medical Research Council (NH&MRC) Project grant (1023454) and Victorian State Government Operational Infrastructure Support (Australia), the National 973 Key Basic Research Program of China (2010CB945401), the National Natural Science Foundation of China (31240038 and 31071162), and the Science and Technology Commission of Shanghai Municipality (11DZ2260300). We greatly appreciate SAS Institute, Inc. for kindly hosting several face-to-face meetings of the SEQC (MAQC-III) project.

Author information

Contributions

Project coordination: US Food and Drug Administration.

Project lead: Weida Tong & Leming Shi.

Manuscript lead: David P. Kreil.

Scientific management: David P. Kreil, Christopher E. Mason, Weida Tong & Leming Shi.

Next-generation sequencing technology lead: Christopher E. Mason.

The following authors contributed to project leadership: Zhenqiang Su, Paweł P. Łabaj, Sheng Li, Jean Thierry-Mieg, Danielle Thierry-Mieg, Wei Shi, Charles Wang, Gary P. Schroth, Robert A. Setterquist, John F. Thompson, Wendell D. Jones, Wenzhong Xiao, Weihong Xu, Roderick V. Jensen, Reagan Kelly, Joshua Xu, Ana Conesa, Cesare Furlanello, Hanlin Gao, Huixiao Hong, Nadereh Jafari, Stan Letovsky, Yang Liao, Fei Lu, Edward J. Oakeley, Zhiyu Peng, Craig A. Praul, Javier Santoyo-Lopez, Andreas Scherer, Tieliu Shi, Gordon K. Smyth, Frank Staedtler, Peter Sykacek, Xin-Xing Tan, E. Aubrey Thompson, Jo Vandesompele, May D. Wang, Jian Wang, Russell D. Wolfinger, Jiri Zavadil, Weida Tong, David P. Kreil, Christopher E. Mason & Leming Shi.

The following authors contributed equally to this work: Zhenqiang Su, Paweł P. Łabaj & Sheng Li.

Corresponding authors

Correspondence to David P. Kreil, Christopher E. Mason or Leming Shi.

Ethics declarations

Competing interests

Some of the SEQC (MAQC-III) Consortium members are employed by companies that provide services or manufacture products or equipment related to gene expression profiling, as can be seen from the affiliations provided by the manuscript authors.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–46, Supplementary Tables 1–15 and Supplementary Notes (PDF 24302 kb)

Supplementary Data 1

RNA-seq read coverage flanking all 250 candidate junctions considered for validation. (TXT 870 kb)

Supplementary Data 2

Employed qPCR primer sequences, qPCR results and expression level estimates, as well as the corresponding RNA-seq expression level estimates for the 173 performed assays. (XLS 121 kb)

Supplementary Data 3

Supplementary Data 3 (ZIP 38371 kb)

Supplementary Protocols (PDF 1467 kb)

Cite this article

SEQC/MAQC-III Consortium., Su, Z., Łabaj, P. et al. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol 32, 903–914 (2014). https://doi.org/10.1038/nbt.2957
