Normalization of RNA-sequencing (RNA-seq) data has proven essential to ensure accurate inference of expression levels. Here, we show that commonly used normalization approaches mostly account for sequencing depth and fail to correct for library preparation and other, more complex unwanted technical effects. We evaluate the performance of the External RNA Control Consortium (ERCC) spike-in controls and investigate the possibility of using them directly for normalization. We show that the spike-ins are not reliable enough to be used in standard global-scaling or regression-based normalization procedures. We propose a normalization strategy, called remove unwanted variation (RUV), that adjusts for nuisance technical effects by performing factor analysis on suitable sets of control genes (e.g., ERCC spike-ins) or samples (e.g., replicate libraries). Our approach leads to more accurate estimates of expression fold-changes and more accurate tests of differential expression than state-of-the-art normalization methods. In particular, RUV promises to be valuable for large collaborative projects involving multiple laboratories, technicians, and/or sequencing platforms.
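To make the control-gene idea concrete, the following is a minimal Python sketch of factor analysis on control genes. It is not the authors' implementation. It assumes, for illustration only, that the analysis is done on log-transformed counts, that the factor analysis step is a truncated singular value decomposition, and that the unwanted factors are removed by ordinary least squares. The function name `ruv_control_gene_sketch` and the parameters `control_idx`, `k`, and `pseudocount` are hypothetical.

```python
import numpy as np


def ruv_control_gene_sketch(counts, control_idx, k=1, pseudocount=1.0):
    """Illustrative control-gene adjustment (a sketch, not the published method).

    counts      : (samples x genes) array of non-negative read counts
    control_idx : indices of genes assumed unaffected by the biology of interest
    k           : number of unwanted factors to estimate (assumed known here)
    """
    # Work on the log scale so multiplicative technical effects become additive.
    log_y = np.log(np.asarray(counts, dtype=float) + pseudocount)

    # Center each gene across samples so the factors capture variation only.
    centered = log_y - log_y.mean(axis=0, keepdims=True)

    # Factor analysis step (here a plain SVD) restricted to the control genes:
    # any variation in the controls is assumed to be unwanted.
    u, s, _ = np.linalg.svd(centered[:, control_idx], full_matrices=False)
    w = u[:, :k] * s[:k]  # estimated unwanted factors, one row per sample

    # Regress every gene on the estimated factors and subtract the fitted part.
    alpha, *_ = np.linalg.lstsq(w, centered, rcond=None)
    corrected = log_y - w @ alpha
    return corrected, w
```

A call such as `ruv_control_gene_sketch(counts, control_idx=spike_in_columns, k=1)` returns adjusted log-scale values together with the estimated factors, where `spike_in_columns` is a hypothetical index array marking the control genes. In a differential expression analysis the estimated factors could instead be included as covariates in the model rather than subtracted beforehand; which route is preferable depends on the downstream method.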