Normalization of RNA-seq data using factor analysis of control genes or samples

Journal name: Nature Biotechnology
Volume: 32
Pages: 896–902
DOI: 10.1038/nbt.2931

Abstract

Normalization of RNA-sequencing (RNA-seq) data has proven essential to ensure accurate inference of expression levels. Here, we show that usual normalization approaches mostly account for sequencing depth and fail to correct for library preparation and other more complex unwanted technical effects. We evaluate the performance of the External RNA Control Consortium (ERCC) spike-in controls and investigate the possibility of using them directly for normalization. We show that the spike-ins are not reliable enough to be used in standard global-scaling or regression-based normalization procedures. We propose a normalization strategy, called remove unwanted variation (RUV), that adjusts for nuisance technical effects by performing factor analysis on suitable sets of control genes (e.g., ERCC spike-ins) or samples (e.g., replicate libraries). Our approach leads to more accurate estimates of expression fold-changes and tests of differential expression compared to state-of-the-art normalization methods. In particular, RUV promises to be valuable for large collaborative projects involving multiple laboratories, technicians, and/or sequencing platforms.
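
One way to picture the factor-analysis step is the minimal sketch below. It assumes log-transformed counts, a plain singular value decomposition of the control genes, and an ordinary least-squares adjustment. The function name `ruvg_sketch`, its parameters, and the pseudocount are illustrative choices made here; this is not the authors' RUVSeq implementation.

```python
import numpy as np


def ruvg_sketch(counts, control_idx, k=1, pseudocount=1.0):
    """Illustrative control-gene adjustment (hypothetical helper).

    counts      : array of shape (samples, genes) with read counts
    control_idx : indices of genes assumed unaffected by the biology of interest
    k           : number of unwanted factors to estimate (assumption)
    Returns the adjusted log counts and the estimated factors W.
    """
    Y = np.log(np.asarray(counts, dtype=float) + pseudocount)
    # Center each gene so the factors capture variation, not mean expression.
    Yc = Y - Y.mean(axis=0)
    # Factor analysis step, here a plain SVD restricted to the control genes.
    U, s, Vt = np.linalg.svd(Yc[:, control_idx], full_matrices=False)
    W = U[:, :k]
    # Regress every gene on the estimated factors and remove the fitted part.
    alpha, *_ = np.linalg.lstsq(W, Yc, rcond=None)
    return Y - W @ alpha, W
```

Because the factors are estimated only from genes assumed to carry no biological signal, whatever they capture is treated as nuisance variation and subtracted from all genes. The quality of the adjustment therefore depends on how well the control set satisfies that assumption.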

Figures

  1. Unwanted variation in the SEQC RNA-seq data set.

    (a) Scatterplot matrix of the first three principal components (PCs) for unnormalized counts (log scale, centered). The principal components are orthogonal linear combinations of the original 21,559-dimensional gene expression profiles, with successively maximal variance across the 128 samples, that is, the first principal component is the weighted average of the 21,559 gene expression measures that provides the most separation between the 128 samples. Each point corresponds to one of the 128 samples. The four sample A and the four sample B libraries are represented by different shades of blue and red, respectively (16 replicates per library). Circles and triangles represent samples sequenced in the first and second flow-cells, respectively. As expected for the SEQC data set, the first principal component is driven by the extreme biological difference between sample A and sample B. The second and third principal components clearly show library preparation effects (the samples cluster by shade) and, to a lesser extent, flow-cell effects reflecting differences in sequencing depths (within each shade, the samples cluster by shape). (b) Same as a, for upper-quartile (UQ)-normalized counts. UQ normalization removes flow-cell effects (the circles and triangles now cluster together), but not library preparation effects. All normalization procedures other than RUV behave similarly to UQ (Supplementary Fig. 1).

  2. Unwanted variation in the zebrafish RNA-seq data set.

    (a) Boxplots of relative log expression (RLE) for unnormalized counts. Purple: treated libraries (Trt); green: control libraries (Ctl). We expect RLE distributions to be centered around zero and as similar as possible to each other. The RLE boxplots clearly show the need for normalization. (The bottom and top of the box indicate, respectively, the first and third quartiles; the inside line indicates the median; the whiskers are located at 1.5 times the inter-quartile range (IQR) above and below the box.) (b) Same as a, for upper-quartile-normalized counts. UQ normalization centers RLE around zero, but fails to remove the excessive variability of library 11. (c) Scatterplot of first two principal components for unnormalized counts (log scale, centered). Libraries do not cluster as expected according to treatment. (d) Same as c, for UQ-normalized counts. UQ normalization does not lead to better clustering of the samples. All normalization procedures other than RUV behave similarly to UQ (Supplementary Figs. 2 and 3).

  3. RUVg normalization using in silico empirical control genes.

    (a) For the SEQC data set, scatterplot matrix of first three principal components after RUVg normalization (log scale, centered). RUVg adjusts for library preparation effects (cf. Fig. 1), while retaining the sample A versus B difference. (b) For the SEQC data set, empirical cumulative distribution function (ECDF) of P-values for tests of differential expression between sample A replicates (given a value x, the ECDF at x is simply defined as the proportion of P-values ≤ x). We expect no differential expression and P-values to follow a uniform distribution, with ECDF as close as possible to the identity line. This is clearly not the case for unnormalized (gray line) and upper-quartile-normalized (red) counts; only with RUVg (purple) do P-values behave as expected. (c) For the zebrafish data set, boxplots of RLE for RUVg-normalized counts. RUVg shrinks the expression measures for library 11 toward the median across libraries, suggesting robustness against outliers. (The bottom and top of the box indicate, respectively, the first and third quartiles; the inside line indicates the median; the whiskers are located at 1.5 times the inter-quartile range above and below the box.) (d) For the zebrafish data set, scatterplot of first two principal components for RUVg-normalized counts (log scale, centered). Libraries cluster as expected by treatment.

  4. Behavior of the ERCC spike-in controls.

    (a) For the SEQC data set, GLM regression coefficients of spike-in read counts on nominal concentrations. Each point corresponds to one of the 128 samples. The four sample A and the four sample B libraries are represented by different shades of blue and red, respectively (16 replicates per library). Circles and triangles represent samples sequenced in the first and second flow-cells, respectively. There are evident library preparation effects. (b) For the SEQC data set, the proportion of reads mapping to the spike-ins deviates markedly from the nominal value (dashed line). There are library preparation effects and troubling sample A versus B effects, which may bias the inference of differential expression. (c) For the zebrafish data set, the proportion of reads mapping to the spike-ins deviates markedly from the nominal value (dashed line). Again, there are library preparation and treatment effects (purple: treated libraries (Trt); green: control libraries (Ctl); data for the two runs of each library are displayed in adjacent bars). (d) For the zebrafish data set, mean-difference plot of unnormalized counts (log scale) for two control samples (library 5 versus library 1). The shading represents point density and spike-in counts are plotted using red symbols. The lines are the lowess robust local regression (ref. 29) fits for genes (black) and spike-ins (red). As expected, log-fold-changes are scattered around the horizontal zero line, indicating that most genes are equally expressed in the two control samples. The negative slope of the black line suggests the need for normalization. The difference between the two lowess fits indicates that, disturbingly, the spike-ins do not behave as the bulk of the genes.

  5. Using the ERCC spike-in controls for normalization, zebrafish data set.

    (a) Boxplots of RLE for cyclic loess-normalized counts (purple: treated libraries (Trt); green: control libraries (Ctl)). The expression measures are clearly not comparable across replicate libraries and cyclic loess based on the spike-ins is not effective at normalizing the counts. (b) Boxplots of RLE for RUVg-normalized counts. RUVg based on the spike-ins leads to much more reasonable RLE distributions, similar to those obtained using a set of empirical controls (Fig. 3c). (c) Mean-difference plot (MD-plot) for cyclic loess-normalized counts (log scale) for the same control samples as in Figure 4d. By shifting the spike-in log-fold-changes toward zero, cyclic loess normalization leads to a global shift of the gene log-fold-changes away from zero. For control samples, with no expected differential expression, cyclic loess normalization is likely to bias expression measures. (d) MD-plot for RUVg-normalized counts (log scale) for the same control samples as in Figure 4d. Log-fold-changes for both the spike-ins and the genes are scattered around the zero line, yielding more realistic expression measures than cyclic loess normalization.

  6. Impact of normalization on differential expression analysis.

    (a) For the SEQC data set, difference between RNA-seq and qRT-PCR estimates of sample A/sample B log-fold-changes, that is, bias in RNA-seq when viewing qRT-PCR as the gold standard. All RUV versions lead to unbiased log-fold-change estimates; cyclic loess (CL) normalization based on the ERCC spike-ins leads to severe bias. (b) For the SEQC data set, ROC curves using a set of 370 positive and 86 negative qRT-PCR controls as gold standard. RUVg (based on either empirical or spike-in controls) and UQ normalization perform slightly better than no normalization. UQ normalization based on spike-ins performs similarly to no normalization and CL normalization based on spike-ins performs the worst. (c) For the zebrafish data set, distribution of edgeR P-values for tests of differential expression between treated and control samples. UQ and CL normalization based on spike-ins lead to distributions far from the expected uniform. (d) For the zebrafish data set, heatmap of expression measures for the 61 genes found differentially expressed between control (Ctl) and treated (Trt) samples after UQ but not after RUVg normalization. Clustering of samples is driven by outlying library 11. (e) Heatmap of expression measures for the 475 genes found differentially expressed after RUVg but not after UQ normalization. Samples cluster as expected by treatment.
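
The diagnostics named in the figure legends can be sketched in a few lines. The definitions used here are assumptions stated for illustration: upper-quartile scaling divides each library by the 75th percentile of its nonzero counts, RLE is the log count minus the per-gene median log count across libraries, and the ECDF follows the definition given in the Figure 3 legend. All function names are hypothetical.

```python
import numpy as np


def upper_quartile_scale(counts):
    """Scale each library (row) by the upper quartile of its nonzero counts.

    Assumption: factors are rescaled to average 1 so counts keep their magnitude.
    """
    counts = np.asarray(counts, dtype=float)
    uq = np.array([np.percentile(row[row > 0], 75) for row in counts])
    factors = uq / uq.mean()
    return counts / factors[:, None]


def relative_log_expression(counts, pseudocount=1.0):
    """Log count minus the per-gene median log count across libraries."""
    log_counts = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return log_counts - np.median(log_counts, axis=0)


def ecdf(p_values, x):
    """Proportion of P-values less than or equal to x."""
    p = np.asarray(p_values, dtype=float)
    return float(np.mean(p <= x))
```

For libraries that differ only in sequencing depth, upper-quartile scaling makes the rows identical, so every RLE value becomes zero. Library preparation effects of the kind described in Figures 1 and 2 are gene-specific and are not removed by a single per-library scaling factor.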

Accession codes

Primary accessions

Gene Expression Omnibus

References

  1. Bullard, J., Purdom, E., Hansen, K. & Dudoit, S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11, 94 (2010).
  2. Risso, D., Schwartz, K., Sherlock, G. & Dudoit, S. GC-content normalization for RNA-Seq data. BMC Bioinformatics 12, 480 (2011).
  3. Dillies, M.-A. et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform. 14, 671–683 (2013).
  4. Robinson, M.D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).
  5. Hansen, K.D., Irizarry, R.A. & Wu, Z. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics 13, 204–216 (2012).
  6. Sun, Z. & Zhu, Y. Systematic comparison of RNA-Seq normalization methods using measurement error models. Bioinformatics 28, 2584–2591 (2012).
  7. Yang, Y.H. et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 30, e15 (2002).
  8. Oshlack, A., Emslie, D., Corcoran, L.M. & Smyth, G.K. Normalization of boutique two-color microarrays with a high proportion of differentially expressed probes. Genome Biol. 8, R2 (2007).
  9. Wu, D. et al. The use of miRNA microarrays for the analysis of cancer samples with global miRNA decrease. RNA 19, 876–888 (2013).
  10. Risso, D., Massa, M.S., Chiogna, M. & Romualdi, C. A modified LOESS normalization applied to microRNA arrays: a comparative evaluation. Bioinformatics 25, 2685–2691 (2009).
  11. Lovén, J. et al. Revisiting global gene expression analysis. Cell 151, 476–482 (2012).
  12. Baker, S.C. et al. The external RNA controls consortium: a progress report. Nat. Methods 2, 731–734 (2005).
  13. Jiang, L. et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res. 21, 1543–1551 (2011).
  14. Bolstad, B.M., Irizarry, R.A., Astrand, M. & Speed, T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193 (2003).
  15. Cleveland, W.S. & Devlin, S.J. Locally weighted regression: an approach to regression analysis by local fitting. JASA 83, 596–610 (1988).
  16. Qing, T., Yu, Y., Du, T. & Shi, L. mRNA enrichment protocols determine the quantification characteristics of external RNA spike-in controls in RNA-Seq studies. Sci. China Life Sci. 56, 134–142 (2013).
  17. SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat. Biotechnol. doi:10.1038/nbt.2957 (24 August 2014).
  18. Canales, R.D. et al. Evaluation of DNA microarray results with quantitative gene expression platforms. Nat. Biotechnol. 24, 1115–1122 (2006).
  19. Ferreira, T. et al. Silencing of odorant receptor genes by G Protein βγ signaling ensures the expression of one odorant receptor per olfactory sensory neuron. Neuron 81, 847–859 (2014).
  20. Gagnon-Bartsch, J. & Speed, T. Using control genes to correct for unwanted variation in microarray data. Biostatistics 13, 539–552 (2012).
  21. Gagnon-Bartsch, J., Jacob, L. & Speed, T.P. Removing unwanted variation from high dimensional data with negative controls. Tech. Rep. 820, Department of Statistics, University of California, Berkeley (2013).
  22. Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061–1068 (2008).
  23. ENCODE Project Consortium. The ENCODE (ENCyclopedia of DNA elements) project. Science 306, 636–640 (2004).
  24. Leek, J.T. & Storey, J.D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, 1724–1735 (2007).
  25. 't Hoen, P. et al. Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nat. Biotechnol. 31, 1015–1022 (2013).
  26. Jacob, L., Gagnon-Bartsch, J. & Speed, T.P. Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed. Tech. Rep. 818, Department of Statistics, University of California, Berkeley (2013).
  27. Tang, F., Lao, K. & Surani, M.A. Development and applications of single-cell transcriptome analysis. Nat. Methods 8, S6–S11 (2011).
  28. Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).
  29. Cleveland, W.S. Robust locally weighted regression and smoothing scatterplots. JASA 74, 829–836 (1979).
  30. Flicek, P. et al. Ensembl 2012. Nucleic Acids Res. 40, D84–D90 (2012).
  31. Trapnell, C., Pachter, L. & Salzberg, S.L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009).
  32. McCullagh, P. & Nelder, J. Generalized Linear Models (Chapman and Hall, New York, 1989).
  33. Listgarten, J., Kadie, C., Schadt, E.E. & Heckerman, D. Correction for hidden confounders in the genetic analysis of gene expression. Proc. Natl. Acad. Sci. USA 107, 16465–16470 (2010).
  34. Robinson, M.D., McCarthy, D.J. & Smyth, G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
  35. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
  36. Smyth, G.K. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3, 3 (2004).
  37. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628 (2008).
  38. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc., B 57, 289–300 (1995).
  39. Gentleman, R.C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).

Author information

Affiliations

  1. Department of Statistics, University of California, Berkeley, Berkeley, California, USA.

    • Davide Risso,
    • Terence P Speed &
    • Sandrine Dudoit
  2. Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, California, USA.

    • John Ngai
  3. Helen Wills Neuroscience Institute, University of California, Berkeley, Berkeley, California, USA.

    • John Ngai
  4. Functional Genomics Laboratory, University of California, Berkeley, Berkeley, California, USA.

    • John Ngai
  5. Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia.

    • Terence P Speed
  6. Department of Mathematics and Statistics, The University of Melbourne, Victoria, Australia.

    • Terence P Speed
  7. Division of Biostatistics, University of California, Berkeley, Berkeley, California, USA.

    • Sandrine Dudoit

Contributions

D.R., S.D. and T.P.S. developed the statistical methods; D.R. and S.D. analyzed the data; J.N. designed the zebrafish experiment; D.R. and S.D. wrote the manuscript; all authors read and approved the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (3,464 KB)

    Supplementary Figures 1–20 and Supplementary Table 1

Zip files

  1. Supplementary Software (138 KB)

    RUVSeq_0.1.1.tar.gz
