Opinion

Tackling the widespread and critical impact of batch effects in high-throughput data

Nature Reviews Genetics volume 11, pages 733–739 (2010)

Abstract

High-throughput technologies are widely used, for example to assay genetic variants, gene and protein expression, and epigenetic modifications. One often overlooked complication with such studies is batch effects, which occur because measurements are affected by laboratory conditions, reagent lots and personnel differences. This becomes a major problem when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. Using both published studies and our own analyses, we argue that batch effects (as well as other technical and biological artefacts) are widespread and critical to address. We review experimental and computational approaches for doing so.

Acknowledgements

We thank the referees for helpful comments and suggestions. One referee in particular went beyond the call of duty to help us improve clarity. We thank the TCGA and 1000 Genomes Project for making the data public. The GoKinD collection of DNA was genotyped through the Genetic Association Information Network (GAIN) programme with the support of the Foundation for the National Institutes of Health and The National Institute of Diabetes and Digestive and Kidney Diseases. The work of J.T.L., H.C.B., B.L. and R.A.I. is partially funded by US National Institutes of Health grants GM0083084, HG004059 and HG005220.

Author information

Affiliations

  1. Jeffrey T. Leek, Héctor Corrada Bravo, Benjamin Langmead and Rafael A. Irizarry are at the Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, Maryland 21205-2179, USA.
  2. Robert B. Scharpf is at the Department of Oncology, Johns Hopkins University, Baltimore, Maryland 21205-2013, USA.
  3. Héctor Corrada Bravo is also at the Department of Computer Science, University of Maryland, College Park, Maryland 20742, USA.
  4. David Simcha is at the Biomedical Engineering Department, Johns Hopkins University, 3400 N. Charles St, Baltimore, Maryland 21218, USA.
  5. W. Evan Johnson is at the Department of Statistics, Brigham Young University, Provo, Utah 84602-6575, USA.
  6. Donald Geman is at the Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland 21218-2682, USA.
  7. Keith Baggerly is at the Department of Bioinformatics and Computational Biology, The University of Texas M.D. Anderson Cancer Center, P.O. Box 301402, Houston, Texas 77230, USA.

Authors

  1. Jeffrey T. Leek
  2. Robert B. Scharpf
  3. Héctor Corrada Bravo
  4. David Simcha
  5. Benjamin Langmead
  6. W. Evan Johnson
  7. Donald Geman
  8. Keith Baggerly
  9. Rafael A. Irizarry

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Rafael A. Irizarry.

Supplementary information

PDF files

  1. Supplementary information S1 (box)

    The data sets and analysis used to construct Table 1 are described here.

Glossary

Confounded

An extraneous variable (for example, processing date) is said to be confounded with the outcome of interest (for example, disease state) when it correlates both with the outcome and with an independent variable of interest (for example, gene expression).
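
As an illustration of this definition, the following minimal sketch simulates a single feature that has no true association with disease state but does carry a batch offset. The offset size, noise level, sample counts and all function names are hypothetical choices made for the example, not values from the article. When every case is processed in one batch and every control in the other, the batch offset shows up as an apparent case-control difference; when the design is balanced across batches, it cancels.

```python
import random
from statistics import mean

random.seed(0)
BATCH_SHIFT = 2.0  # hypothetical technical offset added to every sample in batch 2
NOISE_SD = 0.5     # hypothetical measurement noise


def measure(batch):
    """Simulate one feature with NO true disease effect, only a batch shift."""
    shift = BATCH_SHIFT if batch == 2 else 0.0
    return 10.0 + shift + random.gauss(0.0, NOISE_SD)


def case_control_difference(design):
    """design: list of (status, batch) pairs; returns mean(case) - mean(control)."""
    values = [(status, measure(batch)) for status, batch in design]
    cases = [v for s, v in values if s == "case"]
    controls = [v for s, v in values if s == "control"]
    return mean(cases) - mean(controls)


# Confounded design: every case in batch 2, every control in batch 1.
confounded = [("case", 2)] * 20 + [("control", 1)] * 20
# Balanced design: cases and controls split evenly across both batches.
balanced = ([("case", 1)] * 10 + [("case", 2)] * 10
            + [("control", 1)] * 10 + [("control", 2)] * 10)

diff_confounded = case_control_difference(confounded)
diff_balanced = case_control_difference(balanced)
print(f"confounded design: apparent difference {diff_confounded:.2f}")
print(f"balanced design:   apparent difference {diff_balanced:.2f}")
```

In the confounded design the apparent difference should sit near the simulated batch offset, even though disease state has no effect on the feature.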

Feature

The general name given to the measurement unit in high-throughput technologies. Examples of features include probes for the genes represented on a microarray, mass-to-charge (m/z) ratios for which intensities are measured in mass spectrometry, and loci for which coverage is reported for sequencing technologies.

Hierarchical clustering

A statistical method in which objects (for example, gene expression profiles for different individuals) are grouped into a hierarchy, which is visualized in a dendrogram. Objects close to each other in the hierarchy, measured by tracing the branch heights, are also close by some measure of distance — for example, individuals with similar expression profiles will be close together in terms of branch lengths.
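
A minimal sketch of agglomerative clustering follows, assuming Euclidean distance and average linkage (both are illustrative choices; the glossary entry does not prescribe a distance or linkage). The four expression profiles and their names are invented so that samples differ more by batch than by case or control status, which is the pattern that suggests a batch effect when it appears in a dendrogram.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering.

    Returns the merges in order as (members_a, members_b, height),
    where height is the average pairwise distance at which they join.
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance over all pairs drawn from the two clusters.
                d = sum(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical profiles: batch membership dominates disease status.
profiles = {
    "case_batch1": [1.0, 1.1, 0.9],
    "control_batch1": [1.1, 1.0, 1.0],
    "case_batch2": [3.0, 3.1, 2.9],
    "control_batch2": [3.2, 3.0, 3.0],
}

merges = average_linkage(profiles)
for left, right, height in merges:
    print(f"{left} + {right} at height {height:.3f}")
```

Here the two low branches each pair a case with a control from the same batch, and the batches join only at a much greater height.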

Linear models

Statistical models in which the effects of independent variables and error terms are expressed as additive terms. For example, when modelling the outcomes in a case–control study, the effect of a typical case is added to the typical control level. Variation around these levels is explained by additive error. Linear models motivate many widely used statistical methods, such as t-tests and analysis of variance. Many popular genomics software tools are also based on linear models.
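
The additive structure can be sketched with a small least-squares fit. The example below assumes a model of the form measurement = baseline + case effect + batch effect, with noise-free invented data so that the arithmetic is easy to follow; the coefficient values, the sample layout and the helper name `fit_least_squares` are all hypothetical. It contrasts the naive case-control difference with the case effect estimated when batch is included as an additive term.

```python
from statistics import mean


def fit_least_squares(design, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    p = len(design[0])
    # Augmented matrix [X'X | X'y].
    aug = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           + [sum(row[i] * yi for row, yi in zip(design, y))]
           for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(p):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [x - factor * z for x, z in zip(aug[k], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


# (case, batch) indicators; most cases were run in batch 1 (coded as 1).
samples = [(0, 0), (0, 0), (0, 0), (1, 0), (0, 1), (1, 1), (1, 1), (1, 1)]
# Hypothetical truth: baseline 5.0, case effect 1.5, batch effect 2.0.
y = [5.0 + 1.5 * case + 2.0 * batch for case, batch in samples]

# Naive comparison ignores batch entirely.
naive = (mean(v for (c, _), v in zip(samples, y) if c == 1)
         - mean(v for (c, _), v in zip(samples, y) if c == 0))

# Linear model with intercept, case indicator and batch indicator.
design = [[1.0, float(case), float(batch)] for case, batch in samples]
intercept, case_effect, batch_effect = fit_least_squares(design, y)

print(f"naive case-control difference: {naive:.2f}")
print(f"case effect adjusted for batch: {case_effect:.2f}")
```

Adjustment of this kind only works when batch and outcome are not perfectly confounded; if every case were in one batch, the two indicator columns would be identical and the fit would have no unique solution.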

Normalization

Methods used to adjust measurements so that they can be appropriately compared among samples. For example, gene expression levels measured by quantitative PCR are typically normalized to one or more housekeeping genes or ribosomal RNA. In microarray analysis, methods such as quantile normalization manipulate global characteristics of the data.
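
Quantile normalization, mentioned above, can be sketched in a few lines. This version assumes every sample has the same number of features and breaks ties by position rather than averaging them, which is a simplification; the function name and the toy values are invented for illustration. Each sample's values are replaced by the across-sample mean of the values at the same rank, so all samples end up with an identical distribution.

```python
def quantile_normalize(samples):
    """samples: dict mapping sample name -> list of feature values.

    Assumes equal-length lists; ties are broken by position (a simplification).
    """
    n_features = len(next(iter(samples.values())))
    sorted_values = [sorted(v) for v in samples.values()]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(s[k] for s in sorted_values) / len(sorted_values)
                 for k in range(n_features)]
    normalized = {}
    for name, values in samples.items():
        # Feature indices ordered from smallest to largest value.
        order = sorted(range(n_features), key=lambda i: values[i])
        out = [0.0] * n_features
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized[name] = out
    return normalized


raw = {
    "sample_1": [2.0, 4.0, 6.0],
    "sample_2": [30.0, 10.0, 20.0],
}
normalized = quantile_normalize(raw)
print(normalized)
# sample_1 -> [6.0, 12.0, 18.0], sample_2 -> [18.0, 6.0, 12.0]
```

Because this adjusts only global characteristics of each sample, it does not by itself remove batch effects that act differently on different features.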

Principal components

Patterns in high-dimensional data that explain a large percentage of the variation across features. The top principal component is the most ubiquitous pattern in a set of high-dimensional data. Principal components are sometimes called eigengenes when estimated from microarray gene expression data.
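
The top principal component can be approximated with power iteration on the feature covariance matrix, as in the rough sketch below. Power iteration is used here only to keep the example dependency-free; in practice a singular value decomposition routine would normally be used. The data are invented: two samples per batch, with the second batch shifted upwards on every feature, so the top component is expected to track batch.

```python
import math


def top_principal_component(data, iterations=200):
    """Return (loadings, scores, explained) for the top principal component.

    data: list of samples, each a list of feature values.
    """
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Eigenvalue (variance along v) and fraction of total variance explained.
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
    explained = eigenvalue / sum(cov[i][i] for i in range(p))
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores, explained


# Hypothetical data: first two samples from batch 1, last two from batch 2.
data = [
    [1.0, 2.0, 3.0],
    [1.2, 1.9, 3.1],
    [3.0, 4.1, 5.0],
    [3.1, 3.9, 5.2],
]
loadings, scores, explained = top_principal_component(data)
print("scores:", [round(s, 2) for s in scores])
print(f"fraction of variance explained: {explained:.3f}")
```

Comparing the scores with known processing variables, such as batch or date, is one way to check whether the dominant pattern in a data set is technical rather than biological.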

About this article

DOI

https://doi.org/10.1038/nrg2825