Tackling the widespread and critical impact of batch effects in high-throughput data

Leek, Jeffrey T.; Scharpf, Robert B.; Bravo, Héctor Corrada; Simcha, David; Langmead, Benjamin; Johnson, W. Evan; Geman, Donald; Baggerly, Keith; Irizarry, Rafael A.

doi:10.1038/nrg2825

Opinion
Published: 14 September 2010

Tackling the widespread and critical impact of batch effects in high-throughput data

Jeffrey T. Leek¹,
Robert B. Scharpf²,
Héctor Corrada Bravo^1,3,
David Simcha⁴,
Benjamin Langmead¹,
W. Evan Johnson⁵,
Donald Geman⁶,
Keith Baggerly⁷ &
…
Rafael A. Irizarry¹

Nature Reviews Genetics volume 11, pages 733–739 (2010)Cite this article

64k Accesses
1214 Citations
180 Altmetric
Metrics details

Subjects

Abstract

High-throughput technologies are widely used, for example to assay genetic variants, gene and protein expression, and epigenetic modifications. One often overlooked complication with such studies is batch effects, which occur because measurements are affected by laboratory conditions, reagent lots and personnel differences. This becomes a major problem when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. Using both published studies and our own analyses, we argue that batch effects (as well as other technical and biological artefacts) are widespread and critical to address. We review experimental and computational approaches for doing so.

You have full access to this article via your institution.

Download PDF

scGPT: toward building a foundation model for single-cell multi-omics using generative AI

Article 26 February 2024

Genome-wide association studies

Article 26 August 2021

Leveraging functional genomic annotations and genome coverage to improve polygenic prediction of complex traits within and between ancestries

Article Open access 30 April 2024

Main

Many technologies used in biology — including high-throughput ones such as microarrays, bead chips, mass spectrometers and second-generation sequencing — depend on a complicated set of reagents and hardware, along with highly trained personnel, to produce accurate measurements. When these conditions vary during the course of an experiment, many of the quantities being measured will be simultaneously affected by both biological and non-biological factors. Here we focus on batch effects, a common and powerful source of variation in high-throughput experiments.

Batch effects are sub-groups of measurements that have qualitatively different behaviour across conditions and are unrelated to the biological or scientific variables in a study. For example, batch effects may occur if a subset of experiments was run on Monday and another set on Tuesday, if two technicians were responsible for different subsets of the experiments or if two different lots of reagents, chips or instruments were used. These effects are not exclusive to high-throughput biology and genomics research¹, and batch effects also affect low-dimensional molecular measurements, such as northern blots and quantitative PCR. Although batch effects are difficult or impossible to detect in low-dimensional assays, high-throughput technologies provide enough data to detect and even remove them. However, if not properly dealt with, these effects can have a particularly strong and pervasive impact. Specific examples have been documented in published studies^2,3 in which the biological variables were extremely correlated with technical variables, which subsequently led to serious concerns about the validity of the biological conclusions^4,5.

Normalization is a data analysis technique that adjusts global properties of measurements for individual samples so that they can be more appropriately compared. Including a normalization step is now standard in data analysis of gene expression experiments⁶. But normalization does not remove batch effects, which affect specific subsets of genes and may affect different genes in different ways. In some cases, these normalization procedures may even exacerbate technical artefacts in high-throughput measurements, as batch and other technical effects violate the assumptions of normalization methods. Although specific normalization methods have been developed for microarray studies that take into account study design⁷ or otherwise correct for the batch problem⁸, they are still not widely used.

As described here, in surveying a large body of published data involving high-throughput studies and a number of technology platforms, we have found evidence of batch effects. In many cases we have found that these can lead to erroneous biological conclusions, supporting the conclusions of previous publications^4,5. Here we analyse the extent of the problem, review the critical downstream consequences of batch effects and describe experimental and computational solutions to reduce their impact on high-throughput data.

An illustration of batch effects

To introduce the batch effect problem we used data from a previously published bladder cancer study⁹ (Fig. 1). In this study, microarray expression profiling was used to examine the gene expression patterns in superficial transitional cell carcinoma (sTCC) with and without surrounding carcinoma in situ (CIS). Hierarchical cluster analysis separated the sTCC samples according to the presence or absence of CIS. However, the presence or absence of CIS was strongly confounded with processing date, as reported in Ref. 10. In high-throughput studies it is typical for global properties of the raw data distribution to vary strongly across arrays, as they do for this data set (Fig. 1a). After normalization, these global differences are greatly reduced (Fig. 1b), and the sTCC study properly normalized the data. However, it is typical to observe substantial batch effects on subsets of specific genes that are not addressed by normalization (Fig. 1c). In gene expression studies, the greatest source of differential expression is nearly always across batches rather than across biological groups, which can lead to confusing or incorrect biological conclusions owing to the influence of technical artefacts. For example, the control samples in the sTCC study clustered perfectly by the processing date (Fig. 1d), and the processing date was confounded with the presence/absence status.

**Figure 1: Demonstration of normalization and surviving batch effects.**

Group and date are only surrogates

The processing of samples using protocols that differ among laboratories has been linked to batch effects. In such cases, the samples that have been processed using the same protocol are known as processing groups. For example, multiple laboratory comparisons of microarray experiments have shown strong laboratory-specific effects¹¹. In addition, in nearly every gene expression study, large variations are associated with the processing date¹², and in microarray studies focusing on copy number variation, large effects are associated with DNA preparation groups¹³. The processing group and date are therefore commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates for other sources of variation, such as ozone levels, laboratory temperatures and reagent quality^12,14. Unfortunately, many possible sources of batch effects are not recorded, and data analysts are left with just processing group and date as surrogates.

One way to quantify the affect of non-biological variables is to examine the principal components of the data. Principal components are estimates of the most common patterns that exist across features. For example, if most genes in a microarray study are differentially expressed with respect to cancer status, the first principal component will be highly correlated with cancer status. Principal components capture both biological and technical variability and, in some cases, principal components can be estimated after the biological variables have been accounted for¹⁵. In this case, the principal components primarily quantify the effects of artefacts on the high-throughput data. Principal components can be compared to known variables, such as processing group or time. If the principal components do not correlate with these known variables, there may be an alternative, unmeasured source of batch effects in the data.

Examination of public data

In addition to the example described above involving the sTCC study, we examined the extent of batch effects for eight other published or publicly available data sets (Table 1) using the following approach. First, we identified a surrogate for batch effects (such as date or processing group) for each data set. We then used simple linear models to measure the level of confounding between this surrogate and the study outcome (for example, case or control) when available. Note that the more confounding there is, the more likely it is that batch variability can be confused with biological variability. We then summarized the susceptibility to batch effects for each high-throughput feature. We did this by quantifying the association between observed values and the surrogates using analysis of variance models. If the association p-value was below 0.01, we declared the feature as susceptible to batch effects. Next, we identified the principal components that were most correlated with the surrogate and with the outcome, again using analysis of variance models. Finally, we identified associations between the feature measurements (for example, copy number and gene expression levels) and the outcome of interest for each study. The outcomes of this analysis are described below.

Table 1 Batch effects seen for a range of high-throughput technologies

Full size table

We re-examined data sets from three studies for which batch effects have been reported (Table 1). The first was the sTCC study⁹ described above. The second was a microarray data set from a study examining population differences in gene expression². The conclusion of the original paper, that the expression of 1,097 of 4,197 genes differs between populations of European and Asian descent, was questioned in another publication⁴ because the populations and processing dates were highly correlated. In fact, more differences were found when comparing data from two processing dates while keeping the population fixed. The third was a mass spectrometry data set that was used to develop a statistical procedure, based on proteomic patterns in serum, to distinguish neoplastic diseases from non-neoplastic diseases within the ovary³. Concerns about the conclusions of this paper were raised in another publication, which showed that outcome was confounded with run date⁵.

To further illustrate the ubiquity and potential hazards associated with batch effects, we also carried out analyses on representative publicly available data sets that have been established using a range of high-throughput technologies. In addition to the three studies described above, we examined: data from a study of copy number variation in HapMap populations¹⁶; a study of copy number variation in a genome-wide association study of bipolar disorder¹⁷; gene expression from an ovarian cancer study from The Cancer Genome Atlas (TCGA)¹⁸ produced using two platforms (Affymetrix and Agilent); methylation data from the same TCGA ovarian cancer samples produced using Illumina BeadChips; and second-generation sequencing data from a study comparing unrelated HapMap individuals (these data were a subset of the data from the 1000 Genomes Project).

We found batch effects for all of these data sets, and substantial percentages (32.1–99.5%) of measured features showed statistically significant associations with processing date, irrespective of biological phenotype (Table 1). This suggests that batch effects influence a large percentage of the measurements from genomic technologies. Next, we computed the first five principal components of the feature data (the principal components were ordered by the amount of variability explained). Ideally these principal components would correlate with the biological variables of interest, as the principal components represent the largest sources of signal in the data. Instead, for all of the studied data sets, the surrogates for batch (date or processing group) were strongly correlated with one of the top principal components (Table 1). In general, the correlation with the top principal components was not as high for the biological outcome as it was for the surrogates. This suggests that technical variability was more influential than biological variability across a range of experimental conditions and technologies.

For most of the data sets examined, neither date nor biological factors was perfectly associated with the top principal components, suggesting that other unknown sources of batch variability are present. This implies that accounting for date or processing group might not be sufficient to capture and remove batch effects. For example, we did a further analysis of second-generation sequencing data that were generated by the 1000 Genomes Project (Fig. 2). We found that 32% of the features were associated with date but up to 73% were associated with the second principal component. Note that the principal components cannot be explained by biology because only 17% of the features are associated with the biological outcome.

**Figure 2: Batch effects for second-generation sequencing data from the 1000 Genomes Project.**

Downstream consequences

In the most benign cases, batch effects will lead to increased variability and decreased power to detect a real biological signal¹⁵. Of more concern are cases in which batch effects are confounded with an outcome of interest and result in misleading biological or clinical conclusions. An example of confounding is when all of the cases are processed on one day and all of the controls are processed on another. We have shown that in a typical high-throughput experiment, one can expect a substantial percentage of features to show statistically significant differences when comparing across batches, even when no real biological differences are present (Fig. 1; Table 1). Therefore, if one is not aware of the batch effect, a confounded experiment will lead to incorrect biological conclusions because results due to batch will be impossible to distinguish from real biological effects. As an example, we consider the proteomics study mentioned above³. These published results and further confirmation¹⁹ led to the development of a 'home-brew' diagnostic assay for ovarian cancer. However, in this study the biological variable of interest (neoplastic disease within the ovary) was extremely correlated with processing day⁵. Furthermore, batch effects were identified as a major driver of these results. Fortunately, objections raised after the assay was advertised led the US Food and Drug Administration to block use of the assay, pending further validation²⁰.

A more subtle consequence of the batch effect relates to correlations between features. These correlations are implicitly or explicitly used in various applications. For example, in systems biology, gene–gene expression correlations are used to analyse or predict pathways. Rank-based classification methods²¹ are another example of how correlations between features can be used. However, we find that batch effects are strong enough to change not only mean levels of gene expression between batches but also correlations and relative rankings between the expression of pairs of genes. In some cases, the direction of significant positive correlation between genes is completely reversed in different batches. For example, we found an effect of this type in our analysis of the gene expression data set from Ref. 2 (Table 1). Here, genes that show significant positive correlations in the direction of their gene expression changes in samples from one batch are significantly negatively correlated in samples from a second batch (Fig. 3).

**Figure 3: Batch effects also change the correlations between genes.**

If batch effects go undetected, they can lead to substantial misallocation of resources and lack of reproducibility²². In general, technology that has been developed for the prediction of clinical outcomes using data that show batch effects may produce results that are more variable than expected. Batch effects were shown to have strong adverse effects on predictors built with methods that are naive to these effects¹⁰; the result is lower-than-expected classification rates, which might put patients classified with these technologies at risk.

Experimental design solutions

The first step in addressing batch and other technical artefacts is careful study design²³. Experiments that run over long periods of time and large-scale experiments that are run across different laboratories are highly likely to be susceptible. But even smaller studies performed within a single laboratory may span several days or include personnel changes.

High-throughput experiments should be designed to distribute batches and other potential sources of experimental variation across biological groups⁸. For example, in a study comparing a molecular profile in tumour samples versus healthy controls, the tumour and healthy samples should be equally distributed between multiple laboratories and across different processing times²². These steps can help to minimize the probability of confounding between biological and batch effects.

However, even in a perfectly designed study, batch will strongly influence the measured high-throughput data. Information about changes in personnel, reagents, storage and laboratories should be recorded and passed onto data analysts. As it is generally impossible to record all potential sources of batch effects, statistical modelling solutions — as described below — are needed to reduce the impact of batch effects on biological conclusions.

Statistical solutions

After a high-throughput study has been performed, the statistical approach for dealing with batch effects consists of two key steps. Exploratory analyses must be carried out to identify the existence of batch effects and quantify their effect, as well as the effect of other technical artefacts in the data. Downstream statistical analyses must then be adjusted to account for these unwanted effects (Fig. 4).

**Figure 4: Key steps in the statistical analysis of batch effects.**

The first step in the exploratory statistical analysis of batch effects is to identify and quantify batch effects using principal components analysis or visualization techniques, such as hierarchical clustering dendrograms (Fig. 1d) or multidimensional scaling²⁴. Hierarchical clustering of samples²⁵ labelled both with biological groups and known batch surrogates reveals whether the major differences are due to biology or batch (Fig. 1d). It is also useful to plot the levels of individual features (the expression levels of specific genes, probes, proteins, and so on) versus biological variables and batch variables, such as processing group or time (Fig. 1c). Plotting individual features versus biological and known batch variables is crucial, as the bulk distribution of normalized data may seem correct even when batch effects exist (Fig. 1b). A useful way to summarize these feature-level effects is to calculate the principal components of the feature data²⁶. The principal components can also be plotted against known batch variables, such as processing group or time, to determine whether, on average, the high-dimensional feature data are correlated with batch. An example R script is included in the code and data for this article (see 'Further Information').

Strong batch effects may exist when: the samples cluster by processing group or time; a large number of features are highly associated with processing group or time; or principal components are associated with batch processing group or time. If strong batch effects exist, they must be accounted for in downstream statistical analyses.

Most downstream statistical analyses performed on high-throughput data rely on linear models, either explicitly or implicitly. However, there are also other solutions, such as those provided for copy number variation microarrays¹³ that do not use linear models. Because these latter solutions are typically specific to each application, we do not review them here. For analyses using linear models, batch effects can be modelled in one of two ways. If exploratory analyses and prior knowledge suggest that simple surrogates, such as processing time, capture all of the batch effects, these surrogates can be directly incorporated into the models that are used to compare groups. The simplest approach is to include processing group and time as variables in the linear model for association between the high-dimensional features and the outcome variables^8,12. See the ComBat website for a discussion of this approach.

In many cases, processing time is a useful surrogate but does not explain all of the technical artefacts and variability that are seen in high-throughput data. When the true sources of batch effects are unknown or cannot be adequately modelled with processing group or date, it may be more appropriate to use methods such as surrogate variable analysis (SVA)¹⁵ (see 'Further Information'). SVA estimates the sources of batch effects directly from the high-throughput data so that downstream significance analyses can be corrected. Variables estimated with SVA can then be incorporated into the linear model that relates the outcome to the high-dimensional feature data, in the same way as processing year or group could be included. An advantage of SVA is that surrogate variables are estimated instead of pre-specified, which means that the important potential batch variables do not have to be known in advance.

These approaches are most effective when batch effects are not highly confounded, or correlated, with the biological variables of interest. To identify potential sources of confounding, biological variables and sample characteristics can be compared to processing group and time. If the biological variables are highly correlated with processing group or time, it is difficult to determine whether observed differences across biological groups are due to biology or artefacts. At a minimum, analyses should report the processing group and time of all samples in a study along with the biological variables of interest so that results can be independently verified.

Conclusion

There has been substantial progress in identifying and accounting for batch effects, but substantial challenges remain. Foremost among these challenges is the need for consistent reporting of the most common potential sources of batch effects, including processing group and date. Experimental designs should also consistently distribute biological groups equally across processing groups and times. Close collaborations between laboratory biologists and data analysts are also needed so that the specific sources of batch effects can be isolated and the dependence on surrogates can be reduced. Targeted experiments may be necessary to determine the precise sources of non-biological signal for each specific technology. Finally, there is a need to incorporate adjustment for batch effects as a standard step in the analysis of high-throughput data analysis along with normalization, exploratory analysis and significance calculation.

References

Youden, W. J. Enduring values. Technometrics 14, 1–11 (1972).
Article Google Scholar
Spielman, R. S. et al. Common genetic variants account for differences in gene expression among ethnic groups. Nature Genet. 39, 226–231 (2007).
Article CAS Google Scholar
Petricoin, E. F. et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359, 572–577 (2002).
Article CAS Google Scholar
Akey, J. M., Biswas, S., Leek, J. T. & Storey, J. D. On the design and analysis of gene expression studies in human populations. Nature Genet. 39, 807–808; author reply 808–809 (2007).
Article CAS Google Scholar
Baggerly, K. A., Edmonson, S. R., Morris, J. S. & Coombes, K. R. High-resolution serum proteomic patterns for ovarian cancer detection. Endocr. Relat. Cancer 11, 583–584; author reply 585–587 (2004).
Article CAS Google Scholar
Allison, D. B., Cui, X. Q., Page, C. P. & Sabripour, M. Microarray data analysis: from disarray to consolidation and consensus. Nature Rev. Genet. 7, 55–65 (2006).
Article CAS Google Scholar
Mecham, B. H., Nelson, P. S. & Storey, J. D. Supervised normalization of microarrays. Bioinformatics 26, 1308–1315 (2010).
Article CAS Google Scholar
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Article Google Scholar
Dyrskjot, L. et al. Gene expression in the urinary bladder: a common carcinoma in situ gene expression signature exists disregarding histopathological classification. Cancer Res. 64, 4040–4048 (2004).
Article CAS Google Scholar
Zilliox, M. J. & Irizarry, R. A. A gene expression bar code for microarray data. Nature Methods 4, 911–913 (2007).
Article CAS Google Scholar
Irizarry, R. A. et al. Multiple-laboratory comparison of microarray platforms. Nature Methods 2, 345–350 (2005).
Article CAS Google Scholar
Scherer, A. Batch Effects and Noise in Micorarray Experiments: Sources and Solutions (ed. Scherer, A.) (John Wiley and Sons, Chichester, UK, 2009).
Book Google Scholar
Scharpf, R. B. et al. A multilevel model to address batch effects in copy number estimation using SNP arrays. Biostatistics 12 Jul 2010 (doi:10.1093/biostatistics/kxq043).
Article Google Scholar
Fare, T. L. et al. Effects of atmospheric ozone on microarray data quality. Anal. Chem. 75, 4672–4675 (2003).
Article CAS Google Scholar
Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161 (2007).
Article Google Scholar
The International HapMap Consortium. The International HapMap Project. Nature 426, 789–796 (2003).
Dick, D. M. et al. Genomewide linkage analyses of bipolar disorder: a new sample of 250 pedigrees from the National Institute of Mental Health Genetics Initiative. Am. J. Hum. Genet. 73, 107–114 (2003).
Article CAS Google Scholar
The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061–1068 (2008).
Conrads, T. P. et al. High-resolution serum proteomic features for ovarian cancer detection. Endocr. Relat. Cancer 11, 163–178 (2004).
Article CAS Google Scholar
Ransohoff, D. F. Lessons from controversy: ovarian cancer screening and serum proteomics. J. Natl Cancer Inst. 97, 315–319 (2005).
Article CAS Google Scholar
Liu, H. C. et al. Cross-generation and cross-laboratory predictions of Affymetrix microarrays by rank-based methods. J. Biomed. Inform. 41, 570–579 (2008).
Article Google Scholar
Baggerly, K. A., Coombes, K. R. & Neeley, E. S. Run batch effects potentially compromise the usefulness of genomic signatures for ovarian cancer. J. Clin. Oncol. 26, 1186–1187; author reply 1187–1188 (2008).
Article Google Scholar
Hu, J., Coombes, K. R., Morris, J. S. & Baggerly, K. A. The importance of experimental design in proteomic mass spectrometry experiments: some cautionary tales. Brief. Funct. Genomic. Proteomic. 3, 322–331 (2005).
Article CAS Google Scholar
Cox, M. A. A. & Cox, T. F. in Handbook of Data Visualization (ed. Chen, C.-H., Härdle, W. K. & Unwin, A.) 315–347 (Springer, Berlin, 2008).
Book Google Scholar
Sokal, R. R. & Smeath, P. H. A. Principles of Numerical Taxonomy (WH Freeman, San Francisco, 1963).
Google Scholar
Alter, O., Brown, P. O. & Botstein, D. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA 97, 10101–10106 (2000).
Article CAS Google Scholar
Irizarry, R. A. et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264 (2003).
Article Google Scholar
Bolstad, B. M., Irizarry, R. A., Astrand, M. & Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193 (2003).
Article CAS Google Scholar

Download references

Acknowledgements

We thank the referees for helpful comments and suggestions. One referee in particular went beyond the call of duty to help us improve clarity. We thank the TCGA and 1000 Genomes Project for making the data public. The GoKinD collection of DNA was genotyped through the Genetic Association Information Network (GAIN) programme with the support of the Foundation for the National Institutes of Health and The National Institute of Diabetes and Digestive and Kidney Diseases. The work of J.T.L., H.C.B., B.L. and R.A.I. is partially funded by US National Institutes of Health grants GM0083084, HG004059 and HG005220.

Author information

Authors and Affiliations

Benjamin Langmead and Rafael A. Irizarry are at the Department of Biostatistics, Jeffrey T. Leek, Hector Corrada-Bravo, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, Maryland 21205-2179, USA.,
Jeffrey T. Leek, Héctor Corrada Bravo, Benjamin Langmead & Rafael A. Irizarry
Robert B. Scharpf is at the Department of Oncology, Johns Hopkins University, Baltimore, Maryland 21205-2013, USA.,
Robert B. Scharpf
Héctor Corrada Bravo is also at the Department of Computer Science, University of Maryland, College Park, Maryland 20742, USA.,
Héctor Corrada Bravo
David Simcha is at the Biomedical Engineering Department, Johns Hopkins University, 3400 N. Charles St, Baltimore, Maryland 212218, USA.,
David Simcha
W. Evan Johnson is at the Department of Statistics, Brigham Young University, Provo, Utah 84602-6575, USA.,
W. Evan Johnson
Donald Geman is at the Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland 21218-2682, USA.,
Donald Geman
Keith Baggerly is at the Department of Bioinformatics and Computational Biology, The University of Texas M.D. Anderson Cancer Center, P. O. Box 301402, Houston, Texas 77230, USA.,
Keith Baggerly

Authors

Jeffrey T. Leek
View author publications
You can also search for this author in PubMed Google Scholar
Robert B. Scharpf
View author publications
You can also search for this author in PubMed Google Scholar
Héctor Corrada Bravo
View author publications
You can also search for this author in PubMed Google Scholar
David Simcha
View author publications
You can also search for this author in PubMed Google Scholar
Benjamin Langmead
View author publications
You can also search for this author in PubMed Google Scholar
W. Evan Johnson
View author publications
You can also search for this author in PubMed Google Scholar
Donald Geman
View author publications
You can also search for this author in PubMed Google Scholar
Keith Baggerly
View author publications
You can also search for this author in PubMed Google Scholar
Rafael A. Irizarry
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Rafael A. Irizarry.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary information S1 (box)

The data sets and analysis used to construct Table 1 are described here. (PDF 227 kb)

Glossary

Confounded: An extraneous variable (for example, processing data) is said to be confounded with the outcome of interest (for example, disease state) when it correlates both with the outcome and with an independent variable of interest (for example, gene expression).
Feature: The general name given to the measurement unit in high-throughput technologies. Examples of features include probes for the genes represented on microarray, mass-to-charge (m/z) ratios for which intensities are measured in mass spectrometry, and loci for which coverage is reported for sequencing technologies.
Hierarchical clustering: A statistical method in which objects (for example, gene expression profiles for different individuals) are grouped into a hierarchy, which is visualized in a dendrogram. Objects close to each other in the hierarchy, measured by tracing the branch heights, are also close by some measure of distance — for example, individuals with similar expression profiles will be close together in terms of branch lengths.
Linear models: Statistical models in which the effect of independent variables and error terms are expressed as additive terms. For example, when modelling the outcomes in a case–control study, the effect of a typical case is added to the typical control level. Variation around these levels is explained by additive error. Linear models motivate many widely used statistical methods, such as t-tests and analysis of variance. Many popular genomics software tools are also based on linear models.
Normalization: Methods used to adjust measurements so that they can be appropriately compared among samples. For example, gene expression levels measured by quantitative PCR are typically normalized to one or more housekeeping genes or ribosomal RNA. In microarray analysis, methods such as quantile normalization manipulate global characteristics of the data.
Principal components: Patterns in high-dimensional data that explain a large percentage of the variation across features. The top principal component is the most ubiquitous pattern in a set of high-dimensional data. Principal components are sometimes called eigengenes when estimated from microarray gene expression data.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Leek, J., Scharpf, R., Bravo, H. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 11, 733–739 (2010). https://doi.org/10.1038/nrg2825

Download citation

Published: 14 September 2010
Issue Date: October 2010
DOI: https://doi.org/10.1038/nrg2825

This article is cited by

Ensemble methods of rank-based trees for single sample classification with gene expression profiles
- Min Lu
- Ruijie Yin
- X. Steven Chen
Journal of Translational Medicine (2024)
Genomic data integration tutorial, a plant case study
- Emile Mardoc
- Mamadou Dia Sow
- Jérôme Salse
BMC Genomics (2024)
scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data
- Ziqi Zhang
- Xinye Zhao
- Xiuwei Zhang
Nature Communications (2024)
Integrated analysis of diverse cancer types reveals a breast cancer-specific serum miRNA biomarker through relative expression orderings analysis
- Liyuan Ma
- Yaru Gao
- Hongdong Li
Breast Cancer Research and Treatment (2024)
Artificial intelligence in liver imaging: methods and applications
- Peng Zhang
- Chaofei Gao
- Xiaolong Qi
Hepatology International (2024)