High-throughput technologies are widely used, for example to assay genetic variants, gene and protein expression, and epigenetic modifications. One often overlooked complication with such studies is batch effects, which occur because measurements are affected by laboratory conditions, reagent lots and personnel differences. This becomes a major problem when batch effects are correlated with an outcome of interest, leading to incorrect conclusions. Using both published studies and our own analyses, we argue that batch effects (as well as other technical and biological artefacts) are widespread and critical to address. We review experimental and computational approaches for doing so.
We thank the referees for helpful comments and suggestions. One referee in particular went beyond the call of duty to help us improve clarity. We thank the TCGA and 1000 Genomes Project for making the data public. The GoKinD collection of DNA was genotyped through the Genetic Association Information Network (GAIN) programme with the support of the Foundation for the National Institutes of Health and The National Institute of Diabetes and Digestive and Kidney Diseases. The work of J.T.L., H.C.B., B.L. and R.A.I. is partially funded by US National Institutes of Health grants GM0083084, HG004059 and HG005220.
The data sets and analysis used to construct Table 1 are described here.
- Confounded

An extraneous variable (for example, processing date) is said to be confounded with the outcome of interest (for example, disease state) when it correlates both with the outcome and with an independent variable of interest (for example, gene expression).
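A minimal simulation (all numbers hypothetical) illustrates why confounding is dangerous: here a gene is shifted by batch alone, but because batch membership correlates with disease status, a naive case-versus-control comparison reports a spurious "disease effect".

```python
import random

random.seed(0)

# Hypothetical design: two processing batches of 50 samples each, where
# batch 0 holds mostly controls and batch 1 mostly cases -- so batch is
# confounded with disease status.
batch = [0] * 50 + [1] * 50
disease = [0] * 40 + [1] * 10 + [1] * 40 + [0] * 10

# A gene affected ONLY by batch (shifted +2 in batch 1), not by disease.
expr = [random.gauss(0.0, 1.0) + 2.0 * b for b in batch]

# A naive case-versus-control comparison picks up the batch shift:
mean_case = sum(e for e, d in zip(expr, disease) if d == 1) / disease.count(1)
mean_ctrl = sum(e for e, d in zip(expr, disease) if d == 0) / disease.count(0)
print(round(mean_case - mean_ctrl, 2))  # spurious "disease effect" (~1.2 expected)
```

With a balanced design (cases and controls spread evenly across batches), the same comparison would show no such shift.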
- Features

The general name given to the measurement unit in high-throughput technologies. Examples of features include probes for the genes represented on a microarray, mass-to-charge (m/z) ratios for which intensities are measured in mass spectrometry, and loci for which coverage is reported by sequencing technologies.
- Hierarchical clustering
A statistical method in which objects (for example, gene expression profiles for different individuals) are grouped into a hierarchy, which is visualized in a dendrogram. Objects close to each other in the hierarchy, measured by tracing the branch heights, are also close by some measure of distance — for example, individuals with similar expression profiles will be close together in terms of branch lengths.
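The merging logic behind hierarchical clustering can be sketched in a few lines. This toy average-linkage implementation (sample names and profiles invented for illustration; real analyses would use a library such as scipy.cluster.hierarchy) repeatedly merges the two closest clusters, recording the branch heights:

```python
def dist(a, b):
    """Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical expression profiles: A and B are similar, C is distinct.
profiles = {
    "sample_A": [1.0, 2.0, 1.5],
    "sample_B": [1.1, 2.1, 1.4],
    "sample_C": [5.0, 0.5, 3.0],
}

# Agglomerative clustering: repeatedly merge the two closest clusters.
clusters = [[name] for name in profiles]
merges = []
while len(clusters) > 1:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            # Average linkage: mean pairwise distance between members.
            d = sum(dist(profiles[a], profiles[b])
                    for a in clusters[i] for b in clusters[j])
            d /= len(clusters[i]) * len(clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    d, i, j = best
    merges.append((clusters[i] + clusters[j], round(d, 2)))
    clusters = [c for k, c in enumerate(clusters)
                if k not in (i, j)] + [clusters[i] + clusters[j]]

print(merges[0][0])  # the similar profiles A and B merge first
```

The recorded merge heights are what a dendrogram draws: A and B join at a low branch, and C joins them much higher up.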
- Linear models
Statistical models in which the effect of independent variables and error terms are expressed as additive terms. For example, when modelling the outcomes in a case–control study, the effect of a typical case is added to the typical control level. Variation around these levels is explained by additive error. Linear models motivate many widely used statistical methods, such as t-tests and analysis of variance. Many popular genomics software tools are also based on linear models.
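To make the case–control example concrete: fitting the linear model y = b0 + b1·group by least squares recovers exactly the group-mean difference that a t-test assesses (the measurements below are hypothetical):

```python
# Hypothetical measurements for a two-group comparison.
controls = [4.9, 5.1, 5.0, 4.8]
cases    = [6.0, 6.2, 5.9, 6.1]

y = controls + cases
x = [0] * len(controls) + [1] * len(cases)   # indicator: 1 = case

# Least-squares fit of y = b0 + b1 * x.
n = len(y)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

# b0 is the typical control level; b1 is the added case effect,
# which for a binary indicator equals the difference in group means.
print(round(b0, 2), round(b1, 2))  # -> 4.95 1.1
```

Batch-adjustment methods extend this same model with an additional additive term for batch.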
- Normalization

Methods used to adjust measurements so that they can be appropriately compared among samples. For example, gene expression levels measured by quantitative PCR are typically normalized to one or more housekeeping genes or ribosomal RNA. In microarray analysis, methods such as quantile normalization manipulate global characteristics of the data.
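The idea behind quantile normalization can be sketched directly: replace each value with the mean, across samples, of the values at the same rank, so every sample ends up with an identical distribution (toy data, two samples of three features):

```python
# Hypothetical data: s2 preserves the ordering of s1 but is brighter overall.
samples = {
    "s1": [2.0, 5.0, 3.0],
    "s2": [4.0, 10.0, 6.0],
}

names = list(samples)
n_feat = 3

# Mean value at each rank, taken across samples.
sorted_cols = [sorted(samples[s]) for s in names]
rank_means = [sum(col[r] for col in sorted_cols) / len(names)
              for r in range(n_feat)]

# Substitute each value with the mean for its rank.
normalized = {}
for s in names:
    ranks = sorted(range(n_feat), key=lambda i: samples[s][i])
    out = [0.0] * n_feat
    for r, i in enumerate(ranks):
        out[i] = rank_means[r]
    normalized[s] = out

print(normalized)  # both samples now share the same set of values
```

After normalization the two samples have identical distributions while each feature keeps its within-sample rank, which removes global brightness differences between arrays.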
- Principal components
Patterns in high-dimensional data that explain a large percentage of the variation across features. The top principal component is the most ubiquitous pattern in a set of high-dimensional data. Principal components are sometimes called eigengenes when estimated from microarray gene expression data.
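A small numerical sketch of the top principal component: in data where two features vary together (values invented), power iteration on the covariance matrix converges to the leading eigenvector, that is, the dominant pattern of variation:

```python
# Hypothetical measurements in which the second feature tracks roughly
# twice the first, so the dominant pattern has slope near 2.
data = [(1.0, 2.1), (2.0, 4.0), (3.0, 6.2), (4.0, 7.9)]

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]

# 2x2 covariance matrix of the centered data.
cxx = sum(x * x for x, _ in centered) / n
cyy = sum(y * y for _, y in centered) / n
cxy = sum(x * y for x, y in centered) / n

# Power iteration: repeatedly applying the covariance matrix converges
# on its leading eigenvector -- the top principal component.
vx, vy = 1.0, 0.0
for _ in range(50):
    vx, vy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
    norm = (vx * vx + vy * vy) ** 0.5
    vx, vy = vx / norm, vy / norm

print(round(vy / vx, 2))  # slope of the top component, close to 2
```

In microarray data the analogous leading direction across genes is the "eigengene"; when it aligns with processing date rather than biology, that is a warning sign of batch effects.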