Original Article

The Pharmacogenomics Journal (2010) 10, 336–346; doi:10.1038/tpj.2010.36

Batch effects in the BRLMM genotype calling algorithm influence GWAS results for the Affymetrix 500K array

K Miclaus1, R Wolfinger1, S Vega2, M Chierici3, C Furlanello3, C Lambert4, H Hong5, Li Zhang6, S Yin6 and F Goodsaid6

  1. SAS Institute, Cary, NC, USA
  2. Health Solutions Group, Microsoft, Redmond, WA, USA
  3. Fondazione Bruno Kessler, Trento, Italy
  4. Golden Helix, Bozeman, MT, USA
  5. National Center for Toxicological Research, FDA, Jefferson, AR, USA
  6. Center for Drug Evaluation and Research, FDA, Silver Spring, MD, USA

Correspondence: Dr K Miclaus, JMP Genomics, SAS Institute, 100 SAS Campus Drive, Cary, NC 27513, USA. E-mail: Kelci.Miclaus@jmp.com

Received 14 December 2009; Revised 23 March 2010; Accepted 26 April 2010.



Abstract

The Affymetrix GeneChip Human Mapping 500K array is commonly used for genome-wide association studies (GWASs). Recent findings highlight the importance of accurate genotype calling algorithms to reduce the inflation in Type I and Type II error rates. Differential results due to genotype calling errors can introduce severe bias in case–control association study results. We used data from the Wellcome Trust Case Control Consortium, in which 1991 individuals with coronary artery disease (CAD) and 1500 controls from the UK Blood Services (NBS) were genotyped on the Affymetrix 500K array. Different batch sizes and compositions were used in the Bayesian Robust Linear Model with Mahalanobis distance classifier (BRLMM) genotype calling algorithm to assess the batch effect on downstream association analysis. Results show that composition (cases and controls genotyped simultaneously or separately) and size (number of individuals processed by BRLMM at a time) can create 2–3% discordance in the results for quality control and statistical analysis and may contribute to the lack of reproducibility between GWASs. Changes in batch size are largely responsible for differential single-nucleotide polymorphism results, yet we observe evidence of an interactive effect of batch size and composition that contributes to discordant results in the list of significantly associated loci.


Keywords: genotype calling error; BRLMM calling algorithm; WTCCC; GWAS; association studies



Introduction

The advent of high-throughput genotyping technologies has facilitated many success stories for genome-wide association studies (GWASs) in finding genetic variants associated with common complex diseases. As we gain understanding about disease heritability, the size and scope of GWAS continue to grow; yet these studies often do not yield replicable results. One source of the lack of reproducibility in early GWAS was sample sizes too small to provide the power necessary to capture a moderate effect.1 Genome-wide association studies have since begun to sample thousands of individuals and adopt multi-tiered replication study designs. In large-scale studies, even minimal errors can introduce bias that inflates the Type I and Type II error rates, making adequate quality control (QC) essential to ensure reliability of GWAS results.2 Population stratification and genotype errors are two main sources of bias in GWAS and have been shown to result in inflation of the test statistic by over 11%, over half of which can be attributed to bias introduced by inaccurate genotype calls.3 Although clustering algorithms used to call genotypes continue to improve in accuracy,4, 5, 6 even minuscule errors can result in large biases.

The Affymetrix GeneChip Human Mapping 500K array set is a common genotyping platform used in GWAS.7, 8, 9, 10 Hence, the corresponding genotype calling algorithm recommended by Affymetrix (Santa Clara, CA, USA), Bayesian Robust Linear Model with Mahalanobis distance classifier (BRLMM),11 is one of the most widely used algorithms for producing genotype calls. BRLMM is a multi-chip clustering algorithm that assigns calls after a single-nucleotide polymorphism (SNP) robust multiarray average process of normalization, transformation, and summarization of probe level intensities for the ‘A’ and ‘B’ alleles. The ability to simultaneously call genotypes for a set of samples at multiple loci has been shown to improve accuracy due to better estimation of genotype (AA, AB, and BB) cluster centers.11 One issue with a multi-chip approach is the introduction of batch effects on the resulting genotype calls. Recent research has shown that discordant results can arise due to changes in both batch size (the number of samples processed simultaneously through BRLMM) and batch composition (including or separating cases and controls and/or heterogeneous samples from different populations in batch sets).3, 12, 13

Clayton et al.3 found that the clustering clouds for the three genotypes differed positionally for cases and controls, which can result in ambiguous genotype calls when samples are combined. Yet it is noted that calling cases and controls separately may also lead to overdispersion of the test statistic in association testing. Moreover, using high thresholds in QC and only calling genotypes with high confidence can lead to differential bias due to non-independence in genotypes set to missing. Another GWAS observed that the effects of differential bias due to genotype errors can be remedied with stringent QC (namely SNP call rate), but this QC resulted in excluding approximately 250000 SNPs from the Affymetrix 500K array in that study.14 Anney et al.15 reported Type I error rates of up to 59% in a simulated case–control association study when up to 5% non-random missing data was generated in the set of case subjects. Such findings indicate that although it may be inadvisable to call cases and controls together (due to differences in data collection, DNA source and preparation), calling cases and controls separately can lead to differential bias that inflates the test for association. Moreover, Plagnol et al.12 state that allowing different a priori frequencies for genotype clusters in calling algorithms between cases and controls conflicts with the null hypothesis of no association with genetic markers and trait status.

Hong et al.13 used the 270 HapMap samples for the Affymetrix 500K array set to evaluate batch size and composition effects on call rates and genotype call concordance. Batch sizes of 30, 45, and 90 individuals were shown to have concordance rates for both homozygous and heterozygous calls above 99.9% yet also showed significant differences in sample and SNP call-rate results. Similar findings were reported for batch composition in which batches were comprised of combinations of Asian, African, or European HapMap samples. A primary recommendation from the study was use of larger batch sizes and homogeneous (in terms of ancestral population background that is well known to have allele frequency differences) batch compositions. Such results highlight the fact that batch effects can significantly influence the outcome of a study and necessitate a better understanding of batch effect behavior in the BRLMM algorithm for GWASs with thousands of individuals.

In our study, we extend the evaluations of BRLMM batch effects and their implications for downstream analysis of a case–control study. The problem of uncertainty in genotype calls and variability due to SNP, sample, and batch variation is becoming well known and well documented; yet accounting for this variation is still difficult in a GWAS. Recent research in the development of new genotype calling algorithms is aimed at increasing the accuracy and quality of SNPs used in an analysis. For example, recent advances with the CRLMM (Corrected Robust Linear Model with Maximum Likelihood Classification) calling method5 propose quality metrics for SNPs and hybridization batch (or plate) quality metrics for evaluating SNP accuracy in CRLMM version 2.16 Even as the accuracy of calls increases and improved SNP quality filters are formed, the effects of uncertainty in genotype calls on downstream analysis are not well understood. One of our goals as part of the Microarray Quality Control Consortium Genome-Wide Association Working Group (MAQC GWAWG) is to explore and highlight the effects of genotype call errors, particularly those that are systematically introduced by batch effects, in GWASs for popular calling algorithms.

Using 3491 samples from the Wellcome Trust Case Control Consortium (WTCCC)7 study of coronary artery disease (CAD) for the Affymetrix 500K array set, we determined how batch size and composition affect the results of quality control and association testing in GWAS. (Notably, the batch composition here has a different interpretation from the study by Hong et al,13 as we are looking at cases and controls assumed to be from a homogeneous population as opposed to different major world populations). We show that both batch size and composition can change the results of GWAS in terms of the set of SNPs that pass QC and the set of SNPs determined to be significant. Through evaluation of subject-specific differences in the probability for a marker to pass QC and subject-specific differences in the probability for a SNP to be deemed significant, our results show that batch size and composition have both practical and statistically significant effects on GWAS.


Materials and methods

WTCCC data

The raw data were obtained from the Wellcome Trust Case Control Consortium, an organization of several research groups aimed at better understanding genetic variants and complex disease heritability through GWASs. The CEL files of the 1500 controls from the UK Blood Service Control Group and the 1991 CEL files of CAD cases on the Affymetrix GeneChip Human Mapping 500K array were used to form batches for processing in the BRLMM genotype calling algorithm. The Affymetrix 500K array set consists of two arrays, Nsp and Sty, with approximately 262000 and 238000 SNPs, respectively.

BRLMM genotype calling

BRLMM is a multi-chip genotype calling algorithm that estimates cluster centers and variances for each genotype (A/A, A/B, and B/B). The genotype call and its associated confidence score are then determined using the Mahalanobis distance to the cluster centers. The confidence score is derived from the ratio d1/d2, where d1 is the distance to the closest genotype cluster (the resulting call) and d2 is the distance to the next closest cluster. In this study, we refer to the confidence of the call as 1 − d1/d2, so that higher confidences indicate that the genotype call is more likely to be accurate. If this confidence falls below a certain threshold, the genotype call is set to missing.11 The ‘B’ in BRLMM is a Bayesian step that uses the dynamic model,4 a single-chip algorithm that calls genotypes one SNP at a time, on a sample of SNPs to estimate priors for the cluster centers and variances. The BRLMM algorithm has been shown to produce higher call rates and improved accuracy in comparison with the dynamic model, and is now the algorithm recommended by Affymetrix for their 100K and 500K array sets.
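The calling rule described in this subsection can be illustrated with a minimal Python sketch. It is not the BRLMM implementation: the function name `call_genotype`, the two-dimensional signal space, the identity covariances, and the use of unsquared Mahalanobis distances are all assumptions made for illustration. Only the confidence definition 1 − d1/d2 and the 0.5 threshold come from the text.

```python
import numpy as np

def call_genotype(x, centers, covs, threshold=0.5):
    """Return (call, confidence) for one sample at one SNP.

    x       : summarized allele signal for the sample (hypothetical 2-D space)
    centers : dict mapping genotype label -> cluster center
    covs    : dict mapping genotype label -> cluster covariance matrix
    """
    x = np.asarray(x, dtype=float)
    dists = {}
    for genotype, center in centers.items():
        diff = x - np.asarray(center, dtype=float)
        # Mahalanobis distance from the sample to this genotype cluster
        dists[genotype] = float(np.sqrt(diff @ np.linalg.inv(covs[genotype]) @ diff))
    ranked = sorted(dists, key=dists.get)
    d1, d2 = dists[ranked[0]], dists[ranked[1]]
    confidence = 1.0 - d1 / d2          # confidence as defined in the text
    call = ranked[0] if confidence >= threshold else None  # None = missing call
    return call, confidence

# Toy cluster parameters, invented for the example
centers = {"AA": (1.0, 0.0), "AB": (0.0, 0.0), "BB": (-1.0, 0.0)}
covs = {g: np.eye(2) for g in centers}

print(call_genotype((0.9, 0.0), centers, covs))   # close to the AA center
print(call_genotype((0.45, 0.0), centers, covs))  # between AA and AB: set to missing
```

Because the cluster centers and covariances are estimated from all samples in a batch, the same sample can receive a different confidence, and therefore a different call, when the batch changes.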

The BRLMM calling was performed using the Affymetrix Power Tools, version 1.10.2, which is available for download from http://www.affymetrix.com. The BRLMM algorithm is implemented in the ‘apt-probeset-genotype’ application of the package. All default parameters were used, including a confidence threshold of 0.5 for reporting a SNP call as missing, except those that control batch size, which were set to match our experimental design. Chip description files for the Nsp and Sty arrays were also downloaded from the Affymetrix website and used for genotype calling. Genotype calls were made for the Nsp and Sty arrays separately, and tab-delimited text files of calls were written for each array and batch.

Batch effect schema

Several runs of the BRLMM algorithm were performed under an experimental design intended to estimate batch effects due to differences in size and composition (that is, the number of CEL files processed simultaneously and the case–control status of the samples processed within a batch). Three levels of batch size were used: approximately 500, 2000, and 3500 samples in each batch. Two levels of batch composition were used: separated (S), with cases and controls in different batches, and combined (C), with cases and controls randomly assigned to batches at a 1.25:1 ratio of cases to controls. For the final analysis, five different data sets of genotype calls for the 3491 samples were used and denoted as follows: C500 contained genotypes called by BRLMM in seven batches, each consisting of 285 cases and 215 controls, for a combined batch composition of size 500; C2000 genotypes were called in two batches of 995 cases and 750 controls; C3500 was created with one single batch of the 3491 samples; S500 used four batches of 500 cases and three batches of 500 controls; and S2000 consisted of one batch of 1991 cases and one batch of 1500 controls. Note that because the 1991 cases do not divide evenly, batches in the affected scenarios may have contained one sample more or fewer. This design allows for the quantification of the effects of both batch size and composition, as well as their interaction. Because including C3500 makes the design unbalanced, this data set is not used in the formal statistical models that test for significant changes in QC exclusion and significant test results, although it is used to elucidate trends in results due to increasing batch size.
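The five batch layouts can be sketched in Python as follows. This is only an illustration of the design, not the procedure actually used to assign CEL files: the `LAYOUTS` table, the helper names, the sample identifiers, and the random seed are hypothetical, and the near-equal split means batch sizes differ by a sample or two from the round numbers quoted above.

```python
import random

def split_evenly(samples, n_batches, rng):
    """Shuffle the samples and deal them into n_batches near-equal groups."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [shuffled[i::n_batches] for i in range(n_batches)]

# Hypothetical layout table: label -> (number of case groups, number of control groups).
# For combined (C) designs the case and control groups are merged pairwise.
LAYOUTS = {
    "C500": (7, 7),
    "C2000": (2, 2),
    "C3500": (1, 1),
    "S500": (4, 3),
    "S2000": (1, 1),
}

def make_batches(cases, controls, label, seed=0):
    """Return the list of batches (lists of sample IDs) for one batch set."""
    rng = random.Random(seed)
    n_case_groups, n_control_groups = LAYOUTS[label]
    case_groups = split_evenly(cases, n_case_groups, rng)
    control_groups = split_evenly(controls, n_control_groups, rng)
    if label.startswith("C"):
        # Combined: every batch holds cases and controls in roughly the overall ratio
        return [a + b for a, b in zip(case_groups, control_groups)]
    # Separated: each batch holds only cases or only controls
    return case_groups + control_groups

cases = [f"case{i}" for i in range(1991)]
controls = [f"ctrl{i}" for i in range(1500)]
for label in LAYOUTS:
    print(label, [len(batch) for batch in make_batches(cases, controls, label)])
```

Each batch would then be passed to the genotype caller as one run, so that cluster parameters are estimated from that batch alone.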

Quality control and association analysis methods

Each text file produced by BRLMM was imported into JMP Genomics Statistical Discovery Software from SAS Institute (Cary, NC, USA) for analysis. The genotype data for all batches over the Nsp and Sty arrays were merged and formatted into a wide data set with rows corresponding to individuals and columns to markers. The SNP annotation was downloaded from http://www.affymetrix.com/support/technical/annotationfilesmain.affx.

For each data set, the following QC steps were carried out. To eliminate low-quality chips, individuals with a call rate less than 97% were excluded. Furthermore, individuals were dropped if their average heterozygosity was below 23% or exceeded 30% (an empirical threshold used in the original data analysis by the WTCCC). The SNP quality control consisted of three steps. First, markers with either a minor allele frequency less than 1% or a call rate of less than 95% were excluded. Second, the remaining markers were filtered using a χ² test of significant differences in the proportion of missing data between cases and controls. Third, single-nucleotide polymorphisms that failed the test for Hardy–Weinberg equilibrium (HWE) in the controls were eliminated from analysis. Significant results for the test of a trend in missing data between cases and controls and for the HWE test were determined using α=5.7 × 10⁻⁷, an empirical threshold used by the original WTCCC study. See Figure 1 for counts of individuals and loci excluded for the C500, C2000, C3500, S500, and S2000 data.
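A minimal sketch of the three SNP-level QC steps for a single marker is given below. It is not the JMP Genomics workflow used in the study. The genotype coding (0, 1, 2 copies of the B allele, NaN for a missing call), the function names, and the choice of a one-degree-of-freedom χ² test for HWE are assumptions; the text does not state which form of the HWE test was applied.

```python
import math
import numpy as np

def chi2_pvalue_1df(stat):
    """Upper-tail P-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def missingness_pvalue(called, is_case):
    """2x2 chi-square test of missing vs called against case vs control."""
    a = int(np.sum(~called & is_case))    # missing, case
    b = int(np.sum(called & is_case))     # called, case
    c = int(np.sum(~called & ~is_case))   # missing, control
    d = int(np.sum(called & ~is_case))    # called, control
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:                      # e.g. no missing calls at all
        return 1.0
    n = a + b + c + d
    return chi2_pvalue_1df(n * (a * d - b * c) ** 2 / margins)

def hwe_pvalue(geno):
    """Chi-square HWE test on genotypes coded 0/1/2 (NaN entries are ignored)."""
    obs = np.array([np.sum(geno == k) for k in (0, 1, 2)], dtype=float)
    n = obs.sum()
    if n == 0:
        return 1.0
    p = (obs[1] + 2 * obs[2]) / (2 * n)   # B allele frequency
    if p == 0.0 or p == 1.0:
        return 1.0
    expected = n * np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])
    return chi2_pvalue_1df(float(np.sum((obs - expected) ** 2 / expected)))

def snp_passes_qc(geno, is_case, maf_min=0.01, call_min=0.95, alpha=5.7e-7):
    """Apply the call-rate/MAF, differential-missingness, and HWE filters in order."""
    geno = np.asarray(geno, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    called = ~np.isnan(geno)
    if called.mean() < call_min:
        return False
    freq_b = geno[called].sum() / (2 * called.sum())
    if min(freq_b, 1 - freq_b) < maf_min:
        return False
    if missingness_pvalue(called, is_case) < alpha:
        return False
    return hwe_pvalue(geno[~is_case]) >= alpha   # HWE is tested in controls only

# Toy marker: 1000 samples, alternating case/control, genotypes in exact HWE
geno = np.array([0] * 490 + [1] * 420 + [2] * 90, dtype=float)
is_case = np.arange(1000) % 2 == 0
print(snp_passes_qc(geno, is_case))
```

Because the call-rate filter depends on which calls were set to missing, the same marker can pass in one batch set and fail in another even though the underlying intensities are identical.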

Figure 1.

Results of quality control for excluding individuals (top) and single-nucleotide polymorphisms (SNPs; bottom) from association analysis for each of the five data sets.


The SNPs that passed QC were tested for significant association with disease status (CAD) by the Cochran–Armitage trend test for additive allele effects in each of the five data sets. A SNP was classified as significant if the P-value from the χ² test with one degree of freedom was less than 5.0 × 10⁻⁷ (a commonly used threshold for uncorrected P-values1, 7), and the SNPs that differed between batch sets were compared. To evaluate stringent QC measures, a second association analysis was performed for SNPs that passed all previous QC steps and had a call rate of at least 99%. The set of SNPs that were significant at α=5.0 × 10⁻⁷ was again evaluated for differences among the five batch sets.
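For readers unfamiliar with the trend test, the following Python sketch computes the standard Cochran–Armitage statistic with additive scores 0, 1, 2 and its one-degree-of-freedom P-value. The function name and the genotype counts are invented for illustration; the study itself ran the test in JMP Genomics.

```python
import math

def trend_test(case_counts, control_counts, scores=(0, 1, 2)):
    """Cochran-Armitage trend test on genotype counts (AA, AB, BB).

    Returns (chi-square statistic with 1 df, P-value).
    """
    r = list(case_counts)
    n = [a + b for a, b in zip(case_counts, control_counts)]
    R, N = sum(r), sum(n)                      # number of cases, total samples
    sum_xr = sum(x * ri for x, ri in zip(scores, r))
    sum_xn = sum(x * ni for x, ni in zip(scores, n))
    sum_xxn = sum(x * x * ni for x, ni in zip(scores, n))
    denom = R * (N - R) * (N * sum_xxn - sum_xn ** 2)
    if denom == 0:                             # monomorphic marker or a single group
        return 0.0, 1.0
    stat = N * (N * sum_xr - R * sum_xn) ** 2 / denom
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-square, 1 df
    return stat, p_value

# Invented counts: cases (10, 20, 30) and controls (30, 20, 10) for AA, AB, BB
stat, p_value = trend_test((10, 20, 30), (30, 20, 10))
print(stat, p_value)
print("significant at 5.0e-7:", p_value < 5.0e-7)
```

For these toy counts the statistic is 20.0, which is a strong signal by conventional standards yet still short of the 5.0 × 10⁻⁷ threshold; a few changed or missing calls near such a boundary are enough to flip the significance decision.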

Generalized linear mixed models to test batch effects

Generalized linear mixed models (GLMMs) were fitted using the NLMIXED procedure in SAS/STAT, version 9.2, to estimate subject-specific batch size, composition, and size-by-composition effects, both for modeling the probability of a SNP passing quality control and for modeling the probability of a SNP being deemed significant. The model is given as

$$\operatorname{logit}\{P(Y = 1 \mid \gamma)\} = X\beta + Z\gamma$$

where X is the design matrix for fixed effects (batch size and composition) and Z is the design matrix for the random effect, SNP (to account for the four responses at each unique SNP that was genotyped in the four batch sets C500, S500, C2000, and S2000). The SNP random effect, γi, is assumed to follow a normal (0, σs²) distribution, where σs² is the covariance within each unique SNP. A logit link function was used to model the probability that a marker passes QC depending on the batch effects (Model I) in which the response is coded as

$$Y_1 = \begin{cases} 1 & \text{if the SNP is excluded by QC} \\ 0 & \text{if the SNP passes QC} \end{cases}$$

Model II models the probability of an SNP being deemed significant given a batch size and composition using the response Y2 coded as

$$Y_2 = \begin{cases} 1 & \text{if the SNP is deemed significant} \\ 0 & \text{otherwise} \end{cases}$$

For Model II, SNPs that did not pass QC in a batch set were set to missing for Y2 and if the SNP was excluded from all batch sets by QC it was not used in the model.



Results

Five data sets of genotype calls were generated under different levels of batch size and composition for the WTCCC data (see Materials and methods section): C500, C2000, C3500, S500, and S2000, where the data set label conveys the batch composition (combined or separate) and size (500, 2000, and 3500) used in the BRLMM algorithm. These data sets were subjected to identical QC steps and association testing. Results of QC and testing were used in generalized linear mixed models to test whether batch size and composition (and their interaction) significantly changed the results of both QC SNP exclusion and significant disease association.

Quality control

Quality control to exclude both individuals and markers before association testing was carried out for all the data sets. Figure 1 shows workflows of the number of individuals (top) and number of SNPs (bottom) excluded at each QC step. The resulting number of individuals and SNPs given in the far right boxes for each data set was subsequently used for SNP association testing. For individual exclusion, there is a slight trend in more individuals being excluded with a call rate less than 97% as batch size increases for both combined and separate batch compositions.

This trend also held for the QC step that excluded SNPs with a minor allele frequency less than 1% or a call rate less than 95%. As batch size increased, more SNPs were excluded due to low call rates (allele frequency differences were negligible; data not shown), indicating that the number of samples run simultaneously through BRLMM is negatively correlated with the confidence of the genotype call. (We define confidence as 1 − d1/d2, where d1/d2 is the ratio of the distance to the closest genotype cluster over the distance to the second closest cluster; this confidence is compared with the threshold that determines whether a SNP is assigned a genotype or set to missing.) The results of QC for SNPs show that, for batch composition, more SNPs tended to have significant differences in call rate between cases and controls when cases and controls were in separate batches. In the bottom chart in Figure 1, a larger proportion of SNPs were found to have highly significant differences in the proportion of missing data for cases vs controls, even after excluding markers with more than 5% missing calls overall. The number of SNPs that passed the QC steps indicates that, as batch size increased, our QC protocol excluded a larger number of SNPs for both levels of batch composition.

The batch effects not only influenced the number of SNPs that were excluded by QC; the sets of excluded SNPs included thousands of discordant results. Although the number of SNPs dropped from a given batch set was approximately 110000–116000, only 107217 loci in common were excluded in all five data sets. The proportion of discordant SNPs due to QC between two batch sets is no greater than 2%, yet this translates to several thousand markers that were dropped due to QC for genotype calls from one batch size and composition formulation, but were analyzed for significant associations in another. The Venn diagram in Figure 2 shows the breakdown of all SNP counts that were either concordant or discordant across batch sets.
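The concordance bookkeeping behind these counts amounts to simple set operations, sketched below in Python. The function name and the SNP identifiers are made up for the example; with real data each set would hold the markers excluded by QC in one batch set.

```python
from itertools import combinations

def exclusion_overlap(excluded):
    """Summarize agreement between sets of QC-excluded SNPs.

    excluded : dict mapping batch-set label -> set of excluded SNP identifiers
    Returns (SNPs excluded in every batch set, discordant counts per pair).
    """
    common = set.intersection(*excluded.values())
    pairwise = {
        (a, b): len(excluded[a] ^ excluded[b])   # excluded in exactly one of the two
        for a, b in combinations(sorted(excluded), 2)
    }
    return common, pairwise

# Toy identifiers, invented for illustration
excluded = {
    "C500": {"snp1", "snp2", "snp3"},
    "C2000": {"snp1", "snp2", "snp4"},
    "S500": {"snp1", "snp3", "snp5"},
}
common, pairwise = exclusion_overlap(excluded)
print(sorted(common))
print(pairwise)
```

A marker counted in a pairwise discordance is one that was dropped before testing in one batch set but carried forward to association analysis in the other.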

Figure 2.

Five-way Venn diagram of the counts of single-nucleotide polymorphisms (SNPs) excluded from each data set as a result of the quality control (QC) process.


Association testing

Markers that survived QC were deemed significant using an α=5.0 × 10⁻⁷ threshold for the trend test for association with CAD status in each of the five batch sets. Figure 3 displays P-values for SNP calls from the C500, C2000, C3500, S500, and S2000 batch sets. The plot reports the −log10(P-value) across the genome, sorted by chromosome and Mb position, and aids in understanding the overall trend in the behavior of the P-values among the different batch sets. In all batch sets, several SNPs in the chromosome 9 region were highly significant (similar to findings in the WTCCC analysis). The results for the S500 batch set indicate a genome-wide trend of higher −log10(P-values), which is constant across all chromosomes. The C500 batch set also shows slightly inflated −log10(P-values) in comparison with the batch sets of C2000, S2000, and C3500 (which seems to have the lowest trend in magnitude). Figure 4 plots the pair-wise P-values for each batch set on the −log10 scale for all SNPs that passed QC in both data sets. The solid gray line indicates the 45° angle and the dotted gray lines fall at the α=5.0 × 10⁻⁷ level. Points that fall off the diagonal line indicate SNPs with discordant P-values, and any SNPs in either of the two off-diagonal rectangles boxed in by the dotted lines are those that result in differential significance decisions. There was very little discordance between the C2000 and C3500 batch sets, whereas S500 showed the most discordance in comparison with all other data sets. Overall, discordant P-values tended to be more significant in data with separated case–control batch composition and smaller batch sizes.

Figure 3.

Genome-wide scatter plots of P-values for each data set on the −log10 scale. The dotted line is the 5.0 × 10⁻⁷ significance threshold.


Figure 4.

Pair-wise comparisons of the −log10(P-values) between C500, C2000, C3500, S500, and S2000 data sets. Points that fall off the diagonal line indicate discordant results. The dotted line is the 5.0 × 10⁻⁷ significance threshold. Only markers that passed quality control (QC) in both data sets are plotted.


One of the main goals of this study was to evaluate how the list of SNPs that are deemed statistically significant is influenced by batch variations for genotype calling with the BRLMM algorithm. Figure 5 contains Venn diagrams representing the counts of SNPs with a P<5.0 × 10−7 for one batch set compared with results from another batch set. Five pair-wise comparisons were of interest to interrogate effects of batch size (C500 vs C2000, C500 vs C3500, and S500 vs S2000) and composition (C500 vs S500 and C2000 vs S2000). The areas of the circles that do not overlap show that with the same data (allele intensities), different batch processing can produce highly discordant lists of significant findings.

Figure 5.

Counts of concordant/discordant significant single-nucleotide polymorphisms (SNPs; P<5.0 × 10⁻⁷) among data sets. The Venn diagrams above the bars indicate the number of SNPs that were found significant in either or both data sources for pair-wise comparisons of: C500 vs C2000, C500 vs C3500, S500 vs S2000, C500 vs S500, and C2000 vs S2000. The bars below the Venn diagrams represent the count of discordant significant SNPs for that Venn (that is, the area in which the circles do not overlap), broken down by differences due to quality control (QC) exclusion or association testing. Bar color indicates the source of differential results and the color legend corresponds to the order of the comparison labeled on the x axis.


For each Venn diagram, the source of discordance (non-overlapping circles) can be categorized as due to QC (an SNP was excluded and hence missing in one data set but found significant in another) or due to differential test results (the SNP passed QC in both data sets but was only found to be significant in one batch set). The bars below the Venn diagrams represent the source of discordance for that pair-wise batch set comparison, categorized by color. Black corresponds to the SNPs that were missing (excluded due to QC) in the first data set in the comparison label but found significant in the second batch set. Light gray corresponds to SNPs found significant in the first batch set but QC excluded in the second. Dark gray indicates the significant SNPs in the second batch set that were not significant in the first. Finally, medium gray corresponds to SNPs found to be significant in the first batch set in the comparison but not in the second. Using the first bar on the left as an example, black (at the bottom of the bar) represents the count of SNPs that were found significant in C2000 but were excluded due to QC in C500 (that is, missing in the first data set listed on the x axis label); light gray (second color from the bottom) is the count of discordant SNPs that are significant in C500 but QC excluded in C2000; dark gray is the number of SNPs that were significant in C2000 but not in C500 (differential testing); and medium gray is the count of SNPs that were significant in C500 but not in C2000. The medium and light gray portions of the bar correspond to the SNP count in the left non-overlapping portion of the Venn diagram, whereas the black and dark gray portions add up to the count observed in the right non-overlapping portion.

The largest amount of overall discordance was found between C500 and S500, a comparison that also shows by far the largest amount of discordance due to association testing results. The comparison S500 vs S2000 shows the largest amount of discordance due to QC exclusion, in which many SNPs that were observed to be significant in S500 were excluded due to QC in S2000 (light gray). Comparisons of different batch sizes showed a much higher proportion of discordance due to QC exclusion, as opposed to test results, than did batch composition comparisons. The discordance for C500 vs C2000 and C500 vs C3500 showed similar patterns, indicating that an increase in the magnitude of the batch size differences beyond 2000 did not seem to influence results. While C500 vs S500 showed the most discordance, C2000 vs S2000 showed the least; hence the effect of batch composition was less severe (in terms of the significant SNP list results) for a larger batch size, which suggests an interactive effect of batch size and composition. Figure 5 indicates that BRLMM calls using larger batch sizes in which cases and controls are combined within batches lead to more conservative, yet more concordant, results for significant association.

Stringent call-rate threshold evaluation

Miyagawa et al14 observed that stringent data cleaning (particularly SNP call rate) could reduce the false positive rate to nominal levels (which eliminated over 50% of SNPs in the array). We observed that by raising the SNP call-rate threshold for passing QC to 99%, we could eliminate nearly all differentially significant SNP results in the five data sets. All concordant significant SNPs are found on chromosome 9 in the region of SNP rs1333049, the locus also reported by WTCCC. Of the two SNPs with discordant significance results, rs7865618 fell short of significance in S2000 by a slight margin, whereas rs16846351 was excluded due to QC in all batch configurations except S500. Although these results may suggest that appropriate data cleansing will eliminate the effects of genotyping errors, using a stringent SNP call-rate threshold forces 40–50% of the data to be excluded from analysis; this is not optimal in studies geared toward discovery. Under the additional QC constraint, a little under 100000 more SNPs did not pass QC for each data set, leaving SNP counts of 305984 for C500, 298382 for C2000, 297165 for C3500, 308473 for S500, and 301511 for S2000, which were used in association analysis.

Statistical tests for batch effects

A GLMM approach was used to estimate the statistical effect that batch size and composition have on QC and association testing in GWAS. In striving for a simpler balanced design without losing information on batch size and composition, the C3500 batch set is not included in the models. Figures 1 and 2 clearly show differences in the counts of SNPs that pass QC among the batch sets. To test those differences, we built a GLMM to model the probability of an SNP passing QC (Y1 coded as ‘0’ if the SNP passes QC and ‘1’ if the SNP is excluded in each data set) depending on different levels of batch size, composition, and size-by-composition. An SNP identifier is incorporated as a random effect to account for correlation due to the fact that the same SNP is genotyped in each of the batch sets. An additional GLMM was fitted to model the probability of an SNP being deemed significant (Y2=1) using α=1.0 × 10⁻⁵, a threshold that was used in the original WTCCC analysis as a second-tier cutoff for moderate significance that could be recommended for follow-up replication studies. We also judged this threshold to be a statistically appropriate cutoff by visual inspection of the quantile plots given in Figure 6, which show that differential divergence from the expected distribution of P-values for the batch sets begins around the α=1.0 × 10⁻⁵ level (corresponding to −log10(p)=5). We will refer to the GLMM for QC exclusion as Model I and the GLMM for SNP significance as Model II; see Materials and methods section for further details on the logistic models with batch size and composition as fixed effects and SNP as a random effect.

Figure 6.

Overlaid quantile plots show the distribution of P-values for each of the five data sets on the −log10 scale. The x axis plots the expected P-value under uniform distribution, whereas the y axis is the observed P-values. The black solid line represents the 45° angle.

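The coordinates of a quantile plot such as Figure 6 can be sketched as follows. This is a minimal illustration, assuming the common (rank − 0.5)/n plotting position for the expected P-values under a uniform distribution; the plotting convention actually used for Figure 6 is not stated here, and the function name and toy P-values are hypothetical.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs on the -log10 scale.

    P-values are sorted from most to least significant; the expected value
    for rank r out of n uses the (r - 0.5) / n plotting position, which is
    an assumed convention. All P-values must be strictly positive.
    """
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected = (rank - 0.5) / n
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# Toy P-values, not taken from the study.
p_values = [0.5, 0.00001, 0.2, 0.9, 0.04]
points = qq_points(p_values)
```

Points lying well above the 45° line at the right-hand end of the plot correspond to P-values smaller than expected under the null distribution.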

A set of hypotheses was formed to test for subject-specific (that is, SNP-specific) differences between levels of batch size and composition. Table 1 gives the format of the design matrix and parameter vector for the fixed batch effect levels: C500, S500, C2000, and S2000.

Using the GLMM framework, we tested seven hypotheses for both Model I and Model II to estimate subject-specific differences (the difference in probability estimates between an SNP genotyped by one batch set vs that same SNP genotyped in another batch set). Two overall main effect tests, for size and composition, and a third hypothesis test for the overall interaction were performed. The remaining four tests are for simple interaction effects to determine subject-specific differences across levels of batch size at each level of batch composition and vice versa. The results of these tests for Models I and II are presented in Table 2.
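The seven hypotheses can be sketched as linear contrasts over the four cell log odds. The fragment below assumes a cell-means coding in which each main effect is the average of the corresponding simple effects; this coding, the sign convention for the interaction, and the cell values are assumptions for illustration and are not the design matrix of Table 1. The cell values were chosen only so that three simple-effect estimates quoted in the text for Model I (−1.434, −2.017, and 0.096) are reproduced, with C2000 as an arbitrary reference.

```python
CELLS = ["C500", "S500", "C2000", "S2000"]

# Contrast coefficients over the four cell log odds (assumed cell-means coding).
CONTRASTS = {
    "size: 500 vs 2000":                 [0.5, 0.5, -0.5, -0.5],
    "composition: combined vs separate": [0.5, -0.5, 0.5, -0.5],
    "size x composition":                [1, -1, -1, 1],
    "500 vs 2000, combined":             [1, 0, -1, 0],
    "500 vs 2000, separate":             [0, 1, 0, -1],
    "combined vs separate, size 500":    [1, -1, 0, 0],
    "combined vs separate, size 2000":   [0, 0, 1, -1],
}


def evaluate(contrasts, cell_log_odds):
    """Apply each contrast vector to the cell log odds (ordered as CELLS)."""
    values = [cell_log_odds[cell] for cell in CELLS]
    return {name: sum(c * v for c, v in zip(coefs, values))
            for name, coefs in contrasts.items()}


# Illustrative cell values, relative to C2000 = 0 (not reported estimates).
cell_log_odds = {"C500": -1.434, "S500": -1.530, "C2000": 0.0, "S2000": 0.487}
estimates = evaluate(CONTRASTS, cell_log_odds)
```

Under this assumed coding, the size main effect comes out as the average of the two simple size effects, (−1.434 − 2.017)/2 ≈ −1.73, which is in line with the reported ‘500 vs 2000’ estimate of −1.725. The other derived values are implied by the assumed coding and should not be read as reported results.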

Owing to the extremely large sample size, the power to detect significant differences is extremely high; thus, we do not try to use a significance threshold to evaluate P-values (although for Model II there were instances of relatively large P-values, owing to more variability in the estimates as there was a much smaller proportion of data in which Y2=1). It is more informative to look at the direction and magnitude of the estimates, that is, the log odds ratio of the probabilities of Y=1 for the contrasts specified in Table 2. For Model I, Table 2 reports a large negative estimate for the ‘500 vs 2000’ contrast (−1.725) for the QC exclusion modeling probability, indicative of a much higher probability for an SNP to be excluded by QC in the batch sets of size 2000 than for the same SNP to be excluded in the batch sets of size 500. The main effect contrast for ‘Combined vs Separate’ is also negative, yet is of a much smaller magnitude. More importantly, the size × composition contrast shows strong evidence that there is an interactive effect of batch size and composition that influences the probability of a locus being excluded due to QC. The simple interaction effect contrasts show that there is a large subject-specific difference in probabilities for batch sizes of ‘500 vs 2000’ when batch composition is separate (−2.017 compared with an estimate of −1.434 for ‘500 vs 2000, combined’). For the simple interaction effects for batch composition, there is very little evidence of subject-specific differences in QC results for the ‘Combined vs Separate’ contrast at a batch size of 500 (0.096 estimate with a P-value=0.0013). The ‘Combined vs Separate’ simple interaction effect for size 2000 showed more evidence of differences, indicating that separated batch composition at a batch size of 2000 has a higher probability of an SNP not passing QC. Referring back to Figure 1, these differences are caused by more SNPs with a lower call rate and more SNPs with significant trends in missing data between cases and controls. The contrast estimates for Model I, modeling the probability of QC exclusion, show that batch size differences have a larger effect on quality control results than batch composition, although the effect of batch size is slightly less severe for a combined batch composition.

Estimates are generally smaller in magnitude for Model II (for which the modeling probability is denoted as Significance in Table 2) compared with Model I, evidence that batch effects influence the results of QC much more strongly than association test results. For Model II, the batch size × composition interaction is not significant (P=0.08). Batch size differences again result in the most subject-specific differences (0.912 log odds ratio for the ‘500 vs 2000’ batch size contrast with P=1.8 × 10^−13), with more significant SNPs expected to be observed at a batch size of 500. The batch composition main effect contrast indicates a higher probability of significant findings for the separated composition as compared with combined batch sets (−0.415 log odds ratio). The simple interaction effects corroborate these results.
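Because the contrast estimates are log odds ratios, exponentiating them can make their magnitude easier to read. The short sketch below does this for the estimates quoted in the text; the dictionary labels are introduced here for convenience and are not the row names of Table 2.

```python
import math

# Log odds ratios quoted in the text for the stated contrasts.
reported_log_odds_ratios = {
    "Model I, 500 vs 2000": -1.725,
    "Model I, 500 vs 2000 (separate)": -2.017,
    "Model I, 500 vs 2000 (combined)": -1.434,
    "Model I, combined vs separate (size 500)": 0.096,
    "Model II, 500 vs 2000": 0.912,
    "Model II, combined vs separate": -0.415,
}

# Subject-specific (conditional on SNP) odds ratios.
odds_ratios = {name: math.exp(value)
               for name, value in reported_log_odds_ratios.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

For example, the Model I estimate of −1.725 corresponds to an odds ratio of roughly 0.18, so the odds of a given SNP being excluded by QC at a batch size of 500 are less than one-fifth of the odds at a batch size of 2000. The Model II estimate of 0.912 corresponds to an odds ratio of roughly 2.5 in favor of a significant result at a batch size of 500.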

Although the estimates for the hypotheses tested in our models show strong evidence of highly differential results, the subset of discordant outcomes for SNPs across the four batch sets was still very low. Overall concordance for the outcome of an SNP to pass QC or to be excluded from analysis in all four batch sets was 97.88%. Overall concordance for association testing (using the P<1.0 × 10^−5 threshold to assign a significant result) was 97.29% when including differential QC results. For SNPs that passed QC in all four batch sets, overall concordance in association testing was 99.993%. In smaller scale studies, these concordance rates would be encouraging. However, when over 500000 SNPs are tested, discordance of up to 2–3% can dramatically change the outcome of association testing, as this translates to thousands of markers influenced by batch effects.
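The scale of this discordance can be illustrated with a short calculation. The sketch below defines overall concordance as the fraction of SNPs with the same outcome in every batch set, applies it to a toy table, and converts the quoted percentages into approximate SNP counts. The round figure of 500000 SNPs, the function name, and the toy outcomes are assumptions for illustration; the exact number of SNPs analyzed differs.

```python
def overall_concordance(outcomes):
    """Fraction of SNPs whose outcome is identical in every batch set."""
    if not outcomes:
        raise ValueError("no SNP outcomes supplied")
    agree = sum(1 for calls in outcomes.values() if len(set(calls)) == 1)
    return agree / len(outcomes)


# Toy QC outcomes in the order C500, S500, C2000, S2000 (hypothetical SNPs).
toy_outcomes = {
    "snp_1": ("pass", "pass", "pass", "pass"),
    "snp_2": ("pass", "pass", "fail", "fail"),
    "snp_3": ("fail", "fail", "fail", "fail"),
    "snp_4": ("pass", "pass", "pass", "pass"),
}
toy_concordance = overall_concordance(toy_outcomes)

# Approximate counts implied by the quoted percentages.
N_SNPS = 500_000                                         # assumed round array size
discordant_low = round(N_SNPS * 0.02)                    # 2% discordance
discordant_high = round(N_SNPS * 0.03)                   # 3% discordance
qc_discordant = round(N_SNPS * (1 - 0.9788))             # QC concordance of 97.88%
```

Under these assumptions, a discordance of 2–3% corresponds to roughly 10000 to 15000 SNPs.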



We show that batch size and composition considerations can produce discordant results in terms of QC decisions and the final list of significantly associated SNPs for GWAS data, using the WTCCC CAD data set as an example. Figures 1 and 2 make it clear that increasing batch size from sets of 500 individuals called by the BRLMM algorithm to sets of 2000 or 3500 results in more SNPs being excluded from analysis under typical QC thresholds, as well as differential sets of SNPs that pass QC. In addition, batch composition differences result in more discordant results at a batch size of 500 as opposed to 2000. Subsequent testing of association with disease status is also highly affected by batch changes, in large part due to the differential SNPs that passed QC but also due to different trends in the magnitude of P-values, as seen in Figures 3 and 4. These discordant QC and P-value results propagate to the list of SNPs that are deemed significant for the different batch sets, as evidenced in Figure 5.

Using generalized linear mixed models, we show that batch size and composition effects have a highly significant impact on results of GWAS. Although our analyses indicate that stringent QC could eliminate much of the discordance, it would also result in a drastic loss of potentially useful data. Our WTCCC analysis for CAD found many likely spurious associations in addition to the signal on chromosome 9 that the original WTCCC analysis team reported. The analysts of the original WTCCC data understood the impact that genotyping errors can have and subjected every SNP found to be significant to visual inspection of the cluster plots of the normalized SNP intensity scores. Several hundred SNPs were examined visually, and the reported findings did not include SNPs that showed poor clustering, a clear source of genotyping uncertainty. Through this inspection, evidence of plate and batch effects was found due to lab variation for plates hybridized at different locations and times (reported in the Supplementary Information from the Wellcome Trust Case Control Consortium7). Unexplained variation in missing data rates across batches of samples was also observed, which we have shown can lead to inflation of the false positive rates if missing rates are systematically different between cases and controls. Clearly, without intense analysis and visual inspection, which requires enormous time and personnel resources, the reported findings of the WTCCC would have suffered severely from batch effects. For these reasons, the WTCCC researchers recommended simultaneous calling of all samples when possible to yield more conservative/accurate results similar to those we observed for the C3500 batch analysis. Generally, if we can assume that the C3500 significant results accurately control Type I error, our results indicate that calling carried out with case/control-separated batches of a smaller size (S500) can lead to over a fourfold increase in the Type I error rate (a conservative estimate based on the distribution of the counts of significant SNPs pictured in Figure 5). Batching performed in smaller sets with cases and controls combined still results in strong evidence of inflated false positive findings, with at least a twofold increase in Type I errors.

These results indicate that the BRLMM algorithm should be implemented with careful consideration on the Affymetrix 500K array set to ensure more reproducible, concordant results. As mentioned previously, the Wellcome Trust Case Control Consortium also found problems with the BRLMM algorithm and proposed a new calling algorithm, CHIAMO, in the course of their work,7 which was used to process their samples in one single batch. To counterbalance false positive results, other recent studies5, 6, 16 propose the use of modified normalization and summarization steps for better genotype calls with the method CRLMM. The use of batch-specific quality metrics proposed in CRLMM, version 2,16 aids in the detection and correction of SNPs with genotyping errors that can affect GWAS results and merits further research to assess how well such metrics mitigate batch effects in downstream analysis. Another algorithm called Birdseed, which is tailored for the Genome-Wide Human SNP Array 6.0, has also been introduced by Affymetrix, although it is still recommended to use BRLMM for the 500K array set. More research is needed to evaluate how batch considerations affect the results of these alternative genotype calling algorithms. Alternatively, a promising focus is on better clustering methods adapted to allow for differences in genotype clusters in cases and controls, as proposed by Plagnol et al,12 as such methods could help alleviate bias that is introduced due to batch processing.

Batch size and composition differences in the BRLMM algorithm can drastically change the results of a GWAS and are a potential source of lack of reproducibility. Batches containing more individuals (a larger size) with cases and controls combined resulted in more conservative, concordant association testing results with the CAD case–control samples from the WTCCC, indicative of a lower probability of making Type I errors with such a batch schema. On the other hand, the exclusion of more loci from analysis due to QC could result in increases in Type II errors, as an SNP that may have been associated is not even tested due to QC thresholds. The opposite is true for smaller batch sizes with cases and controls separated, in which positive findings are more likely to be spurious associations (as evidenced by higher instances of discordant findings in our study). Imposing strict QC measures can eliminate discordance in association results due to batch variation, but also eliminates a large portion of potentially informative data. There is a trade-off between high accuracy at the cost of losing informative data, and increased discovery of possible variants associated with disease that may contain a higher number of false positives.

Significance rules based on P-value cutoffs are not designed to be reproducible, but rather to control Type I and Type II errors, so some discordance evident in Figure 4 may simply be due to the fact that we are determining significance by a P-value-based criterion. A similar issue of discordance arose in the context of gene expression in the MAQC I project.17 A primary conclusion of that study was that filtering on both P-value and fold-change criteria can lead to more reproducible results. This suggests that adding a criterion, such as the absolute difference of the numerator of the association test statistic (for example, the Cochran–Armitage trend test), to current P-value filtering rules could enhance concordance and/or reproducibility of genome-wide association testing results.
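One possible reading of this suggestion is sketched below: a Cochran–Armitage trend test with additive genotype scores (0, 1, 2), combined with a filter on the absolute value of its numerator rescaled by the product of the group sizes, which equals the difference in mean genotype score between cases and controls. The choice of rescaling, the `min_abs_diff` threshold, and the function names are assumptions made for illustration; no such rule was applied in this study.

```python
import math


def cochran_armitage(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x k table of genotype counts.

    Returns (score_difference, z, two_sided_p). score_difference is the test
    numerator divided by the product of the group sizes, i.e. the difference
    in mean genotype score between cases and controls.
    """
    r1, r2 = sum(cases), sum(controls)
    if r1 == 0 or r2 == 0:
        raise ValueError("both groups must contain samples")
    n = r1 + r2
    cols = [a + b for a, b in zip(cases, controls)]
    # Numerator of the trend statistic.
    t = sum(w * (a * r2 - b * r1) for w, a, b in zip(weights, cases, controls))
    s1 = sum(w * w * c * (n - c) for w, c in zip(weights, cols))
    s2 = sum(weights[i] * weights[j] * cols[i] * cols[j]
             for i in range(len(cols)) for j in range(i + 1, len(cols)))
    var = (r1 * r2 / n) * (s1 - 2 * s2)
    diff = t / (r1 * r2)
    if var <= 0:
        # Degenerate table (for example, a monomorphic SNP): no evidence of trend.
        return diff, 0.0, 1.0
    z = t / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal P-value
    return diff, z, p


def passes_dual_filter(cases, controls, alpha=1.0e-5, min_abs_diff=0.1):
    """Require both a small P-value and a minimum absolute score difference."""
    diff, _, p = cochran_armitage(cases, controls)
    return p < alpha and abs(diff) >= min_abs_diff


# Toy genotype counts (AA, AB, BB); not taken from the study.
strong = passes_dual_filter((100, 50, 10), (10, 50, 100))
null = passes_dual_filter((10, 20, 30), (10, 20, 30))
```

With very large samples, a tiny score difference can still produce a small P-value; the second criterion is meant to screen out such cases, in the spirit of the combined P-value and fold-change filtering described for the MAQC I project.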

Decisions on batch size and composition should be based on the researcher's goals, discovery or accurate reproducibility, and careful attention to DNA collection and preparation. It is recommended to use a combined batch composition in studies in which cases and controls are prepared by similar labs and randomly assigned to plates for processing to avoid differential bias due to varying patterns of missing data in cases and controls (although a test for this as a quality control step alleviates some of the bias) or to use methods, such as those developed by Plagnol et al.12 In all follow-up, collaborative, and replication studies, it is paramount to use a common genotype calling algorithm and batch schema to have comparable results. Another possible approach would be to analyze the allele probe intensities directly to avoid genotype calling errors. Future research into the behavior of batch effects in other popular calling algorithms such as Birdseed, CRLMM (version 1 and version 2), and CHIAMO is necessary for researchers to make an informed choice for genotyping platform and genotype calling algorithm and is currently under study by the MAQC GWAWG.




The views presented in this article do not necessarily reflect those of the US Food and Drug Administration.


Conflict of interest

The authors declare no conflict of interest.



  1. Kingsmore SF, Lindquist IE, Mudge J, Gessler DD, Beavis WD. Genome-wide association studies: progress and potential for drug discovery and development. Nat Rev Drug Discov 2008; 7: 221–230.
  2. Donnelly P. Progress and challenges in genome-wide association studies in humans. Nature 2008; 456: 728–731.
  3. Clayton DG, Walker NM, Smyth DJ, Pask R, Cooper JD, Maier LM et al. Population structure, differential bias and genomic control in large-scale, case–control association study. Nat Genet 2008; 37: 1243–1246.
  4. Di X, Matsuzaki H, Webster TA, Hubbell E, Liu G, Dong S et al. Dynamic model based algorithms for screening and genotyping over 100K SNPs on oligonucleotide microarrays. Bioinformatics 2005; 21: 1958–1963.
  5. Carvalho B, Bengtsson H, Speed TP, Irizarry RA. Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics 2007; 8: 485–499.
  6. Lin S, Carvalho B, Cutler DJ, Arking DE, Chakravarti A, Irizarry RA. Validation and extension of an empirical Bayes method for SNP calling on Affymetrix microarrays. Genome Biol 2008; 9: R63.
  7. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007; 447: 661–678.
  8. Winkelmann J, Schormair B, Lichtner P, Ripke S, Xiong L, Jalilizadeh S et al. Genome-wide association study of restless legs syndrome identifies common variants in three genomic regions. Nat Genet 2007; 39: 1000–1006.
  9. Meisinger C, Prokisch H, Gieger C, Soranzo N, Mehta D, Rosskopf D et al. A genome-wide association study identifies three loci associated with mean platelet volume. Am J Hum Genet 2008; 84: 66–71.
  10. Gold B, Kirchhoff T, Stefanov S, Lautenberger J, Viale A, Garber J et al. Genome-wide association study provides evidence for a breast cancer risk locus at 6q22.33. Proc Natl Acad Sci USA 2008; 105: 4340–4345.
  11. Affymetrix White Paper Publication. BRLMM: an improved genotype calling method for the GeneChip Human Mapping 500K array set. http://www.affymetrix.com/support/technical/whitepapers/brlmmwhitepaper.pdf.
  12. Plagnol V, Cooper JD, Todd JA, Clayton DG. A method to address differential bias in genotyping in large-scale association studies. PLoS Genet 2007; 3: 759–767.
  13. Hong H, Su Z, Ge W, Shi L, Perkins R, Fang H et al. Assessing batch effect of genotype calling algorithm BRLMM for Affymetrix GeneChip Human Mapping 500K array set using 270 HapMap samples. BMC Bioinformatics 2008; 9(Suppl 9): S17.
  14. Miyagawa T, Nishida N, Ohashi J, Kimura R, Fujimoto A, Kawashima M et al. Appropriate data cleaning methods for genome-wide association study. J Hum Genet 2008; 53: 886–893.
  15. Anney RJ, Kenny E, O’Dushlaine CT, Lasky-Su J, Franke B, Morris DW et al. Non-random error in genotype calling procedures: implications for family-based and case-control genome-wide association studies. Am J Med Genet B (Neuropsychiatr Genet) 2008; 147: 1379–1386.
  16. Carvalho BS, Louis TA, Irizarry RA. Quantifying uncertainty in genotype calls. Bioinformatics 2010; 26: 242–249.
  17. MicroArray Quality Control Consortium. The microarray quality control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 2006; 24: 1151–1161.


We thank all members of the GWAWG and MAQC for their contribution to this study. We also thank the members of the WTCCC for providing access to the data and the anonymous reviewers, whose comments and insight have made this a much more effective paper.