The Genome-Wide Association Working Group (GWAWG) is part of a large-scale effort by the MicroArray Quality Control (MAQC) Consortium to assess the quality of genomic experiments, technologies and analyses for genome-wide association studies (GWASs). One of the aims of the working group is to assess the variability of genotype calls within and between different genotype calling algorithms using data for coronary artery disease from the Wellcome Trust Case Control Consortium (WTCCC) and the University of Ottawa Heart Institute. Our results show that the choice of genotyping algorithm (for example, Bayesian robust linear model with Mahalanobis distance classifier (BRLMM), the corrected robust linear model with maximum-likelihood-based distances (CRLMM) and CHIAMO (developed and implemented by the WTCCC)) can introduce marked variability in the results of downstream case–control association analysis for the Affymetrix 500K array. The amount of discordance between results is influenced by how samples are combined and processed through the respective genotype calling algorithm, indicating that systematic genotype errors due to computational batch effects are propagated to the list of single-nucleotide polymorphisms found to be significantly associated with the trait of interest. Further work using HapMap samples shows that inconsistencies between Affymetrix arrays and calling algorithms can lead to genotyping errors that influence downstream analysis.
Lack of reproducibility has been a constant plague for genome-wide association studies (GWASs). GWASs must be carefully designed and carried out to avoid the myriad of pitfalls that can introduce type I and type II errors due to both biological and technical biases.1 Studies aimed at understanding the genetic contributions to complex diseases continue to increase in size and scope; therefore, it is paramount to establish a standard of quality to avoid even minuscule errors that can lead to unreliable results. One source of bias is systematic genotyping errors that can propagate to downstream GWAS analysis results.2, 3 Owing to the data size and memory constraints of hardware, it is common in GWAS for samples to be grouped either by plate or by batch for genotype calling. This can introduce differential bias and systematic errors in the genotype calls that may influence the results of association testing. This is especially a concern when DNA collection for samples was carried out in different labs or at different times, or when a study uses a set of common controls obtained from an external site.
In general, batch effects can refer to multiple sources of systematic bias in a genetic study. Technical batch effects can arise when disease status is confounded with differences in laboratories, data collection protocols, sampling dates, et cetera. Meanwhile, biological batch effects include ancestral differences, population structure and relatedness that influence the phenotype. In this study we focus on what could be considered statistical batch effects, that is, systematic bias because of the execution of statistical algorithms used for converting intensities taken from the microarrays into genotype calls. Grouping sets of samples into batches for processing in a genotype calling algorithm can introduce statistical batch effects that are exacerbated by technical and biological batch effects.
These sources of variability are particularly critical when the list of single-nucleotide polymorphisms (SNPs) selected is to be tested in a clinical trial to support the approval of a drug or a test. Regulatory agencies are concerned about the reproducibility of the lists of SNPs selected, fully aware of the constraints for confirmatory studies. As the steps in the process of selecting a classifier are followed, are there specific configurations for the analysis protocols with which we might anticipate more reproducible classifiers?
In an effort to evaluate the effect that choice of genotype calling algorithm and sample batch processing to obtain genotype calls has on GWAS results, the MicroArray Quality Control (MAQC) Genome-Wide Association Working Group (GWAWG) carried out association analyses under a variety of different genotype calling algorithm conditions using data obtained from the Wellcome Trust Case Control Consortium (WTCCC).4 Our findings show that careful attention to how samples in a study are processed is necessary to produce accurate and reproducible genotype calls when using genotype calling algorithms for the Affymetrix GeneChip Human Mapping 500K array (Affymetrix, Santa Clara, CA, USA, Catalog no. 500K). Concurrent publications along with this article5, 6, 7 include analyses of statistical batch effects (how individuals are combined and processed through the genotype calling algorithms) for the Bayesian robust linear model with Mahalanobis distance classifier (BRLMM) algorithm,8 the corrected robust linear model with maximum-likelihood-based distances (CRLMM)9 and CHIAMO (the algorithm developed and implemented by the WTCCC in the study of seven complex diseases).4 From these analyses, we find that there is significant variability in the results of association analysis attributable to the number of samples processed through the genotype calling algorithm at a time, as well as the composition of the samples batched together for simultaneous genotype calling (whether cases and controls are combined within batches or whether calls are generated for controls separately from cases). Using the results of these analyses for the WTCCC data for coronary artery disease (CAD; 1500 controls and 1991 cases), this paper summarizes calling algorithm batch effect findings and collates them to examine variability across genotype calling algorithms and its implication on GWAS.
We find that there is variability not only within calling algorithms, but also between algorithms; moreover, the amount of variation is influenced by both calling algorithm batch size and composition and can have significant effects on the results of GWAS. Using additional data on a CAD GWAS provided by the University of Ottawa Heart Institute from the ongoing Ottawa Heart Genomics Study (OHGS) for samples genotyped on the Affymetrix 500K array,10 we find further evidence of discordance among genotype calling algorithms as a follow-up replication study.
Additional work published concurrently11 by the GWAWG includes further research based on the use of HapMap data12 to evaluate the technical robustness in genotyping technologies, inconsistencies between Affymetrix SNP arrays and the corresponding recommended calling algorithms (dynamic model, BRLMM and Birdseed) and the propagation of these inconsistencies to analysis. Results show that through the use of technical replicates, the Affymetrix Genome-Wide Human SNP 6.0 array (Affymetrix SNP6) is generally robust with high concordance (99%) across laboratories. Yet, for the Affymetrix 500K and Affymetrix SNP6 arrays, inconsistencies still arise across arrays as well as across genotyping algorithms recommended by Affymetrix. These genotype inconsistencies are amplified when downstream analysis of quality control (QC) steps and association testing is performed (supporting the results that we observed with the WTCCC case–control association data).
Materials and methods
WTCCC comparative analysis
To study batch variability in calling algorithms with the WTCCC data, for each genotype calling algorithm (BRLMM, CRLMM and CHIAMO), the 1991 CEL files obtained from the WTCCC for samples with CAD and the 1500 CEL files for control individuals from the UK Blood Service Control Group were used to form batch sets for genotype calling algorithm processing. The end result of each algorithm analysis was five sets of P-values for SNPs on the Affymetrix 500K array. These five sets of P-values correspond to the results of association analysis for the genotype data generated by five different sample batch processing formats for genotype calling. Table 1 outlines the batch sizes and composition (that is, the number of CEL files processed simultaneously and the case–control status of the samples processed within a batch) used to create each set of genotype calls, along with the names used to refer to that set. More details can be found within the articles pertaining to each genotype calling algorithm: BRLMM,5 CHIAMO6 and CRLMM.7
For each genotype calling algorithm, a standard set of steps for QC and association testing was carried out (refer to the individual papers for further detail) and variability in results within each algorithm was assessed. Using these results, we assessed the variability in QC filtering and association testing between the three genotype calling algorithms.
Computing time and specifications for genotype calling
The most common reason that genotype calling is performed in batches is limitations in computing power and time constraints. The genotype calling algorithms are computationally intensive, especially when all samples are called simultaneously. Table 2 shows the computer specifications and computing time needed to obtain genotype calls for all 3491 samples simultaneously for the three algorithms. BRLMM calling was performed using the Affymetrix Power Tools version 1.10.2, which is available for download from http://www.affymetrix.com. CHIAMO version 0.2.1 was used for CHIAMO calls, with software freely available for download. CRLMM calls were made using CRLMM version 1.4.1, which is available as a Bioconductor package for R. All default parameter settings as specified in the Affymetrix Power Tools ‘apt-probeset-genotype’ application were used for computing BRLMM calls, including the 0.5 confidence threshold (based on the ratio of the Mahalanobis distances of the two closest computed genotype clusters) used to set a SNP call to missing. For CRLMM calls, in comparison with BRLMM's standard 0.5 call/no-call cutoff, a 0.94 confidence threshold was used for calling a genotype as missing. The CHIAMO algorithm uses a Bayesian 4-class (AA, AB, BB and NULL) hierarchical mixture model to obtain genotype probabilities. The genotype with the highest probability is chosen for the call; if no probabilities exceed 0.9 (set according to the WTCCC study) then the genotype is set to NULL. Further information on the differences in methodologies can be found in the respective concurrent publications: BRLMM,5 CHIAMO6 and CRLMM.7
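The call/no-call logic shared by the three algorithms can be sketched as follows. This is a deliberate simplification with illustrative names (none of the algorithms expose such a function): each algorithm computes a per-genotype confidence measure, takes the most confident class as the call and sets the call to missing when confidence falls below the algorithm's threshold (0.9 posterior probability for CHIAMO; BRLMM and CRLMM use analogous cutoffs on their own confidence scores).

```python
CHIAMO_THRESHOLD = 0.9  # posterior probability cutoff used in the WTCCC study

def call_genotype(posteriors, threshold=CHIAMO_THRESHOLD):
    """Return the most probable genotype, or None (a no call) when the
    highest posterior probability falls below the threshold. Boundary
    handling is simplified relative to the actual algorithms."""
    genotype = max(posteriors, key=posteriors.get)
    if posteriors[genotype] < threshold:
        return None  # CHIAMO's NULL class; BRLMM/CRLMM analogously set missing
    return genotype

# A confident homozygote is called; an ambiguous heterozygote is not.
assert call_genotype({"AA": 0.97, "AB": 0.02, "BB": 0.01}) == "AA"
assert call_genotype({"AA": 0.45, "AB": 0.50, "BB": 0.05}) is None
```

The no-call mechanism is what makes the call rate a meaningful QC statistic downstream: poorly clustered SNPs generate many low-confidence calls and are caught by the call rate filters described later.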
Ottawa Heart Institute data analysis
Data obtained for 898 cases and 981 controls from the University of Ottawa Heart Institute on a GWAS for CAD was also analyzed by the GWAWG as a follow-up replication study to further evaluate calling algorithm variability and association results. Using all Ottawa samples simultaneously, genotype calls were made by the BRLMM, CRLMM and CHIAMO algorithms (execution mirrored that used for the WTCCC data) using the CEL files for the Affymetrix 500K array. Downstream QC and analysis were then performed to obtain P-values for SNP association with CAD in the Ottawa samples.
The same QC steps applied in the WTCCC analysis were followed with the Ottawa samples. This includes filtering samples with a call rate of <97% and filtering SNPs that have a minor allele frequency of <1%, a call rate <95%, a χ2 test of significant differences in the proportion of missing data between cases and controls and a χ2 test for Hardy–Weinberg equilibrium in the controls (the SNP filters were applied in that order). Significant results for the test of a trend in missing data between cases and controls and the Hardy–Weinberg equilibrium test were determined using the empirical threshold of α=5.7 × 10−7 as in the WTCCC study. To detect sample handling errors and to ensure that the correct NSP and STY arrays were matched together, representing the same sample, an additional QC test of the correlation between the genotypes for each sample over both arrays was conducted (before other QC filtering; this QC step was also performed for the WTCCC data but no mismatches were found). The nature of linkage disequilibrium is such that for two neighboring SNPs, the alleles will usually be inherited together as part of a haplotype block, and hence be correlated. This phenomenon can be used to test the matching between the NSP and STY genotyping arrays used in this study. Over a large number of closely spaced SNP pairs, with one SNP from each array, we should observe correlation between the genotypes for appropriately matched samples. For each autosomal SNP on the NSP array, we identified the nearest STY SNP within a maximum distance of 2500 base pairs. We then calculated the correlation of minor allele counts between all of the resulting SNP pairs for each subject in the data. A histogram of the resulting correlation values revealed two distinct clusters, and is shown in Supplementary Figure 1. There were 29 samples with extremely low NSP/STY correlation (mean 0.06, s.d. 0.004), whereas the remaining 1850 matched samples (884 cases and 966 controls) had a higher correlation (mean 0.28, s.d. 0.005). To further support that the correlation difference was because of an NSP/STY mismatch, we confirmed that 19 of the 29 mismatched samples were also gender mismatched, as measured by X heterozygosity (see Supplementary Figure 2). After all QC filters, SNPs were deemed significant using the same WTCCC empirical threshold for genome-wide significance of α=5.0 × 10−7 for the χ2 additive trend test for association, as well as a less stringent threshold of α=1.0 × 10−5, which may be more reasonable given the smaller sample size and power considerations of moderate effects.
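The NSP/STY matching check described above can be sketched as below. Names and the 0.15 cutoff are illustrative (the cutoff is chosen to fall between the two reported clusters, mismatched mean ~0.06 versus matched mean ~0.28); the actual analysis paired each autosomal NSP SNP with its nearest STY SNP within 2500 bp.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def flag_nsp_sty_mismatches(minor_allele_counts, cutoff=0.15):
    """minor_allele_counts maps sample id -> (nsp_counts, sty_counts),
    where each element is the list of minor-allele counts (0/1/2) over
    the paired SNPs. Samples whose per-subject correlation falls in the
    low cluster are flagged as likely NSP/STY mismatches."""
    return sorted(s for s, (nsp, sty) in minor_allele_counts.items()
                  if pearson(nsp, sty) < cutoff)

# Toy example: s1's two arrays agree; s2's do not.
flagged = flag_nsp_sty_mismatches({
    "s1": ([0, 1, 2, 1, 0, 2], [0, 1, 2, 1, 0, 2]),
    "s2": ([0, 1, 2, 1, 0, 2], [2, 0, 1, 2, 1, 0]),
})
assert flagged == ["s2"]
```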
Calling algorithm variability in the WTCCC data
For all three genotype calling algorithms, the C3500 data set seems to produce the most conservative (resulting in less apparent inflation to the type I error) and concordant (following the results published in the WTCCC) association results within the algorithms. In general, genotype calls generated using a batch size of 500 samples with cases and controls separated (S500) resulted in a large number of apparent false-positive associations and discordance. Figure 1 shows five-way Venn diagrams highlighting the concordance/discordance of SNPs found to be significant at a genome-wide significance level of α=5.0 × 10−7 for each data set created by the different batch scenarios.
The counts displayed where all ellipses overlap indicate (concordant) SNPs that were found to be significant in all data sets for the genotype algorithm, whereas the counts that are in only one ellipse represent the number of (discordant) SNPs that were found to be significant only in that data set. Across all three genotyping algorithms, a number of SNPs were found to be significant only in data sets with calls generated for cases and controls separately, with the most discordance found in the CHIAMO genotype calling algorithm. This is likely because of systematic call errors introduced by allowing the cluster clouds used to make genotype calls to differ positionally for cases versus controls (and hence violating the null hypothesis of no genotypic differences associated with trait status). As batch size for genotype calling increased, fewer SNPs were in general found to be discordant in testing (the exception being CHIAMO with the S2000 data set, which showed an inordinately high number of SNPs associated only with that data set). The main conclusion we can draw from Figure 1 is that, across all genotype calling batch scenarios, the CRLMM algorithm is the most concordant algorithm (resulting in more reproducible results that are less likely to suffer from type I errors regardless of batch size and composition).
Concordance based on a genome-wide threshold significance level to determine genetic signal associated with a trait can be arbitrary and can amplify the effect of minor differences between studies. Figure 2 more completely shows the discordance that exists in the data because of genotype algorithm execution. Each data set (C2000, C500, S2000 and S500) was compared against the C3500 data set for each algorithm by evaluating the percentage of SNPs found in common in ranked SNP lists of varying sizes. On the basis of the sorted P-values, lists of the top-ranked hits were collected for sizes 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16 384 and 32 768. Figure 2 plots the percentage of SNPs in agreement versus list size (on the log scale) from different genotype data sets. Across varying list sizes, the CRLMM algorithm seems the most concordant, with the exception of the S500 data set comparison, which fails to achieve more than a 62% agreement rate even at a list size of 32 768 ranked SNPs. The C2000 data set shows the most agreement with C3500 across algorithms (as expected), but in general, agreement across varying list sizes does not get better than 80–85% concordance, which indicates that even without an arbitrary cutoff for significance testing, there is marked variability because of statistical batch effects. Figure 8 (shown later with the Ottawa heart study results) shows a similar list agreement plot, comparing the three genotyping algorithms for the WTCCC data and the Ottawa data.
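The list-agreement metric used here can be sketched as follows (function and variable names are ours, not from the study's analysis code): for each list size k, it reports the percentage of SNPs common to the top-k lists of two analyses ranked by P-value.

```python
def list_agreement(pvals_a, pvals_b, sizes=(2, 4, 8, 16, 32)):
    """pvals_a, pvals_b: dicts mapping SNP id -> association P-value.
    Returns {k: percentage of SNPs shared between the two top-k lists},
    for each requested list size k."""
    rank_a = sorted(pvals_a, key=pvals_a.get)  # ascending P-value
    rank_b = sorted(pvals_b, key=pvals_b.get)
    return {k: 100.0 * len(set(rank_a[:k]) & set(rank_b[:k])) / k
            for k in sizes}

# Toy example: the two analyses agree on rs1 but swap rs2/rs3 in rank,
# so the top-2 lists share only one SNP (50% agreement).
a = {"rs1": 1e-8, "rs2": 1e-6, "rs3": 1e-4, "rs4": 0.5}
b = {"rs1": 1e-7, "rs3": 1e-6, "rs2": 1e-3, "rs4": 0.4}
assert list_agreement(a, b, sizes=(2,)) == {2: 50.0}
```

Because the metric needs no significance cutoff, it exposes rank instability between batch scenarios that a fixed genome-wide threshold would hide.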
Figure 3 shows Venn diagrams that compare directly the three calling algorithms for data generated under the S500 batch schema and the C3500 batch; data sets in which the genotype calls generally result in the most discordant and most concordant results, respectively. The top set of Venn diagrams depict the counts of SNPs that were filtered out for not passing QC. Even for the gold standard of calling all individuals simultaneously (C3500), thousands of SNPs show discordance such that they have been excluded from analysis in one data set, but included for association testing in another, providing evidence that there is variability in the resulting calls and genotype errors are being made that propagate to the downstream analysis.
The effect of systematic genotyping errors is evident in the discordance in the SNPs found significant, as depicted in the bottom left diagram of Figure 3, in which for the S500 data, over 200 SNPs were found significant in CHIAMO that were not significant at the genome-wide level for CRLMM. Similarly, 114 SNPs found significant in BRLMM were not so in CRLMM, and the overlap between BRLMM and CHIAMO is very small. This indicates that many of these SNPs deemed significant are likely type I errors. When genotype calls for all samples are made simultaneously (for C3500 data), there is much more concordance in the list of significant SNPs, although differences still exist. Notably, for BRLMM, 14 additional SNPs were found significant even under the ideal genotype calling batch design. Of those 14 SNPs, 11 were found on the X chromosome and a majority of them had been excluded for QC in CHIAMO and CRLMM.
From Figure 3, it is clear that there are notable differences between results of the three different genotype calling algorithms. This is made even clearer by the P-value plots for the data from S500 and C3500 in Figure 4 (a and b, respectively). One explanation for discordance is that by imposing discrete choices of QC pass or fail as well as genome-wide significance as below the threshold or not, even minuscule differences will be influential. Figure 4 shows that even without the use of discrete choices to make conclusions, P-values for GWAS can vary under different algorithms and are highly influenced by how samples are processed to obtain genotype calls.
The findings of GWAS are always subject to scrutiny for both type I and type II errors. Often type I errors are likely if the distribution of significant findings shows a large number of SNPs that are below the genome-wide significance threshold. Figure 5 shows the total counts of SNPs found significant for all batch set combinations for the three genotype calling algorithms. Across all batching combinations to produce genotype calls, CRLMM was the most conservative and most concordant algorithm. CHIAMO seems to have a largely inflated number of significant associations when cases and controls are called separately, indicative of likely type I errors. BRLMM also seems to inflate the type I error rate for separated batches as well as for smaller batch sizes.
Calling algorithm variability in the Ottawa Heart Institute data
Our analysis of the WTCCC data for CAD showed that calling genotypes for all data simultaneously can yield more concordant results across different algorithms. Although differences still remained, the strong association signal on chromosome 9p21.3 was consistently evident, with significant SNPs found in common across all three algorithms for the WTCCC samples. With the Ottawa data, the GWAWG hoped to see replication of this signal, as well as observe concordant results across the three genotype calling algorithms when all samples were processed simultaneously.
Table 3 reports the counts of SNPs excluded at each step of the QC process, leaving a total number of SNPs that were tested for association of 377 736 for BRLMM calls, 376 356 for CRLMM and 397 601 SNPs for CHIAMO. Figure 6 shows the P-values across the genome for the SNP genotypes called by CRLMM, BRLMM and CHIAMO.
Although there is slight evidence of a replicated signal on chromosome 9p21.3, it is not notably significant at the α=5.0 × 10−7 level across all algorithms. The signal is replicated at the lower significance level of α=1.0 × 10−5 (the second tier significance level considered by the WTCCC), but as a stand-alone study it would not provide strong evidence of association. The P-values of association from the BRLMM called genotypes (middle plot of Figure 6) show more evidence of the 9p21.3 signal, with one SNP, rs703845, being highly significant with a negative log10 P-value of 14.8. This SNP was also highly significant in CRLMM at a negative log10 P-value of 9.33 but was excluded by QC filters in CHIAMO (because of a trend in the proportion of missing genotypes between cases and controls) and, more interestingly, excluded in the WTCCC study for low minor allele frequency. For the WTCCC data, rs703845 had a minor allele frequency of 0.0013 (BRLMM called genotypes), whereas in the Ottawa data it had a minor allele frequency of 0.027 (BRLMM). This large difference in minor allele frequency (Fisher's exact test for a difference returned a P-value of 2.2 × 10−35) raises other questions; for example, the possibility of population differences between the Canadian samples and the European samples. Further discussion of this is given in the Discussion section.
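The minor allele frequency comparison above amounts to Fisher's exact test on a 2×2 table of minor/major allele counts in the two cohorts. A minimal pure-Python sketch of the two-sided test is below (summing the probabilities of all tables no more probable than the observed one; a production analysis would use a vetted statistics package rather than this illustration):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]]
    (for example, minor/major allele counts in two cohorts)."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)

    def table_prob(x):
        # Hypergeometric probability of a table with cell (1,1) equal to x,
        # given the fixed row and column margins.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = table_prob(a)
    lo = max(0, col1 - (n - row1))
    hi = min(row1, col1)
    # Sum over all achievable tables at most as probable as the observed
    # one (small tolerance guards against floating-point ties).
    return sum(p for p in (table_prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# A perfectly balanced table gives P = 1; a maximally skewed one is tiny.
assert abs(fisher_exact_two_sided(5, 5, 5, 5) - 1.0) < 1e-9
assert fisher_exact_two_sided(10, 0, 0, 10) < 1e-4
```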
The other purpose of analyzing the Ottawa data was to evaluate the concordance between the calling algorithms when all samples are called simultaneously. Figure 6 shows that there are many SNPs with different P-value results across the genotyping algorithms. Only four loci that have an asterisk marker in Figure 6 were found to be significant at the empirical threshold (α=5.0 × 10−7). The Venn diagrams in Figure 7 show that there is very little concordance among the algorithms with the Ottawa data. No SNP on chromosome 9 was found to be significant across all three algorithms at the genome-wide threshold, although there are SNPs in this region with low P-values that would meet a relaxed significance threshold. CRLMM calls produced the most discordant list of significant SNPs in comparison to the other algorithms. The majority of this discordance was found to be because of differential QC filtering, in which 97 of the 98 SNPs found significant only in CRLMM calls were SNPs that were filtered out in QC in either or both the BRLMM and CHIAMO called data.
Figure 8 shows the plots of ranked SNP list concordance (similar to Figure 2), comparing results across different genotype calling algorithms for both the WTCCC and the Ottawa studies. From this plot it is evident that the percentage of agreement between top-ranked SNPs is generally much higher in the WTCCC data than the Ottawa data across SNP list sizes (especially for smaller list sizes), which could indicate that sample size influences concordance among genotyping algorithms. Despite the evidence of more discordance in the CRLMM algorithm when applying a discrete significance threshold, all three algorithms have similar list agreement profiles in the Ottawa data.
Unlike the WTCCC data analysis, in which data generated through the CRLMM algorithm resulted in a lower number of significant findings, CRLMM called genotypes resulted in many more significant hits in the Ottawa data. Very few of these significant findings appear to be real, robust signal, as indicated by the random scatter of P-values in Figure 6 (top plot). CHIAMO seems to produce more conservative results with less evidence of significant findings across the genome in comparison to CRLMM and BRLMM, and also showed higher-quality data with few loci filtered out because of QC steps. These results with the Ottawa data further complicate the issue of genotype calling algorithm variability. Clearly, the algorithms produce enough genotype calling errors to distinctly influence the results of a GWAS, but the behavior of these algorithms across different data sets does not allow conclusions as to which (if any) is more likely to produce false-positive results.
Stringent QC analysis of Ottawa data
In the study of batch effects for the BRLMM algorithm, it was found that applying a more stringent SNP call rate filter of 99% eliminated much of the discordance in significant results (at the cost of also eliminating nearly half the loci).5 This additional QC filter was carried out for the genotype calls for CRLMM, BRLMM and CHIAMO with the Ottawa samples to see whether much of the apparently spurious and discordant associations would be eliminated. Using a 99% SNP call rate filter left 326 444 CRLMM called SNPs, 274 706 BRLMM called SNPs and 343 640 CHIAMO called SNPs for association analysis. Figure 9 shows the genome-wide plots of the P-values for these SNPs. Under the stringent QC, it is more clearly evident that there is a slight replication signal at chromosome 9 when using the α=1.0 × 10−5 threshold. There also seems to be a possible slight signal on chromosome 1 when analyzing across the three algorithms for consistency.
CRLMM confidence call analysis
The CRLMM algorithm still seems to show discrepancies for these data even with the higher call rate QC filter. To further investigate the underlying reason for the discordances with the associated SNPs from CRLMM, the calls for the 98 SNPs found significant only by CRLMM (using the initial QC filtering standard, which was a call rate threshold of 95%) were analyzed in depth. As mentioned previously, the majority of these SNPs were filtered out by CHIAMO and BRLMM in QC steps. For BRLMM, 64 of these SNPs had a minor allele frequency of <0.01, 15 had a call rate of <95%, 1 SNP was significantly out of Hardy–Weinberg equilibrium proportions and 3 loci were excluded after all previous steps for a significant trend in missing data between cases and controls; this left only 16 SNPs of the 98 that were even tested for association. A similar pattern of filtering was observed for CHIAMO at these loci. Loci that passed QC in all algorithms additionally showed discordant behavior with CRLMM. Pair-wise scatter plots of the association test P-values for each of the algorithms showed many more SNPs with P-values that were inflated in the CRLMM algorithm when compared with both BRLMM and CHIAMO, as shown in Supplementary Figure 3; many of these loci were from the NSP array.
The CRLMM algorithm produces a confidence score for each genotype call and in this analysis a call with a confidence score of <0.94 was set to a no call. An analysis of variance was performed using JMP Genomics (SAS Institute, Cary, NC, USA) software on the confidence scores for the 98 SNPs in question in the CRLMM algorithm, and for each SNP a test for differences in the mean confidence call between case and control status was calculated. The Hochberg and Benjamini13 adaptive step-up Bonferroni method was used to adjust for multiple testing of the 98 SNPs through PROC MULTTEST in SAS/STAT (SAS Institute, Cary, NC, USA). Using a significance level of 0.05 with this multiple testing correction, 65 of the 98 SNPs were found to have significantly different confidence score means between cases and controls in the Ottawa data (see Supplementary Figure 4 for the volcano plot of these P-values versus the mean difference). Although all SNPs had average confidence scores greater than 97% for case and control sets (Supplementary Figure 5 shows a parallel plot of the mean profiles for each SNP found significant), the score distributions were skewed, with several confidence scores falling into the 0.5–0.8 range. Moreover, the majority of these low confidence scores were associated with heterozygous (A/B) SNP calls, indicating a propensity for genotyping errors to be made for the heterozygote call (which corroborates the fact that many of these SNPs were excluded in BRLMM and CHIAMO because of low minor allele frequency). Variability in the accuracy and quality of the CRLMM calls between the WTCCC and Ottawa samples indicates that more research into the performance of this method is needed.
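The step-up multiple-testing adjustment can be sketched with the closely related (non-adaptive) Hochberg step-up procedure; this is a simplified stand-in for the adaptive step-up Bonferroni method applied via PROC MULTTEST, with illustrative names.

```python
def hochberg_stepup(pvals, alpha=0.05):
    """Hochberg's step-up procedure: with P-values sorted in ascending
    order, find the largest rank i such that p_(i) <= alpha / (m - i + 1)
    and reject the hypotheses with the i smallest P-values.
    Returns the 0-based indices of rejected hypotheses."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha / (m - rank + 1):
            k = rank  # track the largest qualifying rank
    return sorted(order[:k])

# Only the smallest P-value survives the step-up thresholds here:
# 0.001 <= 0.05/4, but 0.03 > 0.05/3, 0.04 > 0.05/2 and 0.9 > 0.05.
assert hochberg_stepup([0.001, 0.04, 0.03, 0.9]) == [0]
```

The adaptive variant additionally estimates the number of true null hypotheses and relaxes the denominators accordingly, so it rejects at least as many hypotheses as this sketch.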
The results of the analyses by the GWAWG highlight the necessity of quality standards across genotyping platforms, arrays and, most notably, genotype calling algorithms. Although the bulk of the research was performed on the Affymetrix 500K array, the evidence of systematic genotyping errors that influence GWAS results has far-reaching implications. Array consistency is only improving very slowly as microarray technology advances, and methods of calling genotypes still rely on accurately capturing genotype cluster clouds, which can be poorly formed and difficult to cluster. Newly introduced genotype calling algorithms that rely on the use of statistical clustering will inherently contain measurement error and will still produce genotype errors that can systematically bias the results of a GWAS. Our results show that it is vital to pay careful attention to how samples are collected and processed in the new era of meta-analysis and large-scale studies combining thousands of individuals across diseases, shared controls and even genotyping platforms.
When possible, with the proper computing resources, it is the authors' recommendation based on these results that genotype calling should be performed on all samples simultaneously to avoid systematic batch effects that propagate to downstream analysis. When this is not possible, it is imperative for cases and controls to be randomized within batches to avoid violations of the null hypothesis of no genotypic differences associated with trait status. The developers of BRLMM, the earliest of the three algorithms, discussed the issue of batch size and composition.8 On the basis of their data and accuracy targets, they mentioned that more samples processed at a time will lead to better results, but found that performance did not greatly improve when increasing sample size past 50–100 samples. The BRLMM authors also caution that the extent to which samples can be combined must be considered, depending on lab differences and variation in the underlying probe intensity distribution. CRLMM was developed in part to address poor performance across data from different laboratories9 so that more samples could be processed while mitigating lab effects and other differences among samples processed simultaneously. Their latest algorithm release, CRLMM version 2,14 proposes new quality metrics for hybridization batch or plate effects for evaluating SNP accuracy. The developers of CHIAMO invented the algorithm for the purpose of processing all samples (cases and controls) simultaneously and have specific parameters for modeling cohorts. The evolution of these algorithms indicates a positive trend toward methods that allow for more samples across cohorts to be processed simultaneously.
In general, the experimental design of a study must be carefully considered when processing raw intensity data to genotype calls, in order to mitigate biological batch effects (population differences, relatedness of samples), technical batch effects (lab and data collection differences) and statistical batch effects (how samples are processed through the genotyping algorithms). Each of the concurrent publications gives further insight and results on the influence of such batch effects.5, 6, 7, 11
The results of the MAQC GWAWG clearly indicate discordance both within and among genotype calling algorithms. All three genotype calling algorithms, BRLMM, CRLMM and CHIAMO, have similarities and dissimilarities. Obtaining calls is a two-step process, starting with pre-processing of the raw intensities to eliminate lab effects and other array-to-array variability; this step consists of normalization, transformation and summarization. The second step is statistical model fitting and clustering of the SNP intensities. Both BRLMM and CRLMM use flavors of robust multiarray averaging for pre-processing, and CHIAMO likewise uses a quantile normalization method. Although variations in the execution of this first step can no doubt contribute to downstream discordance, the behavior of the algorithms under different conditions presented here and in concurrent publications by the MAQC GWAWG suggests that the greater discordance arises from the statistical modeling and confidence score formulation used for obtaining calls from the intensities. The three methods rely on markedly different algorithms and priors for SNP clustering. BRLMM uses the Dynamic Model (a single-array genotype calling method) to estimate cluster centers and variances, and then determines confidence thresholds based on Mahalanobis distance ratios between cluster centers.5, 8 CRLMM uses the HapMap populations as a training set for its fitted mixture model, obtains genotype probabilities through the expectation-maximization algorithm,7, 9 and bases confidence scores on these probabilities. Finally, CHIAMO was designed with specific parameters for modeling positional cluster differences across multiple cohorts using a hierarchical Bayesian mixture model,4, 6 which is likely why the case–control separated batches performed so poorly for that algorithm.
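The flavor of the BRLMM-style distance-ratio rule can be illustrated with a minimal sketch. This is not the actual BRLMM implementation: it works on a single summarized allele-contrast value with diagonal (one-dimensional) variances, and the cluster centers, variances and confidence threshold are all invented for demonstration:

```python
import math

def mahalanobis_sq(x, center, var):
    """Squared Mahalanobis distance for a 1-D summarized intensity contrast."""
    return (x - center) ** 2 / var

def call_genotype(contrast, clusters, confidence_ratio=0.7):
    """Assign the genotype whose cluster center is nearest in Mahalanobis
    distance; report a no-call when the nearest/second-nearest distance
    ratio exceeds a confidence threshold. All parameters are illustrative."""
    dists = sorted(
        (mahalanobis_sq(contrast, c, v), g) for g, (c, v) in clusters.items()
    )
    (d1, g1), (d2, _) = dists[0], dists[1]
    ratio = math.sqrt(d1) / (math.sqrt(d2) + 1e-12)
    return g1 if ratio < confidence_ratio else "NoCall"

# Hypothetical cluster centers and variances on an allele-contrast scale
clusters = {"AA": (1.0, 0.05), "AB": (0.0, 0.08), "BB": (-1.0, 0.05)}
print(call_genotype(0.95, clusters))  # near the AA center -> confident call
print(call_genotype(0.5, clusters))   # between AA and AB -> ambiguous
```

The point of the sketch is that an intensity falling between two cluster clouds yields a nearest/second-nearest ratio close to 1 and is left uncalled, which is exactly where poorly formed clusters and batch composition can push a SNP from a confident call to a no-call, or to a different call altogether.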
Given the vast statistical differences in these methods, it is not surprising that discordant results in GWAS arise because of calling algorithm choice and execution.
Differences within and between genotype calling algorithms are clearly a source of variability in GWAS and may account for the lack of reproducibility. Replication studies for a trait should always attempt to implement the same calling algorithm and study design as other GWASs for that trait; yet, our analysis of the Ottawa data shows that reproducibility is a many-faceted issue. The failure to produce a consistent signal in the Ottawa data at the chromosome 9p21.3 locus may be due to any of the many issues that plague GWAS. The lack of reproducibility could have biological causes. One concern is that the Ottawa samples may not have been of European descent, as the WTCCC samples were; this would call into question the use of HapMap CEPH haplotypes to form priors for the CRLMM algorithm. The principal component plots for the Ottawa data with HapMap anchors in Figure 10 show that the Ottawa samples are more variable than the CEU population, but in general the data cluster well with the European ancestral group (similar to how the WTCCC samples clustered). In the Results section, it was noted that SNP rs703845 on chromosome 9 had a much higher minor allele frequency in the Ottawa samples than in the WTCCC samples. This finding was an exception to otherwise very similar allele frequency profiles between the two sets of samples, with a correlation of minor allele frequencies of 0.996 across all loci. Another source of biological variability may be phenotypic differences in the sets of cases and controls used: CAD is a complex disease that is influenced by the environment as well as by genetics.
In addition to possible biological variability, there is statistical variability: the sample size of 1850 individuals in the Ottawa study may not provide sufficient power to detect a moderate effect, compared with the 3491 samples in the WTCCC study. The power of a study deeply affects statistical decision making when determining genome-wide significance. SNP rs1333049 showed strong evidence for association in the 3491 WTCCC samples, with an average −log10 P-value of 11.49 across the three algorithms, whereas in the 1850 Ottawa samples this SNP had an average −log10 P-value of 5.02. Given the smaller sample size, and thus the expected loss of power, in the Ottawa data, this P-value does suggest strong evidence of a replication signal on chromosome 9; yet, this result was difficult to elucidate given the P-value patterns in the Ottawa data. Using an empirical genome-wide significance level consistent with the WTCCC data to determine statistical significance may detract from biological meaning when a study is underpowered to detect a strong signal at that threshold. Allowing for different multiple testing thresholds (dependent on the number of SNPs that pass QC) can result in variability in statistical decisions. One solution to this dilemma is to focus more on the ranks of associations (based on test statistic values) to ensure that true associations are found and considered for replication studies15 (refer to Kraft et al.16 for a further discussion of the issue of replication in genome-wide association studies).
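To make the −log10 P-value scale concrete, a single-SNP association test can be sketched as a 1-degree-of-freedom chi-square test on a 2×2 allele-count table. This is a simplification, not the analysis pipeline of the studies discussed (which involve trend tests, QC filtering and covariates); the allele counts below are invented for illustration:

```python
import math

def allelic_chisq_neglog10_p(case_minor, case_major, ctrl_minor, ctrl_major):
    """Pearson chi-square test (1 df) on a 2x2 allele-count table,
    returned on the -log10 P scale. Counts are per allele (two per
    diploid sample). Illustrative only; no continuity correction."""
    table = [[case_minor, case_major], [ctrl_minor, ctrl_major]]
    n = case_minor + case_major + ctrl_minor + ctrl_major
    chisq = 0.0
    for i in range(2):
        for j in range(2):
            row = sum(table[i])
            col = table[0][j] + table[1][j]
            expected = row * col / n
            chisq += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square with 1 df: P = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chisq / 2.0))
    return -math.log10(p)

# Hypothetical minor/major allele counts in cases vs controls
print(round(allelic_chisq_neglog10_p(500, 1500, 400, 1600), 2))
```

On this scale, a modest case–control difference in allele frequency yields a −log10 P-value of a few units, which shows why a signal of 5.02 in the smaller Ottawa sample can be strong evidence of replication even though it falls well short of the 11.49 observed in the larger WTCCC sample.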
Variability in GWAS results arises from many sources. Owing to the nature of complex diseases, large sample sizes are necessary to capture the moderate effects that common genetic polymorphisms contribute to disease heritability. With large sample sizes come biological sources of variability (such as population stratification) as well as technical and statistical sources (genotype calling batch effects, QC filtering and varying statistical power to detect moderate associations). Increased array consistency, more accurate genotype calling algorithms and more robust methods for analyzing the data are necessary for GWAS to continue to produce successful and meaningful results for the understanding of disease heritability. These necessities in terms of genotype calling algorithms are made clear when different algorithms can produce vastly different lists of SNPs found to be associated with a trait. Presently, it is the responsibility of researchers to diligently apply rigorous QC measures, to perform well-designed studies and to share not only data, but also detailed reports of how samples were obtained, how genotypes were called and the statistical steps used to reach the result of an analysis, to ensure reliable and reproducible findings.
We thank all members of the GWAWG and MAQC for their contribution to this article as well as the members of the WTCCC and Dr George Wells from the University of Ottawa Heart Institute for providing access to the data. We also thank the anonymous reviewers for their invaluable feedback, which has made this paper a much improved contribution.
The views presented in this article do not necessarily reflect those of the US Food and Drug Administration.
Supplementary Information accompanies the paper on The Pharmacogenomics Journal website (http://www.nature.com/tpj).