The robustness of genome-wide association study (GWAS) results depends on the genotyping algorithms used to establish the association. This paper is a first assessment of how the genotyping quality of the Corrected Robust Linear Model with Maximum Likelihood Classification (CRLMM) affects the identification of truly significant genes in a GWAS with a large sample size. Using microarray image data from the Wellcome Trust Case–Control Consortium (WTCCC) for 1991 individuals with coronary artery disease (CAD) and 1500 controls, genetic associations were evaluated under various batch sizes and compositions. Experimental designs included batch sizes of 250, 350, 500, and 2000 samples, as well as the whole set of 3491 samples, with cases and controls in each batch either randomized, simply combined (4:3 case–control ratio), or kept separate. The separate composition could create 2–3% discordance in the single nucleotide polymorphism (SNP) results for quality control/statistical analysis and might contribute to the lack of reproducibility between GWAS. CRLMM shows high genotyping accuracy and stability to batch effects. According to the genotypic and allelic tests (P<5.0 × 10−7), nine significant signals on chromosome 9 were found consistently in all batch sizes with the combined design. Our findings are critical to optimizing the reproducibility of GWAS and confirm the genetic role in the pathophysiology of CAD.
Genome-wide association studies (GWAS) offer great opportunities for discovering genes underlying common, complex diseases.1 Considerable effort has already been invested across different therapeutic areas, with the latest being the Wellcome Trust Case–Control Consortium (WTCCC) Projects (http://www.wtccc.org.uk/). Single nucleotide polymorphism (SNP) microarrays are a key technology for high-throughput genotyping, making it possible to assess genome-wide variation and conduct association studies.2 For example, the Affymetrix GeneChip Human Mapping 100K and 500K arrays have been widely used in GWAS, and the SNP 6.0 array with >900 000 SNPs has recently been introduced. At such densities, association studies are theoretically well powered to detect small and moderate genetic effects in samples involving hundreds to thousands of subjects.3 In practice, however, genotyping errors can cause SNP arrays to produce large numbers of false positives and to fail to extract adequate information from the raw data. Therefore, selecting a highly reliable calling algorithm and eliminating unqualified SNPs with quality control (QC) are important.
In general, genotyping algorithms make a call (AA, AB, or BB) for a SNP in each sample, assuming two alleles at a locus. One of the first algorithms designed for calling SNPs was the Adaptive Background Genotype Calling Scheme.4 In its original form, the method fits Gaussian models to the probe intensities associated with a particular SNP on a single chip. However, the program has a propensity to drop heterozygous calls. Affymetrix developed another algorithm, Modified Partitioning Around Medoids,5 for analysis of the 10K chip. Still, the method does not perform well when the number of chips input into the program is of moderate size or for SNPs with a low minor allele frequency (MAF). On the basis of the Adaptive Background Genotype Calling Scheme, Affymetrix developed the dynamic model (DM).6 Although unaffected by small sample size and low MAF, DM is also prone to drop heterozygous calls. In parallel to the basic framework of Modified Partitioning Around Medoids, Rabbee and Speed7 developed the Robust Linear Model with Mahalanobis Distance Classifier (RLMM). It uses a regression strategy to infer cluster characteristics and makes calls with markedly greater accuracy than DM. However, it is not robust to variability in procedures used by different laboratories.7 Later, Affymetrix introduced the Bayesian Robust Linear Model with Mahalanobis Distance Classifier (BRLMM)8 for 100K and 500K SNP chip arrays. It uses DM to make initial guesses and form a prior for cluster characteristics. Clusters for each SNP are then re-calibrated in an ad hoc Bayesian manner; clusters that are populated with few data points because of low MAF can draw more influence from the prior. The current Affymetrix product, the SNP 6.0 array, comes with yet another algorithm: Birdseed (http://www.broadinstitute.org/mpg/birdsuite/birdseed.html). In 2007, Carvalho et al.9 developed a pre-processing algorithm designed to remove the bulk of the laboratory effect. The resulting algorithm is referred to as the Corrected Robust Linear Model with Maximum Likelihood Classification (CRLMM). Similar to BRLMM, CRLMM includes a re-calibration step using a Bayesian framework to adjust clusters to account for residual effects. A procedure is also added that explicitly ties metrics of call confidence to per-SNP accuracy in a manner robust to chip-run quality.
The first goal of this paper is to validate new features of CRLMM calling on Affymetrix 500K chip data through a GWAS with a large sample size (3491 subjects), which has not previously been conducted successfully in the published literature. The second goal is to describe qualification benchmarks to be used for comparison purposes. The importance of sound assessment protocols was underscored by the recent coronary artery disease (CAD) information from the WTCCC.10 Comparing CRLMM with the calling algorithms BRLMM and CHIAMO (the latter developed by the WTCCC), we find that CRLMM provides more accurate genotype calls across data sets and offers substantially improved estimates of accuracy.
Materials and methods
We used the raw intensity data (CEL files) from the WTCCC, representing runs of varying quality, various batches, and different case–control combinations. In all, 1991 CAD cases and 1500 normal individuals from the UK National Blood Service control group were available. All individuals were genotyped on the Affymetrix 500K Array, which consists of two sets (Nsp and Sty), each capable of genotyping ∼250 000 SNPs. In total, 500 568 SNPs were genotyped.
We built a Dell PowerEdge 2900 server with 2 × quad-core Intel Xeon E5355 2.66 GHz CPUs, a 500 GB SCSI hard drive, and 28 GB of RAM. The operating system was Red Hat Enterprise Linux 5 Server Edition (x86_64) with Sun Grid Engine.
A brief summary of the pre-processing and genotyping algorithms is presented here; for additional technical detail, see Carvalho et al.9 Starting with available CEL files provided by Affymetrix, CRLMM summarizes the probes associated with each SNP in a manner similar to SNP Robust Multiarray Average.11 The resulting values are proportional to the log2 of the quantity of DNA in the target sample associated with alleles A and B. CRLMM describes SNP variation effects with a simple mixture model, estimates the model separately on each array and treats the sense and antisense features as exchangeable. The Expectation–Maximization algorithm is used to obtain genotype calls by estimating and maximizing the probability of each class for each SNP with 99% concordance rates. A supervised learning approach yields more accurate genotype calls and eliminates laboratory-specific effects by using an empirical Bayes solution to predict centers and scales for cases with few or no observations.12
BRLMM has a standard cutoff of 0.5 for call/no-call; similarly, we set 0.94 as the confidence threshold for calling a genotype missing in CRLMM. In addition, CRLMM has a ‘batch_size’ argument with a default value of 40 000, meaning that 40K SNPs are processed at a time. Given our computer specification, we decreased it to 10K so as not to exceed memory capacity. A software implementation of CRLMM is freely available through the oligo package at Bioconductor (http://www.bioconductor.org/), which is an open development software project running under the statistical computing program R (http://www.r-project.org/).
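The confidence cutoff and the chunked SNP processing described above can be illustrated with a minimal Python sketch. This is not the CRLMM implementation (which runs in R); the function names are hypothetical, and treating a confidence exactly equal to 0.94 as a valid call is an assumption, since the text does not say which side of the threshold is inclusive.

```python
from typing import Iterator, List, Optional, Sequence, Tuple

CONFIDENCE_THRESHOLD = 0.94  # cutoff stated in the text
CHUNK_SIZE = 10_000          # SNPs per pass, reduced from the 40 000 default


def apply_confidence_cutoff(
    calls: Sequence[str],
    confidences: Sequence[float],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> List[Optional[str]]:
    """Replace calls whose confidence is below the threshold with None (missing).

    Assumption: a confidence equal to the threshold is kept as a call.
    """
    if len(calls) != len(confidences):
        raise ValueError("calls and confidences must have the same length")
    return [c if conf >= threshold else None for c, conf in zip(calls, confidences)]


def call_rate(calls: Sequence[Optional[str]]) -> float:
    """Fraction of non-missing genotype calls."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def snp_chunks(n_snps: int, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) index ranges covering n_snps in chunks."""
    for start in range(0, n_snps, chunk_size):
        yield start, min(start + chunk_size, n_snps)
```

Smaller chunks trade run time for a lower peak memory footprint, which is the trade-off the reduced setting reflects.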
Batch effect schema
WTCCC raw data were normalized and genotyped following the experimental design to estimate the impact on CRLMM of batch size (that is, the number of samples processed simultaneously) and batch composition (that is, the case-to-control ratio in a batch). Several batch sizes were tested, including 500, 2000, and 3491 samples per batch. Three levels of batch composition were used: separate (S), with cases and controls in different batches; combined (C), with a 4/3 ratio of cases to controls assigned to batches simply by plate (96 samples per plate); and randomized (R), using a systematic sampling methodology. Systematic sampling is a statistical method involving the selection of elements from an ordered sampling frame when the given case/control samples are logically homogeneous. The combined (C) design used the order of the original WTCCC samples, whereas the randomized (R) design assigned a unique random number to each sample, separately for cases and controls, sorted the samples by random number, and then picked the files at a fixed interval (interval=batches). This was one way to repeat the experiment by generating new sets of data of the same combination.
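The randomized (R) assignment can be sketched in Python as follows. This is an illustrative reading of the procedure rather than the authors' code: the function names and seeds are hypothetical, and "interval=batches" is assumed to mean that every k-th sample in the sorted order goes to the same batch, where k is the number of batches.

```python
import random
from typing import List, Sequence


def systematic_batches(sample_ids: Sequence[str], n_batches: int, seed: int = 0) -> List[List[str]]:
    """Assign a random key to each sample, sort by key, then take every
    n_batches-th sample for each batch (systematic sampling)."""
    rng = random.Random(seed)
    keyed = sorted((rng.random(), s) for s in sample_ids)
    ordered = [s for _, s in keyed]
    return [ordered[i::n_batches] for i in range(n_batches)]


def randomized_design(
    case_ids: Sequence[str],
    control_ids: Sequence[str],
    n_batches: int,
    seed: int = 0,
) -> List[List[str]]:
    """Build batches that each mix cases and controls at roughly the overall
    case-to-control ratio; cases and controls are sampled separately."""
    case_batches = systematic_batches(case_ids, n_batches, seed)
    control_batches = systematic_batches(control_ids, n_batches, seed + 1)
    return [cases + controls for cases, controls in zip(case_batches, control_batches)]
```

With 1991 cases and 1500 controls split into seven batches, each batch receives 284 or 285 cases and 214 or 215 controls, so batch sizes differ by at most a couple of samples.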
In our experimental design, nine different data sets of genotype calls for the 3491 samples were used and annotated as follows. C500 and R500 contained genotypes called by CRLMM in seven batches, each consisting of 285 cases and 215 controls for a combined batch size of 500. C2000 and R2000 genotypes were called in two batches of 995 cases and 750 controls. S500 used four batches of 500 cases and three batches of 500 controls. S2000 consisted of one batch of 1991 cases and one batch of 1500 controls. We also explored even smaller batch sizes: R250 (14 batches, each combining 143 cases and 107 controls) and R350 (10 batches, each combining 200 cases and 150 controls). R3491 was created as a single batch of all 3491 samples with systematic sampling.
Note that because the 1991 cases cannot be divided evenly, some batches may have contained one sample more or fewer in that scenario. Our design allowed for the quantification of the effect of both batch size and composition as well as the possible interaction of the two effects. Table 1 gives the format of the design matrix and parameter vector for the fixed batch effect levels: R250, R350, R500, C500, S500, R2000, C2000, S2000, and R3491.
CRLMM genotyping output files were analyzed with the SNPassoc and scrime packages from Bioconductor and our own R code. The genotype data for each batch over the Nsp and Sty arrays were merged and formatted into a wide data set with rows corresponding to individuals and columns to markers. SNP annotation was downloaded from http://www.affymetrix.com/support/technical/annotationfilesmain.affx. The following QC steps were carried out for samples in each batch. To eliminate low-quality chips, individuals with a call rate <97% were excluded; individuals were further dropped if their average heterozygosity was below 23% or exceeded 30% (the empirical thresholds used in the original data analysis by the WTCCC). SNP QC consisted of three steps. First, markers with either an MAF <1% or a call rate <95% were excluded. Second, the remaining markers were filtered using a χ2 one degree of freedom test for significant differences in the proportion of missing data between cases and controls. Third, SNPs that failed the test for Hardy–Weinberg Equilibrium in the controls were eliminated from the analysis. Significance for both the missing-data test between cases and controls and the Hardy–Weinberg Equilibrium test was determined using α=5.7 × 10−7, an empirical threshold used by the original WTCCC study.
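The sample and SNP filters above can be expressed compactly. The following Python sketch is illustrative only (the analysis itself was done in R): all function names are hypothetical, genotypes are assumed to be coded "AA"/"AB"/"BB" with None for missing, the missing-data test is assumed to be a 2 × 2 Pearson χ2 test, and the Hardy–Weinberg test is assumed to be the Pearson χ2 approximation, because the text does not state which variant was used.

```python
import math
from collections import Counter

ALPHA = 5.7e-7  # threshold used for the missing-data and HWE tests


def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def sample_passes_qc(calls):
    """One individual's calls across SNPs: call rate >= 97%, heterozygosity 23-30%."""
    called = [g for g in calls if g is not None]
    if not calls or len(called) / len(calls) < 0.97:
        return False
    het = sum(g == "AB" for g in called) / len(called)
    return 0.23 <= het <= 0.30


def maf(calls):
    """Minor allele frequency among non-missing calls."""
    counts = Counter(g for g in calls if g is not None)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    freq_a = (2 * counts["AA"] + counts["AB"]) / (2 * n)
    return min(freq_a, 1.0 - freq_a)


def missingness_pvalue(case_calls, control_calls):
    """2 x 2 chi-square test of missing vs called, cases vs controls (assumed form)."""
    a = sum(g is None for g in case_calls)
    b = len(case_calls) - a
    c = sum(g is None for g in control_calls)
    d = len(control_calls) - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    return chi2_sf_1df((a + b + c + d) * (a * d - b * c) ** 2 / denom)


def hwe_pvalue(calls):
    """Pearson chi-square test for Hardy-Weinberg equilibrium (assumed variant)."""
    counts = Counter(g for g in calls if g is not None)
    n = sum(counts.values())
    if n == 0:
        return 1.0
    p = (2 * counts["AA"] + counts["AB"]) / (2 * n)
    q = 1.0 - p
    expected = {"AA": n * p * p, "AB": 2 * n * p * q, "BB": n * q * q}
    if min(expected.values()) == 0:
        return 1.0  # monomorphic marker: test is undefined
    stat = sum((counts[g] - e) ** 2 / e for g, e in expected.items())
    return chi2_sf_1df(stat)


def snp_passes_qc(case_calls, control_calls):
    """Apply the three SNP QC steps in the order given in the text."""
    all_calls = list(case_calls) + list(control_calls)
    rate = sum(g is not None for g in all_calls) / len(all_calls)
    if maf(all_calls) < 0.01 or rate < 0.95:
        return False
    if missingness_pvalue(case_calls, control_calls) < ALPHA:
        return False
    return hwe_pvalue(control_calls) >= ALPHA
```

Because the missing-data test compares cases with controls, it is the step most sensitive to calling the two groups in separate batches.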
The samples and SNPs that passed QC were evaluated in the nine designs for significant associations with disease status (CAD) using PLINK (v1.06) (http://pngu.mgh.harvard.edu/~purcell/plink) with the one degree of freedom Cochran–Armitage trend test13 under an additive genetic model and the two degrees of freedom genotypic test. An association was considered significant if the test P-value was <5.7 × 10−7, using the empirical threshold previously proposed by the WTCCC study.
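For readers unfamiliar with the trend test, a minimal Python sketch of the one degree of freedom Cochran–Armitage statistic with additive weights (0, 1, 2) is given below. It is a textbook formulation written for illustration, not PLINK's code; the function name and the example counts in the usage note are hypothetical.

```python
import math
from typing import Sequence, Tuple

ALPHA = 5.7e-7  # significance threshold stated in the text


def cochran_armitage_trend(
    case_counts: Sequence[int],
    control_counts: Sequence[int],
    weights: Sequence[float] = (0, 1, 2),
) -> Tuple[float, float]:
    """Return (chi-square statistic, P-value) for the 1-df trend test.

    case_counts and control_counts hold genotype counts in the order
    (AA, AB, BB); weights (0, 1, 2) correspond to an additive model.
    """
    col_totals = [r + s for r, s in zip(case_counts, control_counts)]
    n = sum(col_totals)
    n_cases = sum(case_counts)
    n_controls = n - n_cases
    sum_wr = sum(w * r for w, r in zip(weights, case_counts))
    sum_wn = sum(w * c for w, c in zip(weights, col_totals))
    sum_w2n = sum(w * w * c for w, c in zip(weights, col_totals))
    denom = n_cases * n_controls * (n * sum_w2n - sum_wn ** 2)
    if denom == 0:
        return 0.0, 1.0  # degenerate table: no cases, no controls, or one genotype
    chi2 = n * (n * sum_wr - n_cases * sum_wn) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square upper tail, 1 df
    return chi2, p_value
```

For example, `cochran_armitage_trend((10, 20, 30), (30, 20, 10))` gives a statistic of 20.0; a SNP would be declared significant only when the returned P-value falls below `ALPHA`.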
Overall computation performance
Total run time was almost the same for all designs when the batches were run sequentially. For example, CRLMM genotyping of R3491 needed 16 days (8 days for Nsp and 8 days for Sty). R2000 also needed 16 days unless the system was allowed to run two batches simultaneously. Genotype calling for all nine designs took about 5 months to complete.
Nine designs of CRLMM genotype calls were generated under different levels of batch size and composition for the WTCCC data: R250, R350, R500, R2000, R3491, C500, C2000, S500, and S2000. QC to exclude both unqualified individuals and SNP markers before association testing was carried out for all the data sets. Figure 1 shows the flow of samples and loci through the exclusion steps in the nine data sets. For sample exclusion, there was a slight trend toward more individuals being excluded for a call rate <97% as batch sizes increased, for both combined and separate batch compositions. The sample that did not pass the heterozygosity filter was the same in all data sets, namely 12707B11 (CAD, male). QC results for SNP exclusion showed that more SNPs tended to have significant differences in call rate between cases and controls when cases and controls were in separate batches. Therefore, the numbers of SNPs that passed the QC steps were slightly larger in combined compositions than in separate compositions. Batch effects in CRLMM did not have as much of an impact as those in BRLMM on the number of SNPs that were excluded by QC. For designs R250, R350, R500, and R2000, the number of SNPs dropped from a given batch set was ∼115 000–120 000, and the majority of them (112 599 loci) were excluded in common in all four data sets. Even with the different compositions of C500, C2000, S500, S2000, and R3491, where the number of SNPs dropped from a given batch set was ∼117 000–125 000, the majority (113 599 loci) were still excluded in common in all five designs. Although the proportion of SNPs discordant due to QC between two batch sets was no greater than 2–3%, several thousand markers that were dropped by QC for genotype calls from one design might still be analyzed for significant associations in another. The Venn diagram in Figure 2 shows the breakdown of all SNP counts that are either concordant or discordant across batch sets.
Markers that survived QC were tested for association with CAD status and deemed significant at the α=5.7 × 10−7 criterion. Figure 3 displays the P-values of the Cochran–Armitage trend test in the nine designs as −log10 (P-value) across the genome, sorted by chromosome and Mb position. Across all batch sets, nine SNPs in the chromosome 9 region were highly significant (Table 2) in either the genotypic test or the allelic test under different genetic models (additive, recessive, dominant).
Very little difference appeared among the different batch sizes of R250, R350, R500, and R2000. When comparing separate compositions with combined compositions (including systematic sampling), separate compositions indicated slightly inflated P-values, whereas combined design and randomized design results were consistent across different sets of data. Figure 4 plots the pairwise P-values in two designs on the −log10 scale for all SNPs that passed QC. The solid gray lines indicate the 45 degree angle and the dotted gray lines fall at the α=5.0 × 10−7 level. Points that fall off the diagonal line indicate SNPs with discordant P-values, and any SNPs in either of the two off-diagonal rectangles boxed in by the dotted lines are those that resulted in differential significance decisions. There was very little discordance between the combined compositions (including systematic sampling), whereas discordant P-values tended to be more significant in data with separate case–control batch compositions.
Figure 5 contains Venn diagrams representing the counts of SNPs with P<5.0 × 10−7 for one batch set compared with results from another batch set. Five pairwise comparisons were of interest to interrogate the effects of batch size (C500 vs C2000, C500 vs R3491, and S500 vs S2000) and composition (C500 vs S500 and C2000 vs S2000). The areas of the circles that do not overlap show the discordant lists of significant findings between batches. For each Venn diagram, the source of discordance (non-overlapping circles) can be categorized as due to QC (a SNP was excluded and hence missing in one data set but qualified in another) or due to differential test results (the qualified SNP was in both data sets but was found significant in only one batch set). The bars below the Venn diagrams represent the source of discordance for that pairwise batch set comparison and are broken down by color. Black corresponds to SNPs that were missing (excluded due to QC) in the first data set in the comparison label but found significant in the second batch set. Light gray corresponds to discordant SNPs found significant in the first batch set but QC excluded in the second. Dark gray indicates SNPs significant in the second batch set that were not significant in the first. Finally, medium gray corresponds to SNPs found significant in the first batch set in the comparison but not in the second. Using the first bar on the left as an example, black (at the bottom of the bar) represents the count of SNPs that were found significant in C2000 but were excluded due to QC in C500 (that is, missing in the first data set listed on the x axis label), light gray (second color from the bottom) is the count of discordant SNPs that were significant in C500 but QC excluded in C2000, dark gray is the number of SNPs that were significant in C2000 but not in C500 (differential testing), and medium gray is the count of SNPs that were significant in C500 but not in C2000. The medium and light gray portions of the bar correspond to the SNP count in the left non-overlapping portion of the Venn diagram, whereas the black and dark gray portions add up to the count in the right non-overlapping portion.
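The four discordance categories can be made concrete with a short Python sketch. It assumes, purely for illustration, that each batch set is represented as a dictionary mapping SNP identifiers to P-values for SNPs that passed QC, so that a SNP absent from the dictionary was excluded by QC; the function name, category keys, and SNP identifiers in the test data are hypothetical.

```python
from typing import Dict, Set

ALPHA = 5.0e-7  # significance level used for the Venn diagrams


def discordance_breakdown(
    first: Dict[str, float],
    second: Dict[str, float],
    alpha: float = ALPHA,
) -> Dict[str, Set[str]]:
    """Classify significant SNPs from two batch sets by source of discordance.

    first/second map SNP id -> P-value for SNPs that passed QC in that set.
    """
    sig_first = {s for s, p in first.items() if p < alpha}
    sig_second = {s for s, p in second.items() if p < alpha}
    return {
        # significant in both batch sets (overlap of the Venn circles)
        "concordant": sig_first & sig_second,
        # black: QC excluded in the first set, significant in the second
        "black": {s for s in sig_second if s not in first},
        # light gray: significant in the first set, QC excluded in the second
        "light_gray": {s for s in sig_first if s not in second},
        # dark gray: tested in both, significant only in the second
        "dark_gray": {s for s in sig_second if s in first and s not in sig_first},
        # medium gray: tested in both, significant only in the first
        "medium_gray": {s for s in sig_first if s in second and s not in sig_second},
    }
```

Under this representation, the left non-overlapping Venn region is the union of the medium and light gray sets, and the right region is the union of the black and dark gray sets.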
The largest amount of overall discordance was found between S500 and S2000, a comparison that also showed by far the largest amount of discordance due to association testing results. The comparison of C500 vs S500 showed the largest amount of discordance due to QC exclusion, where many SNPs that were found significant in S500 were excluded due to QC in C500. In contrast to the batch composition comparisons, comparisons of different batch sizes showed a much smaller proportion of discordance due to QC exclusion than due to test results. The discordance for C500 vs C2000 and C500 vs R3491 showed similar patterns, indicating that an increase in the magnitude of the batch size did not appear to influence results. Although S500 vs S2000 showed the most discordance, C500 vs R3491 showed the least; hence the effect of batch size is not strong. From the comparisons of C500 vs S500 and C2000 vs S2000, and in contrast to BRLMM, an interactive effect of batch size and composition was not obvious in CRLMM.
GWAS have been widely used to discover significant genetic signals underlying complex, heritable disorders. Microarrays provide the genotype calling technology in GWAS, as they can interrogate more than a million SNPs simultaneously. Making genotype calls is the first step before analyzing the association between genetic effects and disease status in a GWAS. Various sophisticated algorithms have been proposed for transforming raw data into genotype calls. The variability in microarray output quality across different SNPs, different arrays, and different sample batches has a substantial influence on the accuracy of genotype calls made by existing algorithms, which in turn affects the quality of findings reported by the GWAS.
We have successfully applied the CRLMM algorithm to WTCCC GWAS data comprising 3491 Affymetrix 500K chips; an application to such a large data set has not previously been reported in the published literature. Hong et al.14 evaluated 270 HapMap samples using Affymetrix 6.0 and found that batch effects were even smaller than with Affymetrix 500K. We recently completed another CRLMM application on Ottawa Heart Institute GWAS data with 1879 samples run on Affymetrix 500K chips, and observed similar robustness to batch effects. CRLMM accounts for variability across batches and improves the call-specific assessment of each call. The enhanced hierarchical model permits the development of quality metrics for the identification of low-quality SNPs, samples, and batches.15
We have also described an experimental design to estimate the effects of batch sizes and compositions, which is useful to compare CRLMM with other existing calling algorithms such as BRLMM and CHIAMO. Batch size and composition differences in the BRLMM and CHIAMO algorithms can drastically change the results of a GWAS and are a potential source for lack of reproducibility. In BRLMM and CHIAMO, batches containing more individuals with cases and controls combined resulted in more conservative and concordant associations.10
We have shown, using the WTCCC CAD data set as an example, that after CRLMM genotyping batch size does not influence the total number of samples and SNPs passing QC procedures or the final list of significantly associated SNPs. However, batch composition can produce discordant results in QC decisions and subsequent association testing. Because the separate composition randomizes and normalizes cases and controls separately, plate and outcome of interest are commonly partially confounded when genotyping. It is therefore difficult to distinguish real from artifactual associations. Imposing a stricter QC measure can eliminate discordance in association results, but can also eliminate a large portion of potentially informative data. Although these algorithms may need to address the case/control composition issue, CRLMM has been enhanced with a model that provides much improved probabilities and a powerful probability-based approach to detect problematic SNPs and batches. Comparing designs of separate composition across the three algorithms, CRLMM produces the lowest number of false positives in the final association results.
CRLMM is implemented as a Bioconductor package and must be run in the R environment. It is more computationally intensive and demands more computing resources, such as CPU time and memory, than BRLMM and CHIAMO. Our experimental design and QC procedure provide a good example for validating and comparing the performance of the three algorithms on a GWAS with a large sample size.
Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science 1996; 273: 1516–1517.
Dong S, Wang E, Hsie L, Cao Y, Chen X, Gingeras TR. Flexible use of high-density oligonucleotide arrays for single-nucleotide polymorphism discovery and validation. Genome Res 2001; 11: 1418–1424.
Lin S, Chakravarti A, Cutler DJ. Exhaustive allelic transmission disequilibrium tests as a new approach to genome-wide association studies. Nat Genet 2004; 36: 1181–1188.
Cutler DJ, Zwick ME, Carrasquillo MM, Yohn CT, Tobin KP, Kashuk C et al. High-throughput variation detection and genotyping using microarrays. Genome Res 2001; 11: 1913–1925.
Liu WM, Di X, Yang G, Matsuzaki H, Huang J, Mei R et al. Algorithms for large-scale genotyping microarrays. Bioinformatics 2003; 19: 2397–2403.
Di X, Matsuzaki H, Webster TA, Hubbell E, Liu G, Dong S et al. Dynamic model based algorithms for screening and genotyping over 100K SNPs on oligonucleotide microarrays. Bioinformatics 2005; 21: 1958–1963.
Rabbee N, Speed TP. A genotype calling algorithm for Affymetrix SNP arrays. Bioinformatics 2006; 22: 7–12.
BRLMM: an improved genotype calling method for the GeneChip human mapping 500K array set. http://media.affymetrix.com/support/technical/whitepapers/brlmm_whitepaper.pdf.
Carvalho B, Bengtsson H, Speed TP, Irizarry RA. Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics 2007; 8: 485–499.
The Wellcome Trust Case Control Consortium. Genome-wide association study of 14 000 cases of seven common diseases and 3000 shared controls. Nature 2007; 447: 661–678.
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003; 4: 249–264.
Lin S, Carvalho B, Cutler DJ, Arking DE, Chakravarti A, Irizarry RA. Validation and extension of an empirical Bayes method for SNP calling on Affymetrix microarrays. Genome Biol 2008; 9: R63.
Armitage P. Tests for linear trends in proportions and frequencies. Biometrics 1955; 11: 375–386.
Hong H, Su Z, Ge W, Shi L, Perkins R, Fang H et al. Assessing batch effects of genotype calling algorithm BRLMM for the Affymetrix GeneChip Human Mapping 500K array set using 270 HapMap samples. BMC Bioinformatics 2008; 9 (Suppl 9): S17.
Carvalho B, Louis TA, Irizarry RA. Quantifying uncertainty in genotype calls. Bioinformatics 2010; 26: 242–249.
We thank the Wellcome Trust Case–Control Consortium (WTCCC), UK, for providing us with the original Affymetrix 500K data for CAD patients and controls. We thank Benilton Carvalho and Rafael A Irizarry (Johns Hopkins University) for advising us on setting up CRLMM on our computer server. The views presented in this article do not necessarily reflect those of the US Food and Drug Administration.
The authors declare no conflict of interest.
About this article
Cite this article
Zhang, L., Yin, S., Miclaus, K. et al. Assessment of variability in GWAS with CRLMM genotyping algorithm on WTCCC coronary artery disease. Pharmacogenomics J 10, 347–354 (2010) doi:10.1038/tpj.2010.27