Introduction

Genome-wide association (GWA) studies of complex human diseases have become increasingly popular owing to the rapid decrease in genotyping costs and the recent completion of the International HapMap Project.1, 2, 3, 4 By interrogating hundreds of thousands of SNPs in a large collection of human subjects, GWA studies allow a comprehensive scan of the genome and have the potential to identify novel disease-related genes. The advent of GWA studies has led to the discovery of susceptibility genes for age-related macular degeneration,5 cardiac repolarization,6 obesity,7 inflammatory bowel disease,8 and type II diabetes.9

However, many issues in designing and analyzing GWA studies remain unclear. For example, when designing a GWA study, an investigator has to choose among several SNP chips. Ideally, one would choose the SNP chip that provides the best genomic coverage for the studied population. However, given the higher cost of a denser chip, one would also want to know how much power a denser chip gains over a less dense one. The decision depends largely on a comparison of the different SNP chips, which makes systematic and thorough evaluation essential.

The most commonly used criterion for SNP chip evaluation is global coverage, defined as the fraction of common SNPs that are tagged by the SNPs on the chip.10, 11 The global coverage is clearly the most relevant criterion, as it represents the average level of coverage of all common SNPs. However, the HapMap data showed in great detail the extent of local variation in linkage disequilibrium (LD) across the genome. Since coverage is calculated based on LD, one would expect variation in coverage as well. Although the global coverage provides an overall evaluation of an SNP chip, it does not tell us how the coverage varies across the genome, an important feature that should be taken into consideration because coverage variation often results in power variation in subsequent association analysis.

To achieve a fuller understanding of the coverage of SNP chips, we propose carrying out more detailed coverage evaluations, including a map of local coverage over small consecutive genomic regions, and gene coverage that is calculated for each known gene in the genome. These evaluations reveal the degree of variation of each SNP chip in covering the genome and can facilitate SNP chip comparisons at a finer scale. We evaluate both the local coverage and gene coverage for six currently available SNP chips, including Affymetrix SNP Array 5.0 and SNP Array 6.0, and Illumina HumanHap300, HumanHap550, HumanHap650Y, and Human1M. Since the power for regions or genes of low coverage is likely to be lower than that for regions or genes of high coverage, information on local coverage and gene coverage can help determine if supplementary genotyping is necessary for the success of a GWA study.

Methods

Data sets

We considered the six SNP chips most commonly used in GWA studies: Affymetrix SNP Array 5.0 (500 568 SNPs) and SNP Array 6.0 (934 968 SNPs), and Illumina HumanHap300 (317 511 SNPs), HumanHap550 (555 352 SNPs), HumanHap650Y (660 917 SNPs), and Human1M (1 072 820 SNPs). The Illumina SNP chips include tag SNPs derived from over two million common SNPs (minor allele frequency, MAF, ≥0.05) in the HapMap data. The Affymetrix SNP Array 5.0 includes SNPs selected on the basis of sequence constraints when choosing the probes, and thus represents a set of quasi-random SNPs that ignores LD patterns.10 The additional SNPs in the SNP Array 6.0 are mostly tag SNPs. Allele frequency and LD data for the four HapMap populations (CEU, CHB, JPT, and YRI) were obtained from HapMap release no. 21.

Local coverage

We estimated the coverage of the six SNP chips for chromosomal regions of size 1 Mb throughout the genome. We adapted the formula of Barrett and Cardon10 to estimate the local coverage rate for each of the four HapMap populations. Briefly, for each 1 Mb region, we obtained R, the number of common SNPs in the HapMap; T, the number of common SNPs on the SNP chip; and L, the number of common SNPs that are not on the SNP chip but are tagged at r² ≥ 0.8 by at least one SNP on the chip within 250 kb. Let G denote the total number of common SNPs in the region under consideration, including those that have already been discovered and those that have yet to be discovered. Following Barrett and Cardon,10 the local coverage rate is estimated by

$$\text{coverage} = \frac{\dfrac{L}{R-T}\,(G-T) + T}{G} \tag{1}$$

Here L/(R−T) is the fraction of HapMap common SNPs not on the chip that are tagged by SNPs on the chip. Multiplying this fraction by G−T yields the number of common SNPs in the region that are not on the chip but can be tagged by SNPs on the chip. T is then added to this number to give an estimate of the total number of SNPs that are captured either by LD tagging or by inclusion on the chip. Compared to a naïve estimate of coverage, (L+T)/R, this formula corrects for overestimation of coverage.10

The value of G is unknown and needs to be estimated. For a 1 Mb region, the average number of common SNPs is estimated to be about 2631, based on the estimated numbers of common SNPs (7.5 × 10^6) and euchromatic base pairs (2.85 × 10^9) in the human genome.10, 11 We recognize that different estimates of G may lead to different values of the local coverage rate. However, the above formula can be rewritten as L/(R−T) + [1 − L/(R−T)] × T/G, which indicates that the value of G has little effect on the final estimate as long as the fraction of common SNPs included in the SNP chip, T/G, is small, which is true for the six SNP chips we evaluated.
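As an illustration only, the estimator and its rewritten form can be sketched in a few lines of Python. The function names and the example counts (R = 1000, T = 100, L = 720) are hypothetical and are not taken from the study; the default G = 2631 is the per-Mb estimate quoted above, and the handling of the degenerate case R ≤ T is an assumption of this sketch.

```python
def local_coverage(R, T, L, G=2631.0):
    """Coverage estimate [L/(R-T) * (G-T) + T] / G for one region.

    R: common HapMap SNPs in the region
    T: common SNPs on the chip
    L: common SNPs not on the chip but tagged by a chip SNP
    G: assumed total number of common SNPs in the region
    """
    if R <= T:
        # Assumption of this sketch: the tagged fraction is undefined
        # when no HapMap common SNP lies off the chip.
        raise ValueError("R must exceed T")
    tagged_fraction = L / (R - T)
    return (tagged_fraction * (G - T) + T) / G


def local_coverage_rewritten(R, T, L, G=2631.0):
    """Algebraically equivalent form: f + (1 - f) * T / G."""
    f = L / (R - T)
    return f + (1.0 - f) * T / G


def naive_coverage(R, T, L):
    """Naive estimate (L + T) / R, which tends to overestimate coverage."""
    return (L + T) / R


# Hypothetical counts for a single 1 Mb window.
corrected = local_coverage(1000, 100, 720)
naive = naive_coverage(1000, 100, 720)
print(f"corrected={corrected:.4f} naive={naive:.4f}")
```

With these made-up counts, changing G between 2000 and 3000 moves the estimate by well under one percentage point, which is the insensitivity to G described above.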

To calculate the local coverage rate across the genome, we moved the 1 Mb window in steps of 200 kb and repeated the calculation until the end of the chromosome. We did not calculate the value for a window if (1) the number of common SNPs in the HapMap was <20, (2) all common SNPs were located in the left or right half of the window, or (3) the common SNPs were clustered at the ends of the window with a big gap (≥500 kb) in between. As a result, coverage was not calculated for about 7% of the genome, most of which lies in heterochromatic regions and has effectively no coverage from the current SNP chips.
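The window scan and its three exclusion rules might be sketched as follows. This is a minimal sketch, not the authors' code: the function names are hypothetical, criterion (3) is interpreted here as "the largest gap between consecutive common SNPs is at least 500 kb", and the scan stops at the last window that fits entirely within the chromosome, which is one possible reading of "until the end of the chromosome".

```python
from bisect import bisect_left

WINDOW = 1_000_000   # 1 Mb window
STEP = 200_000       # 200 kb step
MIN_SNPS = 20        # criterion (1)
MAX_GAP = 500_000    # criterion (3), interpreted as largest consecutive gap


def window_is_usable(positions, start, window=WINDOW):
    """Apply the three exclusion rules to the sorted SNP positions
    that fall inside [start, start + window)."""
    if len(positions) < MIN_SNPS:
        return False
    mid = start + window / 2
    if all(p < mid for p in positions) or all(p >= mid for p in positions):
        return False  # every SNP sits in one half of the window
    largest_gap = max(b - a for a, b in zip(positions, positions[1:]))
    return largest_gap < MAX_GAP


def sliding_windows(sorted_positions, chrom_length, window=WINDOW, step=STEP):
    """Yield (start, positions_in_window, usable) for each window."""
    start = 0
    while start + window <= chrom_length:
        lo = bisect_left(sorted_positions, start)
        hi = bisect_left(sorted_positions, start + window)
        in_window = sorted_positions[lo:hi]
        yield start, in_window, window_is_usable(in_window, start, window)
        start += step


# Hypothetical SNP positions: evenly spaced every 40 kb over the first 1 Mb.
positions_even = list(range(0, 1_000_000, 40_000))
results = list(sliding_windows(positions_even, 2_000_000))
print(len(results), results[0][2], results[-1][2])
```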

Gene coverage

The local coverage calculation procedure can also be applied to calculate the coverage of each gene in the genome. To obtain the start and end positions of genes, we downloaded the knownGene table (which contains the positions of transcripts for known protein-coding genes) and the kgXref table (which contains cross-references between transcript IDs and gene symbols) from the UCSC human genome release hg17. A gene region is defined as the region from the transcriptional start position to the transcriptional end position, including both exons and introns. For a gene that has more than one transcript, the gene region is defined as the union of the regions of all its transcripts. By merging the knownGene and kgXref tables and eliminating genes that map onto different chromosomes, we obtained 29 815 autosomal and X-linked gene regions. Gene regions vary greatly in size, and those containing very few HapMap common SNPs may have unreliable or inflated coverage results because the design of most current SNP chips relied on the HapMap data. We therefore considered only gene regions containing five or more HapMap common SNPs, resulting in 19 913 gene regions for the CEU sample in the final analysis (19 299 for CHB, 19 211 for JPT, and 20 694 for YRI).
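The construction of gene regions described above can be sketched as below. This is an illustrative sketch under stated assumptions: the input is a hypothetical list of (gene symbol, chromosome, transcript start, transcript end) tuples, as might be obtained after joining the two tables; "union" is implemented as merging overlapping transcript intervals; and intervals are treated as closed when counting SNPs, which simplifies the UCSC coordinate convention.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

MIN_COMMON_SNPS = 5  # genes with fewer HapMap common SNPs are skipped


def build_gene_regions(transcripts):
    """Return {symbol: (chrom, [(start, end), ...])} with transcript
    intervals merged; genes seen on more than one chromosome are dropped."""
    by_gene = defaultdict(list)
    for symbol, chrom, start, end in transcripts:
        by_gene[symbol].append((chrom, start, end))

    regions = {}
    for symbol, records in by_gene.items():
        chroms = {chrom for chrom, _, _ in records}
        if len(chroms) != 1:
            continue  # maps onto different chromosomes
        merged = []
        for start, end in sorted((s, e) for _, s, e in records):
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)  # overlap: extend
            else:
                merged.append([start, end])
        regions[symbol] = (chroms.pop(), [tuple(iv) for iv in merged])
    return regions


def count_snps(intervals, sorted_positions):
    """Number of SNP positions inside the (closed) intervals."""
    return sum(
        bisect_right(sorted_positions, end) - bisect_left(sorted_positions, start)
        for start, end in intervals
    )


def has_enough_snps(intervals, sorted_positions, minimum=MIN_COMMON_SNPS):
    return count_snps(intervals, sorted_positions) >= minimum


# Hypothetical transcripts: gene "A" has two overlapping transcripts,
# gene "B" maps to two chromosomes and is therefore dropped.
toy = [
    ("A", "chr1", 100, 500),
    ("A", "chr1", 300, 900),
    ("B", "chr1", 10, 20),
    ("B", "chr2", 10, 20),
    ("C", "chrX", 5, 10),
    ("C", "chrX", 50, 60),
]
regions = build_gene_regions(toy)
print(regions)
```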

Coverage calculation for SNP Array 6.0 and Human1M

The local coverage and gene coverage were calculated based on the HapMap data. However, for each of the two latest chips, SNP Array 6.0 and Human1M, about 10% of the SNPs are not in the HapMap. According to Affymetrix, the SNP Array 6.0 has 934 968 SNPs, of which 99 854 (10.7%) are not in the HapMap, including 72 379 common SNPs for CEU, 76 016 for CHB, 70 356 for JPT, and 83 412 for YRI. According to Illumina, the Human1M has 1 072 820 SNPs, of which 125 688 (11.7%) are not in the HapMap, including 70 995 common SNPs for CEU, 67 453 for CHB/JPT, and 77 729 for YRI. Because of this, the local coverage and gene coverage of these chips may be underestimated if only the HapMap SNPs are considered in the coverage calculation. To address this problem, we calculated an alternative coverage estimate as follows, using the SNP Array 6.0 as an example. Suppose there is an ‘updated HapMap data set’ that consists of the current HapMap SNPs and the SNPs on the SNP Array 6.0. Based on this ‘updated data’, for each region, we could estimate the number of common SNPs, denoted as R1, and the number of common SNPs on the chip, denoted as T1. For example, if the region contains m non-HapMap common SNPs on the SNP Array 6.0, then R1 = R + m and T1 = T + m. However, owing to the lack of LD information between the ‘new’ SNPs and the other HapMap SNPs, we do not know how many additional HapMap SNPs are tagged by these ‘new’ SNPs; therefore, L1 cannot be directly estimated. If we assume that the number of tagged common SNPs that are not on the chip increases proportionally with the number of common SNPs on the chip, that is, T1/T = L1/L, then L1 can be estimated as (T1/T) × L. Therefore, based on the ‘updated HapMap data’, we could calculate the local/gene coverage of the SNP Array 6.0 as

$$\text{coverage} = \frac{\dfrac{L_1}{R_1 - T_1}\,(G - T_1) + T_1}{G}, \qquad L_1 = \frac{T_1}{T}\,L \tag{2}$$

The original estimate of genomic coverage in (1) ignores the SNPs that are on the SNP Array 6.0 but not in the HapMap, and thus it can be viewed as a ‘lower bound’ of the coverage. On the other hand, the estimate in (2) might overestimate the coverage when T1>T and T is small. In our analysis, we took the average of the coverage calculated using (1) and (2), which we believe may provide a more appropriate estimate for the coverage of the SNP Array 6.0. The coverage estimate for the Human1M was calculated similarly.
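The averaging of the two estimates can be sketched as follows. This is a minimal sketch under the proportionality assumption T1/T = L1/L stated above; the function names and the example counts (R = 1000, T = 100, L = 720, m = 10) are hypothetical, the default G = 2631 is the per-Mb estimate used for local coverage, and the appropriate G for a gene region is left to the caller.

```python
def coverage(R, T, L, G=2631.0):
    """HapMap-only estimate, treated as the lower bound."""
    return (L / (R - T) * (G - T) + T) / G


def updated_coverage(R, T, L, m, G=2631.0):
    """Estimate on the 'updated' data set, where m is the number of
    non-HapMap common SNPs on the chip in the region."""
    if T == 0:
        # Assumption of this sketch: the ratio T1/T is undefined when
        # the chip has no HapMap common SNP in the region.
        raise ValueError("T must be positive to scale L")
    R1, T1 = R + m, T + m
    L1 = (T1 / T) * L          # proportionality assumption T1/T = L1/L
    return (L1 / (R1 - T1) * (G - T1) + T1) / G


def averaged_coverage(R, T, L, m, G=2631.0):
    """Average of the HapMap-only and 'updated' estimates."""
    return 0.5 * (coverage(R, T, L, G) + updated_coverage(R, T, L, m, G))


# Hypothetical counts for one region.
print(round(coverage(1000, 100, 720), 4),
      round(updated_coverage(1000, 100, 720, 10), 4),
      round(averaged_coverage(1000, 100, 720, 10), 4))
```

When m = 0 the two estimates coincide, so the average reduces to the HapMap-only value. The sketch does not cap the 'updated' estimate, which can become large when T is small, as noted above.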

Results

A map of local coverage

We estimated the local coverage rate for Affymetrix SNP Array 5.0 and SNP Array 6.0, and for Illumina HumanHap300, HumanHap550, HumanHap650Y, and Human1M. As an example, Figure 1 displays the local coverage rate for chromosome 17 for the four HapMap populations. Detailed, high-resolution results for all chromosomes can be downloaded from http://www.biostat.mc.vanderbilt.edu/SNPChipCoverage. Not surprisingly, the Human1M has universally better coverage than the other five chips for all four populations. For the CEU sample, the coverage of the HumanHap550 is almost always better than that of the SNP Array 6.0, despite the fact that the latter chip has substantially more SNPs; moreover, the HumanHap300 is almost always better than the SNP Array 5.0. As expected, the coverage of the HumanHap650Y for the YRI sample is significantly improved over that of the HumanHap550. For comparison, the global coverage of the six SNP chips is summarized in Table 1.

Figure 1 Local coverage map for each HapMap population for chromosome 17. The six SNP chips that were evaluated are SNP Array 5.0 (black), SNP Array 6.0 (blue), HumanHap300 (red), HumanHap550 (green), HumanHap650Y (cyan), and Human1M (purple). The red bars at the top and bottom indicate the transcription regions of known protein coding genes.

Table 1 Global coverage (%) by SNP chips

Figure 2 shows a wide range of local coverage across the genome, with some regions receiving low to moderate coverage. For Human1M, the percentage of the euchromatic genome that has ≥80% local coverage rate is 98% for the CEU sample and 97% for the CHB+JPT samples. For HumanHap650Y, the corresponding percentages are 90 and 77%, respectively; for HumanHap550, the percentages are 88 and 73%; for HumanHap300, the percentages are only 41 and 11%. For the Affymetrix chips, the percentages are 69 and 74% for SNP Array 6.0, and only 9 and 12% for SNP Array 5.0. All six SNP chips have low coverage rates for the YRI sample. Figure 2 indicates that evaluation of local coverage provides information about an SNP chip that is complementary to global coverage.

Figure 2 Distribution of local coverage. The vertical line is the global coverage rate.

We next evaluated the variation of coverage across chromosomes by calculating the average local coverage rates for all 1 Mb intervals on each chromosome. The coverage of different chromosomes is largely similar, except for chromosome 19, which appears to have lower coverage by all six SNP chips across all HapMap populations (Figure 3). For example, for the CEU sample and SNP Array 6.0, the coverage for chromosome 19 is 67%, whereas the coverage for the other chromosomes ranges from 75 to 86%. The lower coverage for chromosome 19 is presumably due to SNP ascertainment bias in the HapMap12 or the unusually high density of repeat sequences and high prevalence of large segmental duplications on this chromosome.13

Figure 3 Mean local coverage by chromosome. The six SNP chips that were evaluated are SNP Array 5.0 (black), SNP Array 6.0 (blue), HumanHap300 (red), HumanHap550 (green), HumanHap650Y (cyan), and Human1M (purple).

Gene coverage

Figure 4 displays the number of gene regions with coverage exceeding certain thresholds for all six SNP chips. For the CEU sample, among the 19 913 genes with at least five common SNPs in the HapMap, 17 730 (89.1%) genes have ≥80% coverage by the Human1M, while the numbers are 16 210 (81.4%), 15 873 (79.7%), 11 207 (56.3%), 12 613 (63.3%), and 6820 (34.2%), respectively, for the HumanHap650Y, HumanHap550, HumanHap300, SNP Array 6.0, and SNP Array 5.0. The numbers are slightly smaller for the CHB+JPT samples, but drop substantially for the YRI sample. We also note that a noticeable fraction of genes are not well covered by any of the six SNP chips (Figure 5). For example, for the CEU sample, 1897 (9.5%) genes have coverage of <80% by all six SNP chips. The numbers of such genes are even greater for the CHB (2457, 12.7%), JPT (2295, 11.9%), and YRI (10 722, 51.8%) samples. Moreover, for each SNP chip, there are some genes that have zero coverage at r² = 0.8, even though they contain five or more HapMap common SNPs (Table 2).

Figure 4 Number of genes covered at various coverage thresholds. Only gene regions containing ≥5 HapMap common SNPs were considered, and coverage was evaluated at r² ≥ 0.8.

Figure 5 Number of genes with coverage less than a certain threshold by all six SNP chips. Only gene regions containing ≥5 HapMap common SNPs were considered, and coverage was evaluated at r² ≥ 0.8.

Table 2 Number of genes with 0% coverage by SNP chips

Similar to the analysis of local coverage, we also calculated the average coverage for genes on each chromosome (Figure 6). Again, we observed that the average coverage for genes on chromosome 19 is significantly lower than that for genes on other chromosomes. For example, for the CEU sample and SNP Array 6.0, the average coverage for genes on chromosome 19 is 61%, whereas the average coverage for genes on other chromosomes ranges from 73 to 85%. Since chromosome 19 has the highest density of genes among all human chromosomes, more than double the genome-wide average,13 it is particularly important to improve its coverage.

Figure 6 Mean gene coverage by chromosome. The six SNP chips that were evaluated are SNP Array 5.0 (black), SNP Array 6.0 (blue), HumanHap300 (red), HumanHap550 (green), HumanHap650Y (cyan), and Human1M (purple). Only gene regions containing ≥5 HapMap common SNPs were considered, and coverage was evaluated at r² ≥ 0.8.

Table 3 lists genes that have <30% coverage for the CEU sample by all six SNP chips and that are known to be associated with pathways in the KEGG and BioCarta databases (lists for other samples can be obtained from http://www.biostat.mc.vanderbilt.edu/SNPChipCoverage). This list includes several genes that have previously been reported to be associated with human diseases. For example, Long et al.14 noted that increased expression and a polymorphism of TGFB1 are associated with abdominal obesity and body mass index in humans. TGFB1 has also been reported to play a role in many other diseases, including Duchenne muscular dystrophy,15 kidney disease,16 cancer,17 scleroderma,18 lung disease,19 and herpes simplex virus-1 infection.20 We recognize that these findings need to be replicated by future studies. However, despite the potentially important role of TGFB1 in many diseases, all six SNP chips we evaluated have poor coverage for this gene. If an investigator is mainly interested in studying these diseases, then it is likely that TGFB1 will be missed in the initial scan. Understanding how well different SNP chips cover known genes will help investigators determine whether supplementary genotyping is needed for certain genes of high interest.

Table 3 Genes with coverage less than 30% by all six SNP chips for the CEU sample

We next evaluated whether genes with poor coverage are more likely to be located in copy number variation (CNV) regions.21, 22 We obtained the CNV annotation file from Affymetrix, which assembles information on all known CNV regions. For a given coverage threshold, the genes were categorized into two groups, one with coverage higher than the threshold and the other with coverage lower than the threshold. Within each group, we calculated the fraction of genes that are located in known CNV regions. Not surprisingly, a higher fraction of low coverage genes than of high coverage genes fall into known CNV regions, and the difference is greater for smaller coverage threshold values (Figure 7). This indicates that genes with poorer coverage are more likely to be located in known CNV regions. We also note that for the CEU sample, the fraction of low coverage genes in known CNV regions is slightly higher for the Illumina chips than for the Affymetrix chips. This is presumably because Illumina designed its products based on tag SNPs derived from the HapMap CEU sample, whereas Affymetrix designed its chips on the basis of sequence constraints when choosing the probes, which may result in better coverage of CNV regions.
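The grouping used in this comparison can be illustrated with a short sketch. The function name and the toy data are hypothetical and do not reproduce any result of the study; genes whose coverage equals the threshold are placed in the high coverage group, which is an assumption of this sketch.

```python
def cnv_fraction_by_group(genes, threshold):
    """Split genes at a coverage threshold and return the fraction of
    each group located in known CNV regions.

    genes: iterable of (coverage, in_cnv) pairs, coverage in [0, 1]
    returns: (fraction_low, fraction_high); None for an empty group
    """
    low = [in_cnv for cov, in_cnv in genes if cov < threshold]
    high = [in_cnv for cov, in_cnv in genes if cov >= threshold]

    def fraction(flags):
        return sum(flags) / len(flags) if flags else None

    return fraction(low), fraction(high)


# Made-up (coverage, in_cnv) pairs for seven genes.
toy_genes = [
    (0.10, True), (0.20, True), (0.25, False),
    (0.90, False), (0.95, True), (0.85, False), (0.70, False),
]
low_frac, high_frac = cnv_fraction_by_group(toy_genes, 0.3)
print(low_frac, high_frac)
```

Repeating the call over a grid of thresholds would give one pair of fractions per threshold, the kind of curves plotted in Figure 7.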

Figure 7 Percentage of genes in known CNV regions at various coverage thresholds. The six SNP chips that were evaluated are SNP Array 5.0 (black), SNP Array 6.0 (blue), HumanHap300 (red), HumanHap550 (green), HumanHap650Y (cyan), and Human1M (purple). Solid lines are for genes with coverage greater than the coverage threshold, and dashed lines are for genes with coverage less than the coverage threshold. Only gene regions containing ≥5 HapMap common SNPs were considered, and coverage was evaluated at r² ≥ 0.8.

Another possible reason for poor coverage is weak LD, as such regions would require inclusion of the majority of their SNPs in order to achieve satisfactory coverage. For genes that are not located in known CNV regions, we calculated the average r² over all common SNP pairs that are 30 kb apart. As expected, genes with poor coverage tend to have significantly lower levels of LD than genes with high coverage (data not shown).

Discussion

For six currently available SNP chips, we calculated a map of local coverage across the genome as well as the coverage of all known genes. All six SNP chips have demonstrated variation in their coverage. As GWA studies are becoming a major approach toward disease gene discovery, such explicit evaluation of coverage variation will give a full picture of the genotyping products. We believe that our results can facilitate several aspects in GWA studies.

First, they will be of interest to investigators who have specific prior interest in certain regions of the genome (e.g. candidate genes, linkage peaks, conserved elements and so on). Knowing the extent of coverage for these regions or genes can help determine whether supplementary genotyping is needed in addition to the whole-genome SNP chip.

Second, evaluation of local coverage and gene coverage can ease interpretation and comparison of inconsistent results from GWA studies using different SNP chips. Inconsistency of results in a region or gene across studies might be partly due to differences in coverage. Our results on local coverage (Supplementary Figure 1) and gene coverage (Supplementary Table 1) provide a clear visualization of coverage across the genome for several widely used SNP chips. With such information, an investigator can easily compare local coverage of different SNP chips, aiding interpretation of different results.

Third, knowledge of local and gene coverage can help in designing new SNP chips. We recognize that the selection of SNPs to be included in a chip will depend on practical constraints; for example, it may be difficult to improve coverage for certain regions due to structural variations such as CNVs or other segmental repeats.21, 22 However, our results indicate that many genes in the genome have low coverage simply due to weak LD. Previous studies have shown that some genes are preferentially located in such regions, for example, genes that are involved in immune response and sensory perception.23 Low coverage of a gene will often result in low power to detect genetic association if the disease variant falls in the gene. Evaluation of local and gene coverage can provide guidance on which regions or genes should receive denser coverage in a new chip.

When calculating gene coverage, we used the transcriptional start and end positions to define gene regions. We recognize that functional variants may exist in the 5′ or 3′ UTRs. However, the UTR information is not available for all the known genes and there is no consensus on how large the UTRs should be. To assess the effect of this choice, we repeated our calculation after expanding each region by 5 kb on each end, and observed similar results (data not shown).

It is commonly believed that GWA studies offer an unbiased approach for identifying susceptibility variants for complex diseases. However, even if the investigator does not impose any prior information on a GWA study, the analysis results will still be biased toward regions and genes that are better covered by the SNP chip used in the study. Thus, for current SNP chips, it is desirable to carry out supplementary genotyping where necessary and to employ more flexible data analysis approaches that can take prior information into account.

In summary, we have evaluated coverage variation of different SNP chips for GWA studies at a finer scale. Although we focused on six SNP chips in this paper, the procedures that we employed are general and are not restricted to a particular product. As whole-genome SNP chips continue to evolve, we believe that detailed coverage evaluation will be valuable for comparing different genotyping products and designing future GWA studies. All results presented in this paper can be downloaded from http://www.biostat.mc.vanderbilt.edu/SNPChipCoverage.