It has been four years since the publication of one of the first and largest genome-wise association studies (GWAS), the Wellcome Trust Case Control Consortium (WTCCC) study.1 Although imputation-based analysis was already adopted at that time for refining association signals, its use was limited and few results were reported based on imputations. In one study, an imputed missense mutation in gene GCKR is found to be more strongly associated with triglyceride.2

Imputation significantly increases the statistical power of GWAS and allows meta-analysis of studies genotyped on different platforms.3, 4, 5 Until recently, most imputation work has been using HapMap haplotypes as a reference panel.6 However, the recent publication of the 1000 Genomes Pilot Project and the availability of phased haplotypes from both the pilot and main project bring opportunities for much denser imputations and more extensive analysis.7 The 1000 genomes-based imputation has already been shown to refine association signals and identify underlying genetic variants that are in high linkage disequilibrium (LD) with variants that were included in the genotyping platform.7 However, there have been few reports of novel findings using this latest imputation approach on a genome-wide scale.8, 9 The Oxford-GSK study8 used 1000 genomes-based imputation to refine a single genetic region and successfully identified a SNP that is in the promoter region of a biologically plausible gene with a more significant P-value than that without 1000 genomes imputation. The Sardiana study9 fully utilized the reference panels from HapMap2, HapMap3, and the 1000 genomes, but did not specifically evaluate the power gains from each panel. It also has a smaller discovery sample size (N <1700) compared with the WTCCC study. In contrast, our study is hypothesis driven, based on a flagship study with rich phenotypes and a large sample size. We reason it is very important to confirm that genotype imputation based on the latest 1000 genomes release could identify novel variants on a genome-wide scale, beside refining associations at a regional level, given the large amount of efforts needed for scientists around the globe poised to apply this emerging tool to their scientific investigation.

We hypothesize that 1000 genomes-based imputation can identify novel variants beyond what could be seen from purely genotyped data or HapMap imputed data, due to the much denser SNP coverage and a much broader representation of reference populations. We test this hypothesis by re-examining the WTCCC Phase I data after imputing the genotype data to the full set of SNPs present in 1000 genomes latest release (version 20100804). We re-run association analysis for the seven traits based on 1000 genomes imputed dosages and highlight novel and refined genetic associations that would have been discovered by the original study should the 1000 genomes reference panel be available back then.


We obtained approval for using the raw genotype data for the original WTCCC data set, and we created one harmonized genotype data set by applying quality controls similar as the original study to the downloaded files (details provided in Online Supplementary Section S1). Using this genotype data with embedded disease status, we ran the case/control association tests with PLINK and verified that the results are similar as those reported in the original study with negligible difference.10 To match the genomic position used by the 1000 genomes reference, we mapped all SNPs to NCBI's build37 (hg19).

We used the MaCH program to first phase the haplotypes and then ran MiniMac for genotype imputation.11 We used the recommended two-step approach and recommended parameters of 20 iterations of the Markov sampler and 200 states. The 1000 genomes reference panel was obtained from the University of Michigan Abecasis lab, version 20100804 ( A total of 566 reference haplotypes of the European ancestry served as the reference panel. We ran a logistic regression analysis based on imputation dosages via MACH2DAT (thus taking imputation uncertainty into account) for each of the seven traits with the shared control without covariate adjustment, a similar statistical analysis approach presented by the original WTCCC study. We included all SNPs with estimated R2 >0.3 and minor allele frequency >0.01 for analysis.

SNPs were considered in the same region if they are within the same gene or are <1 Mb apart. We define ‘novel’ for any SNP with association P-value reaching the genome-wide significance level (5 × 10−8) in a region not reported in the original study that analyzed genotyped data. Novel variants that were later reported by other studies after the original 2007 WTCCC paper are designated as ‘missed’ instead of ‘novel’. We define ‘refined’ for any association where there is a reported association in the same region in the original study but the new lead SNP has a P-value more significant even after correcting the number of new SNPs tested. For ‘refined’ association, we further require that the lead SNP has either a functional support or is pinpointing to biologically more relevant gene. We used PolyPhen-2 to predict the possible impact of amino-acid substitutions in silico.12 To evaluate whether the genome-wide significant threshold of 5 × 10−8 widely used in HapMap2 imputed analysis would be sufficiently conservative in 1000 genomes imputed analysis, we picked tagging SNPs using a greedy algorithm similar to that in ldselect13 at R2=0.9 for the SNPs included in our analysis.

For all genetic loci identified as novel or refined, a 5-Mb region around the lead SNP was re-imputed and then analyzed with an independent imputation program (BEAGLE)14 and association analysis tool (PLINK)10 by an independent analyst (DE; Supplementary Section S2).


After SNP quality control and mapping of genomic positions to build37, a total of 389 827 SNPs for 16 179 samples were retained as input genotype data for imputation. After imputation, a total of 6 233 112 SNPs with estimated R2 >0.3 and minor allele frequency >0.01 were used for association analysis. The estimated genomic inflation factor λ15 for the seven case/control GWAS ranges from 1.04 to 1.09, which is comparable to the original study and indicates low genomic inflation. Association Manhattan plots for all seven analyses are shown in Supplementary Figure S1. We compared the signals with those in the original WTCCC study, and highlighted two missed and two refined variants that we identified through this latest imputation method (Table 1). All four loci were confirmed by an independent analysis using BEAGLE and PLINK (JH, DE).

Table 1 Novel/missed and refined variants from 1000 genomes imputation-based analysis

As shown in the regional plots, the two missed variants would not have been identified with HapMap2-based16 imputation (shown as red color in Figure 1). For the refined CUX2 region, the best HapMap2 imputed SNP is less significant than the genotyped SNP originally reported. For the refined IL23R region, the best HapMap2 SNP is not exonic and not predicted to have functional consequence.

Figure 1
figure 1

Regional plots for two novel (missed) and two refined loci. The top two plots are for two novel (‘missed’) regions, where highly significant SNPs meet genome-wide significance (P<2.5 × 10−8). The bottom two plots are for two refined regions. SNPs are represented by three different colors: black for WTCCC genotyped SNPs, red for HapMap2 imputed SNPs, and green for 1000 genomes imputed SNPs. Chromosome base pair positions (NCBI build 37) are represented on the X-axis. On the Y-axis, statistical significance is expressed as –log10 of the P-values. The horizontal line marks the P=2.5 × 10−8 threshold of genome-wide significance.

A total of 1 915 543 tagging SNPs were picked for the total of 6 233 112 SNPs included in our analysis, at the R2 threshold of 0.9. Therefore, compared with the previous assumption of one million independent loci across the genome,17and a genome-wide significance threshold of 5.0 × 10−8, we propose a genome-wide threshold of 2.5 × 10−8 assuming two million independent SNPs. All four SNPs in Table 1 meet genome-wide significance applying this conservative (because of correlation among the tagging SNPs) threshold.


This is the first study to comprehensively assess the utility of 1000 genomes-based imputation for identifying novel genetic association signals. We identified two associations that were not reported in the original WTCCC study but later established through other GWAS. One is within gene IL2RA for association with type 1 diabetes (T1D) and the other is 128 kb downstream of gene CDKN2B for association with type 2 diabetes (T2D). No association with phenotypic trait(s) was reported in the original study for these two loci. But as the two lead SNPs are no longer genome-wide significant once conditioning on the SNPs established by independent studies after the publication of the original WTCCC study18, 19, 20 we view them as ‘missed’ rather than ‘novel’ loci.

Furthermore, we identified two refined SNP associations: one is SNP rs11209026 in the exon of IL23R gene for association with Crohn's disease. It is predicted to be probably damaging by PolyPhen-2 and has a P-value of 1.41 × 10−21 compared with the previous best association P-value of 5.85 × 10−12 (SNP rs11805303 within intron 6 of IL23R, see Table 1) from the original WTCCC study. Although the lead SNP rs11209026 is still genome-wide significant after conditioning on the lead genotyped SNP, we consider it a refined signal for two reasons. First two SNPs reside within the same gene and in physical vicinity (30.4 kb apart). Second, the P-value of rs11209026 dropped by more than three orders of magnitude (from 4.2 × 10−21 to 7.6 × 10−17), suggesting that the two signals are partially dependent or tagging the same underlying/untyped causal SNP(s) or haplotype(s). The other refined association is located in the CUX2 region for association with T1D, where the newly identified top SNP rs1265564 has an association P-value of 1.68 × 10−16. The original WTCCC study reported a best association P-value of 1.51 × 10−14 (rs17696736) within the gene NAA25 (Table 1). These two SNPs are 780 kb apart, however, the CUX2 gene has been shown to directly regulate the expression of NeuroD, a gene that can cause T1D when mutated.21 The lead SNPs for these two refined loci (IL23R for CD and CUX2 for T1D) are no longer genome-wide significant after conditioning on the nearest lead SNPs reported in literature to date.18, 22

We tend to believe that the two missed loci, namely the IL2RA locus for T1D and CDKN2B locus for T2D, are the same as reported in later studies for two reasons. First, the two missed loci fall in the same region with the SNPs reported in post-2007 literature with physical distance ranging from 3.6 to 28.8 kb. Although the level of LD is largely moderate, given such close physical vicinity, it is hard to rule out the possibility that both our lead SNPs and the SNPs reported in literature are tagging the same (untested) SNP(s) or haplotype(s). Our second reason is that the P-values of our lead SNPs drop by an order of three or four, and all >2.9 × 10−5 once conditional on the SNPs reported in the post-2007 literature. For the IL2RA locus, more detailed haplotype and fine-mapping analyses would be required to fully appreciate the complex architecture of causal variants in this region.

The independent imputation and association analysis using BEAGLE and PLINK identified the exact same four SNPs showing best association signals in the four regions identified by MaCH and MACH2DAT, confirming our findings that two are novel and two are refined, except that the BEALGE/PLINK generated a P-value for the CDKN2B locus slightly above the 5 × 10−8 threshold. Both MaCH and BEAGLE have been recommended for practical use because of their user-friendly interface and computational efficiency.23, 24, 25, 26

We adopted a conservative genome-wide significance threshold 2.5 × 10−8 to guard against false positives particularly given that we are testing seven phenotypic traits instead of a single one. The fact that we only have genome-wide significant signals from four well-established regions suggests that our conservative threshold fulfilled its purpose. Future studies may gain additional power with more sophisticated methods to control type-I error27, 28, 29, 30 or with methods that handle multiple related phenotypes.31, 32 On the other hand, for four out of the seven traits, our 1000 genomes-based imputation detected nothing novel on top of the original WTCCC study, suggesting that the potential power of imputation is limited by the genetic architecture of the trait(s) of interest and the genomic coverage of the GWAS genotyping panel used.

We present here an example where 1000 genomes-based imputation identifies both novel and refined signals. By using 1000 genomes-based imputation, we identified two SNPs that are genome-wide significantly associated with two of the seven traits in the WTCCC study, neither discovered in the original study with only genotyped SNPs. The two SNPs ‘missed’ from the original analysis serve as positive controls because the two residing regions were both established by other studies after the 2007 WTCCC publication. Our analysis also provided further insights into two regions identified in the original study by identifying SNPs that are either more significant or point to a biologically more plausible gene. Importantly, we had no other signals based on our conservative genome-wide significance threshold, suggesting that we have no inflated false discovery rates. Taken together, our findings suggest that applying 1000 genomes-based imputation to the large number of GWAS data sets existing nowadays has the potential both to identify novel disease-associated genetic variants and to advance our understanding in known regions by examining a much denser set of imputed variants. We believe that larger reference panels continuing to be released by the 1000 Genomes Project will benefit the community even more, by performing single-marker analysis as presented here or rare or structural variant analysis.11, 33