Main

Despite extensive research efforts for more than a decade, the genetic basis of common human diseases remains largely unknown. Although there have been some notable successes1, linkage and candidate gene association studies have often failed to deliver definitive results. Yet the identification of the variants, genes and pathways involved in particular diseases offers a potential route to new therapies, improved diagnosis and better disease prevention. For some time it has been hoped that the advent of genome-wide association (GWA) studies would provide a successful new tool for unlocking the genetic basis of many of these common causes of human morbidity and mortality1.

Three recent advances mean that GWA studies that are powered to detect plausible effect sizes are now possible2. First, the International HapMap resource3, which documents patterns of genome-wide variation and linkage disequilibrium in four population samples, greatly facilitates both the design and analysis of association studies. Second, the availability of dense genotyping chips, containing sets of hundreds of thousands of single nucleotide polymorphisms (SNPs) that provide good coverage of much of the human genome, means that for the first time GWA studies for thousands of cases and controls are technically and financially feasible. Third, appropriately large and well-characterized clinical samples have been assembled for many common diseases.

The Wellcome Trust Case Control Consortium (WTCCC) was formed with a view to exploring the utility, design and analyses of GWA studies. It brought together over 50 research groups from the UK that are active in researching the genetics of common human diseases, with expertise ranging from clinical, through genotyping, to informatics and statistical analysis. Here we describe the main experiment of the consortium: GWA studies of 2,000 cases and 3,000 shared controls for 7 complex human diseases of major public health importance—bipolar disorder (BD), coronary artery disease (CAD), Crohn’s disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (T1D), and type 2 diabetes (T2D). Two further experiments undertaken by the consortium will be reported elsewhere: a GWA study for tuberculosis in 1,500 cases and 1,500 controls, sampled from The Gambia; and an association study of 1,500 common controls with 1,000 cases for each of breast cancer, multiple sclerosis, ankylosing spondylitis and autoimmune thyroid disease, all typed at around 15,000 mainly non-synonymous SNPs. By simultaneously studying seven diseases with differing aetiologies, we hoped to develop insights, not only into the specific genetic contributions to each of the diseases, but also into differences in allelic architecture across the diseases. A further major aim was to address important methodological issues of relevance to all GWA studies, such as quality control, design and analysis. In addition to our main association results, we address several of these issues below, including the choice of controls for genetic studies, the extent of population structure within Great Britain, sample sizes necessary to detect genetic effects of varying sizes, and improvements in genotype-calling algorithms and analytical methods.

Samples and experimental analyses

Individuals included in the study were living within England, Scotland and Wales (‘Great Britain’) and the vast majority had self-identified as white Europeans (153 individuals with non-Caucasian ancestry were excluded from final analysis; see below). The seven conditions selected for study are all common familial diseases of major public health importance, both in the UK and globally4, for which suitable nationally representative sample sets were available. The control individuals came from two sources: 1,500 individuals from the 1958 British Birth Cohort (58C) and 1,500 individuals selected from blood donors recruited as part of this project (UK Blood Services (UKBS) controls). See Methods and Supplementary Table 1 for sample recruitment, phenotypes and summary details for each collection.

We adopted an experimental design with 2,000 cases for each disease and 3,000 combined controls. All 17,000 samples were genotyped with the GeneChip 500K Mapping Array Set (Affymetrix chip), which comprises 500,568 SNPs, as described in Methods. The power of this study (estimated from simulations that mimic linkage disequilibrium patterns in the HapMap Caucasian sample (CEU), see Methods) averaged across SNPs with minor allele frequencies (MAFs) above 5% is estimated to be 43% for alleles with a relative risk of 1.3, increasing to 80% for a relative risk of 1.5, for a P-value threshold of 5 × 10⁻⁷ (Supplementary Table 2).

We developed a new algorithm, CHIAMO, which we applied to simultaneously call the genotypes from all individuals (see Methods and Supplementary Information). Cross-platform comparison showed CHIAMO to outperform BRLMM (the standard Affymetrix algorithm) by having an error rate under 0.2% (Supplementary Table 3), and comparison of 108 duplicate genotypes in our study gave a discordance rate of 0.12%.

We excluded 809 samples after checks for contamination, false identity, non-Caucasian ancestry and relatedness (see Methods and Supplementary Table 4); 16,179 individuals remained in the study.

Genome-wide, 469,557 SNPs (93.8%) passed our quality control filters (described in Methods) giving an average call rate of 99.63%. Of those, 392,575 have study-wide MAFs > 1% (45,106 have MAFs < 0.1%; see also Supplementary Figs 1 and 2). Initial analyses of the polymorphic SNPs suggest that patterns of linkage disequilibrium in our samples are very similar to those in HapMap (Supplementary Fig. 3). Therefore, we expect genome coverage with the Affymetrix 500K set in this study to be similar to that estimated for the HapMap CEU panel2.

All SNPs passing quality control filters were used in the association analyses, although power is very low for SNPs with low MAFs (unless they have unusually large effects). On visual inspection of the cluster plots of SNPs showing apparently strong association, we removed a further 638 SNPs with poor clustering.

Control groups

Our main purpose in using two control groups was to assess possible bias in ascertaining control samples. In addition, noting that DNA sample processing differed between these groups, comparison of control groups also provides a check for effects of differential genotyping errors as a result of differences in DNA collection and preparation. Figure 1a shows the results of 1-d.f. Mantel-extension tests5 for differences in allele frequencies of SNPs between subjects from the 58C and UKBS collections, stratified by 12 broad regions of Great Britain (see Supplementary Table 5 and Supplementary Fig. 4 for region definitions). The associated quantile-quantile plot (see Methods for background) in Fig. 1b shows good agreement with the null distribution (similar results are obtained for tests that do not stratify by geography, data not shown). We see few significant differences between these two control groups even though they differ in population groups sampled, DNA processing and age. This indicates that there would be little bias from using either sample as a control group for any of the case series, and justifies combining the two control groups into a single group of 3,000 subjects for our main analyses.

Figure 1: Genome-wide scan for allele frequency differences between controls.

a, P values from the trend test for differences between SNP allele frequencies in the two control groups, stratified by geographical region. SNPs have been excluded on the basis of failure in a test for Hardy–Weinberg equilibrium in either control group considered separately, a low call rate, or a minor allele frequency of less than 1%, but not on the basis of a difference between control groups. Green dots indicate SNPs with a P value <1 × 10⁻⁵. b, Quantile-quantile plots of these test statistics. In this and subsequent quantile-quantile plots, the shaded region is the 95% concentration band (see Methods).
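The quantile-quantile comparison used in Fig. 1b and in later figures can be illustrated with a minimal Python sketch. It assumes 1-d.f. test statistics that follow a χ² distribution with one degree of freedom under the null. The function name `qq_points` and the plotting positions (i − 0.5)/n are illustrative choices rather than details taken from Methods, and the 95% concentration band is not computed here.

```python
from statistics import NormalDist


def qq_points(test_stats):
    """Pair each observed 1-d.f. statistic with its expected null quantile.

    Hypothetical helper for illustration only. Returns a list of
    (expected, observed) pairs sorted by increasing observed value.
    """
    observed = sorted(test_stats)
    n = len(observed)
    std_normal = NormalDist()
    points = []
    for i, obs in enumerate(observed, start=1):
        # Plotting position for the i-th smallest statistic (an assumption).
        q = (i - 0.5) / n
        # A chi-square(1) variable is a squared standard normal, so its
        # q-quantile is the square of the normal (1 + q) / 2 quantile.
        expected = std_normal.inv_cdf(0.5 + q / 2.0) ** 2
        points.append((expected, obs))
    return points
```

Plotting the observed values against the expected ones gives a line of unit slope when the null holds genome-wide; a uniformly steeper line suggests over-dispersion, whereas departures confined to the upper tail suggest a limited number of true differences.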

One consequence of using a shared control group (for which detailed phenotyping for all traits of interest is not available) relates to the potential for misclassification bias: a proportion of the controls is likely to have the disease of interest (and therefore might meet the criteria for inclusion as a case) and some others will develop it in the future. However, the effect this has on power is modest unless the extent of misclassification bias is substantial; for example, if 5% of controls would meet the definition of cases at the same age, the loss of power is approximately the same as that due to a reduction of the sample size by 10%6. Even for the higher prevalence conditions examined by the WTCCC (such as HT, CAD and T2D), the precise ascertainment schemes used here (which enriched for more extreme phenotypes and/or strong family history) will have limited the proportions of controls meeting case criteria to low levels (for example, to <5%). Although a study design which used ‘hypercontrols’ (that is, selection of control individuals from the lower extremity of the relevant trait distribution) would generally be the most powerful approach in a study focusing on one disease, the merits of such an approach need to be weighed against the additional costs associated with the need to phenotype and genotype each control sample.
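The correspondence between 5% misclassification and a roughly 10% smaller sample can be reproduced with a simple back-of-envelope argument, sketched below in Python. This is an assumed simplification and not necessarily the derivation given in ref. 6: if a fraction m of controls are in fact cases, the case-control allele frequency difference shrinks by a factor (1 − m), and because the non-centrality of the association test scales with sample size times the squared difference, the information retained is about (1 − m)². The function name is hypothetical.

```python
def retained_information(misclassified_fraction):
    """Approximate share of statistical information kept when a fraction
    of controls would meet the case definition.

    Illustrative assumption: the allele frequency difference is attenuated
    by (1 - m), and the test non-centrality is proportional to the sample
    size times the squared difference.
    """
    if not 0.0 <= misclassified_fraction < 1.0:
        raise ValueError("misclassified_fraction must lie in [0, 1)")
    return (1.0 - misclassified_fraction) ** 2


# Equivalent proportional reduction in sample size for 5% misclassification.
equivalent_loss = 1.0 - retained_information(0.05)
```

With m = 0.05 the retained information is 0.95² = 0.9025, that is, an equivalent loss of a little under 10% of the sample, in line with the figure quoted above.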

Geographical variation and population structure

An additional cause of false positive findings is hidden population structure. Case and control samples may differ in the distribution of their ancestry, either owing to control sampling effects, as discussed above, or to confounding when different ancestries carry higher disease risk and are, as a result, over-represented in cases. Even after exclusion of individuals with evidence of recent non-European ancestry, the British population is heterogeneous, having been shaped by several waves of immigration from southern and northern Europe. Whether the differences between these incoming populations are sufficiently large to distort the findings of population-based case-control studies is an open question.

We first examined our samples for non-European ancestry, using multidimensional scaling after ‘seeding’ our data with those from the three HapMap analysis panels (see Supplementary Fig. 5 and Methods), and excluded 153 individuals on this basis. We next looked for evidence of population heterogeneity by studying allele frequency differences between the 12 broad geographical regions (defined in Supplementary Fig. 4). The results for these 11-d.f. tests and associated quantile-quantile plots are shown in Fig. 2. Widespread small differences in allele frequencies are evident as an increased slope of the line (Fig. 2b); in addition, a few loci show much larger differences (Fig. 2a and Supplementary Fig. 6).

Figure 2: Genome-wide picture of geographic variation.

a, P values for the 11-d.f. test for difference in SNP allele frequencies between geographical regions, within the 9 collections. SNPs have been excluded using the project quality control filters described in Methods. Green dots indicate SNPs with a P value <1 × 10⁻⁵. b, Quantile-quantile plots of these test statistics. SNPs at which the test statistic exceeds 100 are represented by triangles at the top of the plot, and the shaded region is the 95% concentration band (see Methods). Also shown in blue is the quantile-quantile plot resulting from removal of all SNPs in the 13 most differentiated regions (Table 1).

Thirteen genomic regions showing strong geographical variation are listed in Table 1, and Supplementary Fig. 7 shows the way in which their allele frequencies vary geographically. The predominant pattern is variation along a NW/SE axis. The most likely cause for these marked geographical differences is natural selection, most plausibly in populations ancestral to those now in the UK. Variation due to selection has previously been implicated at LCT (lactase) and major histocompatibility complex (MHC)7,8,9, and within-UK differentiation at 4p14 has been found independently10, but others seem to be new findings. All but three of the regions contain known genes. Aside from evolutionary interest, genes showing evidence of natural selection are particularly interesting for the biology of traits such as infectious diseases; possible targets for selection include NADSYN1 (NAD synthetase 1) at 11q13, which could have a role in prevention of pellagra, as well as TLR1 (toll-like receptor 1) at 4p14, for which a role in the biology of tuberculosis and leprosy has been suggested10.

Table 1 Highly differentiated SNPs

There may be important population structure that is not well captured by current geographical region of residence. Present implementations of strongly model-based approaches such as STRUCTURE11,12 are impracticable for data sets of this size, and we reverted to the classical method of principal components13,14, using a subset of 197,175 SNPs chosen to reduce inter-locus linkage disequilibrium. Nevertheless, four of the first six principal components clearly picked up effects attributable to local linkage disequilibrium rather than genome-wide structure. The remaining two components show the same predominant geographical trend from NW to SE but, perhaps unsurprisingly, London is set somewhat apart (Supplementary Fig. 8).

The overall effect of population structure on our association results seems to be small, once recent migrants from outside Europe are excluded. Estimates of over-dispersion of the association trend test statistics (usually denoted λ; ref. 15) ranged from 1.03 and 1.05 for RA and T1D, respectively, to 1.08–1.11 for the remaining diseases. Some of this over-dispersion could be due to factors other than structure, and this possibility is supported by the fact that inclusion of the two ancestry informative principal components as covariates in the association tests reduced the over-dispersion estimates only slightly (Supplementary Table 6), as did stratification by geographical region. This impression is confirmed on noting that P values with and without correction for structure are similar (Supplementary Fig. 9). We conclude that, for most of the genome, population structure has at most a small confounding effect in our study, and as a consequence the analyses reported below do not correct for structure. In principle, apparent associations in the few genomic regions identified in Table 1 as showing strong geographical differentiation should be interpreted with caution, but none arose in our analyses.
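As an illustration of how an over-dispersion factor of this kind is commonly estimated, the Python sketch below divides the median of the observed 1-d.f. trend statistics by the median of a χ² distribution with one degree of freedom. The median-based estimator is an assumption made for illustration; the exact estimator used in this study is the one described in ref. 15 and Methods, and the names `genomic_inflation` and `CHI2_1DF_MEDIAN` are hypothetical.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 d.f. (about 0.455), obtained
# as the squared 75th percentile of the standard normal distribution.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation(test_stats):
    """Median-based estimate of the over-dispersion factor lambda.

    Illustrative estimator only: values near 1 indicate little inflation
    of the test statistics relative to the null distribution.
    """
    return median(test_stats) / CHI2_1DF_MEDIAN
```

Under this estimator, a value of 1.08 means that the median observed statistic is 8% larger than expected under the null.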

Disease association results

We assessed evidence for association in several ways (see Methods for details), drawing on both classical and bayesian statistical approaches. For polymorphic SNPs on the Affymetrix chip, we performed trend tests (1 degree of freedom16) and general genotype tests (2 degrees of freedom16, referred to as genotypic) between each case collection and the pooled controls, and calculated analogous Bayes factors. There are examples from animal models where genetic effects act differently in males and females17, and to assess this in our data we applied a sex-differentiated test which is sensitive to associations of a different magnitude and/or direction in the two sexes.
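The two classical single-SNP tests can be sketched in Python for one SNP, given genotype counts ordered by number of copies of one allele (0, 1, 2). The sketch assumes the textbook Cochran-Armitage trend statistic with additive scores and the Pearson χ² statistic on the 2 × 3 genotype table; function names are hypothetical, and details of the study's actual implementation (missing data handling, Bayes factors, the sex-differentiated test) are not reproduced.

```python
import math


def trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend chi-square (1 d.f.) and its P value.

    Assumes a polymorphic SNP; a monomorphic one gives zero variance.
    """
    n_cases, n_controls = sum(cases), sum(controls)
    n_total = n_cases + n_controls
    totals = [r + s for r, s in zip(cases, controls)]
    t = sum(x * (r * n_controls - s * n_cases)
            for x, r, s in zip(scores, cases, controls))
    sum_xn = sum(x * n for x, n in zip(scores, totals))
    sum_x2n = sum(x * x * n for x, n in zip(scores, totals))
    variance = (n_cases * n_controls / n_total) * (n_total * sum_x2n - sum_xn ** 2)
    chi2 = t * t / variance
    # Upper tail of chi-square(1): P(|Z| > sqrt(chi2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def genotypic_test(cases, controls):
    """Pearson chi-square on the 2 x 3 genotype table (2 d.f.) and P value.

    The P value assumes all three genotype classes are observed.
    """
    n_cases, n_controls = sum(cases), sum(controls)
    n_total = n_cases + n_controls
    chi2 = 0.0
    for r, s in zip(cases, controls):
        column_total = r + s
        if column_total == 0:
            continue  # an empty genotype class contributes nothing
        for observed, row_total in ((r, n_cases), (s, n_controls)):
            expected = row_total * column_total / n_total
            chi2 += (observed - expected) ** 2 / expected
    # Upper tail of chi-square(2) has the closed form exp(-chi2 / 2).
    return chi2, math.exp(-chi2 / 2.0)
```

When the genotype effect is purely additive the two statistics are similar, so the trend test, with one fewer degree of freedom, gives the smaller P value; the genotypic test gains when risk departs from additivity, for example under a recessive model.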

Our study also allows us to look for loci which may have an effect in more than one disease. To assess this, we compared our common controls with all cases in each of three natural groupings of diseases: CAD+HT+T2D (metabolic and cardiovascular phenotypes with potential aetiological overlap, for example, involving defects in insulin action); RA+T1D (already known to share common loci); and CD+RA+T1D (all autoimmune diseases).

To help to capture putative disease loci not on the Affymetrix chip we used a new multilocus method in which a population genetics model is applied to our genotype data and the HapMap reference samples to simulate, or impute, genotype data at 2,193,483 HapMap SNPs not on the Affymetrix chip. These imputed, or in silico, genotypes are then tested for association in the same ways as SNPs genotyped in the project.

Before detailing the principal results for each disease, we first summarize our main observations. Table 2 details the findings from the WTCCC scan for the 15 variants for which there was strong prior evidence of association with one or more of the diseases studied, based on extensive replication studies. All but two of these show associations in our study, with the magnitude of the evidence generally consistent with their effect sizes as estimated from prior studies. One of the signals for which we failed to obtain evidence of replication (APOE in CAD) is poorly tagged by the Affymetrix 500K chip. The other (INS in T1D) is represented by a single SNP that marginally failed our study-wide quality control filters (overall missingness 5.2%) but which was nonetheless strongly associated with T1D when examined. Quantile-quantile plots for the trend test for each of the seven diseases show only very minor deviations from the null distribution, except in the extreme tails which correspond to associations reported below (Fig. 3). The quantile-quantile plots and the results at positive controls (Table 2) give confidence in the quality of our data and the robustness of our analyses.

Table 2 Evidence for signal of association at previously robustly replicated loci
Figure 3: Quantile-quantile plots for seven genome-wide scans.

For each of the seven disease collections, a quantile-quantile plot of the results of the trend test is shown in black for all SNPs that pass the standard project filters, have a minor allele frequency >1% and missing data rate <1%. SNPs that were visually inspected and revealed genotype calling problems were excluded. These filters were chosen to minimize the influence of genotype-calling artefacts. Each quantile-quantile plot shown in black involves around 360,000 SNPs. SNPs at which the test statistic exceeds 30 are represented by triangles. Additional quantile-quantile plots, which also exclude all SNPs located in the regions of association listed in Table 3, are superimposed in blue (for BD, the exclusion of these SNPs has no visible effect on the plot, and for HT there are no such SNPs). The blue quantile-quantile plots show that departures in the extreme tail of the distribution of test statistics are due to regions with a strong signal for association.

Our genome-wide results for the trend test are illustrated in Fig. 4. The single-disease trend and genotypic tests for SNPs on the chip identified 21 signals across the 7 diseases that exceeded a threshold of 5 × 10⁻⁷ (Table 3). For each of these SNPs (except those within the MHC), cluster plots are shown in Supplementary Fig. 10 and ‘signal plots’ in Fig. 5. These signal plots estimate the likely demarcation of the hit region and show the signal at genotyped and imputed SNPs together with local genomic context. Four further strong (with P < 5 × 10⁻⁷) associations were revealed by the other primary analyses described (Table 3). One locus (in RA) was revealed by the sex-differentiated analysis, two through multilocus approaches (both for T1D) and one through an analysis which combined cases from more than one autoimmune disease (signal plots in Supplementary Figs 11, 12 and 13, respectively).

Figure 4: Genome-wide scan for seven diseases.

For each of seven diseases, -log10 of the trend test P value is plotted against position on each chromosome for quality-control-positive SNPs, excluding those removed in each disease for poor clustering after visual inspection. Chromosomes are shown in alternating colours for clarity, with P values <1 × 10⁻⁵ highlighted in green. All panels are truncated at -log10(P value) = 15, although some markers (for example, in the MHC in T1D and RA) exceed this value.

Table 3 Regions of the genome showing the strongest association signals
Figure 5: Regions of the genome showing strong evidence of association.

Characteristics of genomic regions 1.25 Mb to either side of ‘hit SNPs’—SNPs with lowest P values. Region boundaries (vertical dotted lines) were chosen to coincide with locations where test statistics returned to background levels and, where possible, recombination hotspots. Upper panel, -log10(P values) for the test (trend or genotypic) with the smallest P value at the hit SNP. Black points represent SNPs typed in the study, and grey points represent SNPs whose genotypes were imputed. SNPs imputed with higher confidence are shown in darker grey. Middle panel, fine-scale recombination rate (centimorgans per Mb) estimated from Phase II HapMap. The purple line shows the cumulative genetic distance (in cM) from the hit SNP. Lower panel, known genes, and sequence conservation in 17 vertebrates. Known genes (orange) in the hit region are listed in the upper right part of each plot in chromosomal order, starting at the left edge of the region. The top track shows plus-strand genes and the middle track shows minus-strand genes. Sequence conservation (bottom track) scores are based on the phylogenetic hidden Markov model phastCons. Highly conserved regions (phastCons score ≥600) are shown in blue. Information in middle and lower panels is taken from the UCSC Genome Browser. Positions are in NCBI build-35 coordinates. See Supplementary Information on ‘signal plots’.

All of these signals were subjected to visual inspection of cluster plots, and in all cases (with one exception noted below) nearby correlated SNPs also showed a strong signal (see signal plots). Thus, genotyping artefacts are unlikely to be responsible for these associations. Indeed, at the time of writing, 12 of these 25 strong signals represent replications of previously reported findings (only those with extensive prior replication are reported in Table 2). Of the remainder, follow-up studies (reported elsewhere) have confirmed all but one of the loci (ten in total) for which replication has been attempted10,19,20,21,22,23,24. The other replication study gave equivocal results. Of the 18 loci implicated in autoimmune diseases, 5 show associations (P < 0.001) to more than 1 condition, leading to a number of further potential new associations, at least one of which has also been replicated10.

It is likely that further susceptibility genes will be identified through follow-up of other signals for which the evidence from our scan is less conclusive (see below for some specific examples). For example, there are 58 further signals with single-point P values between 10⁻⁵ and 5 × 10⁻⁷ for which inspection of cluster plots verifies CHIAMO calls (Table 4). As described below, analyses which make use of selected case samples to expand the reference group should also provide a useful route to the prioritization of such putative signals for further analysis. For convenience, the strongest association results are presented separately for each disease in Supplementary Table 7.

Table 4 Regions of the genome showing moderate evidence of association

Several general points are relevant to interpretation of these disease-association data. First, replication studies are required to confirm associations from GWAs. For the reasons given in the box, we regard very low P values (say P < 5 × 10⁻⁷) in our comparatively large sample size as strong evidence for association, and indeed all or most of the loci we find at this level are either already known or have now been confirmed by subsequent replication. Such replication studies are also the substrate for efforts to determine the range of associated phenotypes and to identify and characterize pathologically relevant variation.

Second, failure to detect a prominent association signal in the present study cannot provide conclusive exclusion of any given gene. This is the consequence of several factors including: less-than-complete coverage of common variation genome-wide on the Affymetrix chip; poor coverage (by design) of rare variants, including many structural variants (thereby reducing power to detect rare, penetrant, alleles)25; difficulties with defining the full genomic extent of the gene of interest; and, despite the sample size, relatively low power to detect, at levels of significance appropriate for genome-wide analysis, variants with modest effect sizes (odds ratio (OR) < 1.2).

Third, whereas the association signals detected can help to define regions of interest, they cannot provide unambiguous identification of the causal genes. Nevertheless, assessments on the basis of positional candidacy carry considerable weight, and, as we show, these already allow us, for selected diseases, to highlight pathways and mechanisms of particular interest. Naturally, extensive resequencing and fine-mapping work, followed by functional studies will be required before such inferences can be translated into robust statements about the molecular and physiological mechanisms involved.

We turn now to a discussion of the main findings for each disease, focusing here only on the most significant and interesting results from the analyses described above, and consideration of an expanded reference group, described below.

Bipolar disorder (BD)

Bipolar disorder (BD; manic depressive illness26) refers to an episodic recurrent pathological disturbance in mood (affect) ranging from extreme elation or mania to severe depression and usually accompanied by disturbances in thinking and behaviour: psychotic features (delusions and hallucinations) often occur. Pathogenesis is poorly understood but there is robust evidence for a substantial genetic contribution to risk27,28. The estimated sibling recurrence risk (λs) is 7–10 and heritability 80–90%27,28. The definition of BD phenotype is based solely on clinical features because, as yet, psychiatry lacks validating diagnostic tests such as those available for many physical illnesses. Indeed, a major goal of molecular genetics approaches to psychiatric illness is an improvement in diagnostic classification that will follow identification of the biological systems that underpin the clinical syndromes. The phenotype definition that we have used includes individuals that have suffered one or more episodes of pathologically elevated mood (see Methods), a criterion that captures the clinical spectrum of bipolar mood variation that shows familial aggregation29.

Several genomic regions have been implicated in linkage studies30 and, recently, replicated evidence implicating specific genes has been reported. Increasing evidence suggests an overlap in genetic susceptibility with schizophrenia, a psychotic disorder with many similarities to BD. In particular association findings have been reported with both disorders at DAOA (D-amino acid oxidase activator), DISC1 (disrupted in schizophrenia 1), NRG1 (neuregulin1) and DTNBP1 (dystrobrevin binding protein 1)31.

The strongest signal in BD was with rs420259 at chromosome 16p12 (genotypic test P = 6.3 × 10⁻⁸; Table 3) and the best-fitting genetic model was recessive (Supplementary Table 8). Although recognizing that this signal was not additionally supported by the expanded reference group analysis (see below and Supplementary Table 9) and that independent replication is essential, we note that several genes at this locus could have pathological relevance to BD (Fig. 5). These include PALB2 (partner and localizer of BRCA2), which is involved in stability of key nuclear structures including chromatin and the nuclear matrix; NDUFAB1 (NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1), which encodes a subunit of complex I of the mitochondrial respiratory chain; and DCTN5 (dynactin 5), which encodes a protein involved in intracellular transport that is known to interact with the gene ‘disrupted in schizophrenia 1’ (DISC1)32, the latter having been implicated in susceptibility to bipolar disorder as well as schizophrenia33.

Of the four regions showing association at P < 5 × 10⁻⁷ in the expanded reference group analysis (Supplementary Table 9), it is of interest that the closest gene to the signal at rs1526805 (P = 2.2 × 10⁻⁷) is KCNC2 which encodes the Shaw-related voltage-gated potassium channel. Ion channelopathies are well-recognized as causes of episodic central nervous system disease, including seizures, ataxias and paralyses34. It is possible that this may extend to episodic disturbances of mood and behaviour.

Amongst the other higher ranked signals in the BD data set (Supplementary Table 7), there is support for the previously suggested importance of GABA neurotransmission (rs7680321 (P = 6.2 × 10⁻⁵) in GABRB1 encoding a ligand-gated ion channel (GABA A receptor, beta 1))35, glutamate neurotransmission (rs1485171 (P = 9.7 × 10⁻⁵) in GRM7 (glutamate receptor, metabotropic 7))35 and synaptic function (rs11089599 (P = 7.2 × 10⁻⁵) in SYN3 (synapsin III)36).

We note that a broad range of genetic and non-genetic data point to the importance of analyses that use alternative approaches to phenotype definition, including symptom dimensions31. Although beyond the scope of the current paper, such analyses will be required to maximize the potential of the current BD data set.

Coronary artery disease (CAD)

Coronary artery disease (coronary atherosclerosis) is a chronic degenerative condition in which lipid and fibrous matrix is deposited in the walls of the coronary arteries to form atheromatous plaques37. It may be clinically silent or present with angina pectoris or acute myocardial infarction. Pathogenesis is complex, with endothelial dysfunction, oxidative stress and inflammation contributing to development and instability of the atherosclerotic plaque37.

In addition to lifestyle and environmental factors, genes are important in the aetiology of CAD38. For early myocardial infarction, estimates of λs range from 2 to 7 (ref. 39). Genetic variation is thought likely to influence risk of CAD both directly and through effects on known CAD risk factors including hypertension, diabetes and hypercholesterolaemia. Genome-wide linkage studies have mapped several loci that may affect susceptibility to CAD/myocardial infarction40 although for only two of these has the likely gene been identified (ALOX5AP (arachidonate 5-lipoxygenase-activating protein) and LTA4H (leukotriene A4 hydrolase))41,42. Association studies have identified several plausible genetic variants affecting lipids, thrombosis, inflammation or vascular biology but for most the evidence is not yet conclusive40. We did not find evidence for strong association at any of these genes within our study (Table 2 and Supplementary Table 10).

The most notable new finding for CAD is the powerful association on chromosome 9p21.3 (Table 3; Fig. 5). Although the strongest signal is seen at rs1333049 (P = 1.8 × 10⁻¹⁴), associations are seen for SNPs across >100 kilobases. This region has not been highlighted in previous studies of CAD or myocardial infarction40,43. The region of interest contains the coding sequences of genes for two cyclin dependent kinase inhibitors, CDKN2A (encoding p16INK4a) and CDKN2B (p15INK4b), although the most closely associated SNP is some distance removed. Both genes have multiple isoforms, have an important role in the regulation of the cell cycle and are widely expressed44, with CDKN2B known to be expressed in the macrophages but not the smooth muscle cells of fibrofatty lesions45,46. It is of interest that expression of CDKN2B is induced by transforming growth factor beta (TGF-β) and that the TGF-β signalling system is implicated in the pathogenesis of human atherosclerosis45,46. Besides CDKN2A and CDKN2B, the only other known gene nearby is MTAP which encodes methylthioadenosine phosphorylase, an enzyme that contributes to polyamine metabolism and is important for the salvage of both adenine and methionine. MTAP is ubiquitously expressed, including in the cardiovascular system47. Further work is required to determine whether the CAD association at this locus is mediated through CDKN2A/B, MTAP or some other mechanism. The same region also shows replicated evidence of association to T2D in the WTCCC and other data sets19,21,22, though different SNPs seem to be involved.

None of the loci showing more modest associations with CAD (Table 4) includes genes hitherto strongly implicated in the pathogenesis of CAD. A potentially interesting association is at rs6922269 (P = 6.3 × 10-6), an intronic SNP in MTHFD1L, which encodes methylenetetrahydrofolate dehydrogenase (NADP+-dependent) 1-like, the mitochondrial isozyme of C1-tetrahydrofolate (THF) synthase48,49. C1-THF synthases interconvert the one carbon units carried by the biologically active form of folic acid, C1-tetrahydrofolate. These are used in a variety of cellular processes including purine and methionine synthesis48. Another enzyme in the same pathway, methylene THF reductase (encoded by MTHFR) is subject to a common mutation which influences plasma homocysteine level50 and has been associated with increased risk of coronary and other atherosclerotic disease51. The possibility of a link between variants in MTHFD1L and CAD risk is supported by evidence that MTHFD1L activity also contributes to plasma homocysteine52 and that defects in the MTHFD1L pathway may increase plasma homocysteine level48,53.

An intronic SNP in ADAMTS17 (a disintegrin and metalloproteinase with thrombospondin motifs 17), which showed modest association (rs1994016; P = 1.1 × 10-4) in our primary analysis, showed a much stronger association in the expanded reference group analysis (see below and Supplementary Table 9). Although the specific function of ADAMTS17 has not been determined, other members of the ADAMTS family have been implicated in vascular extracellular matrix degradation, vascular remodelling and atherosclerosis54,55.

Crohn’s disease (CD)

Crohn’s disease is a common form of chronic inflammatory bowel disease56. The pathogenic mechanisms are poorly understood, but probably involve a dysregulated immune response to commensal intestinal bacteria and possibly defects in mucosal barrier function or bacterial clearance57. Genetic predisposition to CD is suggested by a λs of 17–35 and by twin studies that contrast monozygotic concordance rates of 50% with only 10% in dizygotic pairs58,59.

A number of CD-susceptibility loci have previously been defined, and all of these generate strong signals in our data (Table 2). In 2001, positional cloning identified CARD15 (caspase recruitment domain family, member 15; NOD2) as the first confirmed CD-susceptibility gene60,61. In the present study, this locus is represented by rs17221417 (P = 9.4 × 10-12). A second association, on chromosome 5q31 (ref. 62) has been widely replicated, although the identity of the causative gene is disputed owing to extensive regional linkage disequilibrium63. Here, the previously described risk haplotype is tagged by rs6596075 (P = 5.4 × 10-7).

More recent studies have identified four further CD-susceptibility loci, all of which are strongly replicated in the present study. The association between CD and SNPs within IL23R (interleukin 23 receptor)63 is here represented by a cluster of associated SNPs, including rs11805303 (P = 6.5 × 10⁻¹³). The strongest signal for CD in the present scan (at rs10210302; P = 7.1 × 10⁻¹⁴) maps to the ATG16L1 (ATG16 autophagy related 16-like 1) gene and is in strong linkage disequilibrium (r² = 0.97) with a non-synonymous SNP (T300A, rs2241880) associated with CD in a German non-synonymous SNP scan64. The third, a recently detected signal65, is a locus on chromosome 10q21 around rs10761659 (P = 2.7 × 10⁻⁷), a non-coding intergenic SNP mapping 14 kb telomeric to the gene ZNF365 and 55 kb centromeric to the pseudogene antiquitin-like 4. Finally, strong association with a cluster of SNPs around rs17234657 (P = 2.1 × 10⁻¹³), within a 1.2-Mb gene desert on chromosome 5p13.1, recapitulates the finding of a recent GWA study66.

The current study identifies four further new strong association signals in CD, located on chromosomes 3p21, 5q33, 10q24 and 18p11 (Table 3; Fig. 5). Successful replication for all four loci is reported elsewhere23.

The first of these includes several SNPs around IRGM (immunity-related guanosine triphosphatase; the human homologue of the mouse Irgm/Lrg47), the strongest signal being at rs1000113 (P = 5.1 × 10-8). IRGM encodes a GTP-binding protein which induces autophagy and is involved in elimination of intracellular bacteria, including Mycobacterium tuberculosis67. Reduced function and/or activity of this gene would be expected to lead to persistence of intracellular bacteria, consistent with existing models of CD pathogenesis57 and the recent ATG16L1 association64 (see above).

The second novel CD association is seen at rs9858542 (P = 7.7 × 10-7), a synonymous coding SNP within the BSN (bassoon) gene on chromosome 3p21. BSN is thought to encode a scaffold protein expressed in brain and involved in neurotransmitter release; a more plausible regional candidate is MST1 (macrophage stimulating 1), which encodes a protein influencing motile activity and phagocytosis by resident peritoneal macrophages68.

The third novel association involves a cluster of SNPs around rs10883365 (P = 1.4 × 10-8) on chromosome 10q24.2. The most credible candidate here is the NKX2-3 (NK2 transcription factor related, locus 3) gene, a member of the NKX family of homeodomain-containing transcription factors. Targeted disruption of the murine homologue of NKX2-3 results in defective development of the intestine and secondary lymphoid organs69. Abnormal expression of NKX2-3 may alter gut migration of antigen-responsive lymphocytes and influence the intestinal inflammatory response.

The final novel association, at rs2542151 (P = 4.6 × 10-8) maps 5.5-kb upstream of PTPN2 (protein tyrosine phosphatase, non-receptor type 2) on chromosome 18p11. PTPN2 encodes the T cell protein tyrosine phosphatase TCPTP, a key negative regulator of inflammatory responses. The same locus also shows strong association with T1D susceptibility (trend test P = 1.9 × 10-6) and a consistent, though weaker, association with RA (P = 1.9 × 10-2), supporting the existence of overlapping pathways in the pathogenesis of very distinct inflammatory phenotypes (combined trend test P value for all three diseases = 9 × 10-8) (Table 3; ref. 10).

Several further loci generating less strong evidence for association are of interest on the basis of their biological candidacy (Table 4). For example, rs9469220 (P = 8.7 × 10-7) mapping to the human leukocyte antigen (HLA) system class II region was detected in the ‘second tier’ of associations (Table 4). This suggests a significant contribution of HLA to CD-susceptibility, though less marked than seen in classical autoimmune conditions such as RA and T1D. Another interesting candidate flagged in Table 4 is TNFAIP3 (TNFα induced protein 3), the closest gene to rs7753394 on chromosome 6q23. The protein product inhibits TNFα-induced NFκB-dependent gene expression by interfering with RIP- or TRAF-2-mediated transactivation signals—hence interacting with the same pathway as CARD15 (NOD2). Markers with lower levels of significance include rs6478108 (P = 9.0 × 10-5) within TNFSF15 (tumour necrosis factor super family, member 15), previously reported associated with CD70; and rs3816769 (P = 3.1 × 10-5) which maps within STAT3 (signal transducers and activator of transcription, member 3). On the X chromosome rs2807261 (P = 1.3 × 10-7) maps 50-kb from the gene CD40LG (CD40 ligand—previously known as TNF superfamily, member 5), implicated in the regulation of B-cell proliferation, adhesion and immunoglobulin class switching71. As described in the section on T1D, a modest association between CD and SNPs in the vicinity of the PTPN11 gene on chromosome 12q24 (P = 1.5 × 10-3) probably reflects a locus influencing general autoimmune predisposition.

An emerging theme from molecular genetic studies of CD is the importance of defects in autophagy and the processing of phagocytosed bacteria. A number of other specific components within innate and adaptive immune pathways are also highlighted.

Hypertension (HT)

Hypertension refers to a clinically significant increase in blood pressure and constitutes an important risk factor for cardiovascular disease (http://www.who.int/whr/2002/en/; ref. 72). Lifestyle exposures that elevate blood pressure, including sodium intake, alcohol and excess weight73 are well-described risk factors. Genetic factors are also important74,75. Estimates of λs are approximately 2.5–3.5.

Experimental models have highlighted a number of quantitative trait loci but these have yet to translate into insights into human hypertension76. Linkage studies are consistent with susceptibility genes of modest effect size77 and well-replicated findings have yet to emerge from association approaches.

None of the variants previously associated with HT showed evidence for association in our study, although we note that some, such as those in the promoter of the WNK1 (WNK lysine deficient protein kinase 1) gene78,79, are not well tagged by the Affymetrix chip.

For HT there were no SNPs with significance below 5 × 10⁻⁷ (Table 3) but the number and distribution of association signals in the range 10⁻⁴ to 10⁻⁷ were similar to those of the other diseases studied (Table 4 and Supplementary Table 7). There are several possible explanations. First, HT may have fewer common risk alleles of larger effect sizes than some of the other complex phenotypes. If so, then identification of susceptibility variants for HT is likely to be reliant on the synthesis of findings from multiple large-scale studies. Second, the present study may have failed to detect genuine common susceptibility variants of large effect size because they happened to be poorly tagged by the set of SNPs genotyped in the current study. If so, further rounds of genotyping using resources that offer increased density (or complementary SNP sets), and/or improved analytical methods (for example, imputation-based) should facilitate their discovery. Third, study of HT may be more susceptible than other phenotypes to the diluting effects of misclassification bias due to the presence of hypertensive individuals within the control samples. If so, power can be improved in future studies by use of controls specifically screened to exclude individuals with elevated blood pressure.

The most strongly associated SNPs (Table 4) do not identify genes from physiological systems previously implicated by clinical or genetic studies in hypertension. The strongest signal overall is with rs2820037 on 1q43 (genotypic test, P = 7.7 × 10-7). The closest genes are RYR2 (encoding the ryanodine receptor 2), mutations in which are associated with stress-induced polymorphic ventricular tachycardia and arrhythmogenic right ventricular dysplasia80,81; CHRM3, encoding the cholinergic receptor muscarinic 3, a member of the G protein-coupled receptor family32; and ZP4, the product of which is zona pellucida glycoprotein 481. The strong association signals on the X chromosome using an expanded reference group (see below and Supplementary Table 9) are of substantial interest but they do not identify known genes of obvious relevance to HT.

Rheumatoid arthritis (RA)

Rheumatoid arthritis is a chronic inflammatory disease characterized by destruction of the synovial joints resulting in severe disability, particularly in patients who remain refractory to available therapies82. Susceptibility to, and severity of, RA are determined by both genetic and environmental factors, with λs estimates ranging from 5 to 10 (ref. 83).

An association between RA and alleles of the HLA-DRB1 locus has long been established84. Despite extensive linkage85,86,87 and association studies, only one other RA susceptibility locus has been convincingly identified in Caucasians. In common with several autoimmune diseases including T1D, carriage of the T allele of the rs2476601 SNP in the PTPN22 (protein tyrosine phosphatase, non-receptor type 22) gene has been reproducibly associated with RA, conferring a genetic relative risk of approximately 1.8 (refs 88, 89). These known associations with HLA-DRB1 and PTPN22 explain around 50% of the familial aggregation of RA.

Both these previous associations emerge strongly here (Table 2). The most associated marker within PTPN22 (rs6679677: chromosome 1p13) is perfectly correlated (HapMap CEU data r2 = 1) with the functionally relevant SNP (rs2476601) described previously, and the effect size is consistent with previous estimates89. Amongst other putative RA susceptibility genes, two SNPs mapping to CTLA-4 (cytotoxic T-lymphocyte associated 4) rs3087243 and rs11571300 were only nominally significant (P = 0.085 and P = 0.034, respectively) (Supplementary Table 10).
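The r² values quoted for proxy SNPs can be made concrete with a minimal sketch of the standard pairwise linkage disequilibrium calculation. The function name and the haplotype frequencies below are hypothetical and chosen only for illustration; they are not taken from HapMap or from this study.

```python
def ld_r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Squared correlation (r^2) between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2
    p_a, p_b: marginal frequencies of allele A and allele B
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


# Hypothetical frequencies: the minor alleles always travel together,
# so one SNP is a perfect proxy for the other (r^2 close to 1).
perfect_proxy = ld_r_squared(p_ab=0.10, p_a=0.10, p_b=0.10)

# Hypothetical frequencies: haplotype frequency equals the product of the
# allele frequencies, so the SNPs are uncorrelated (r^2 close to 0).
unlinked = ld_r_squared(p_ab=0.02, p_a=0.10, p_b=0.20)
```

When r² is 1, as reported for rs6679677 and rs2476601, a test at the typed marker carries the same information as a test at the functional SNP.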

RA was the sole disease for which the sex-differentiated analysis generated a strong signal due to different genetic effects in males and females. The SNP rs11761231 (chromosome 7) generates a P value of 3.9 × 10-7 for the 2-degrees of freedom (d.f.) sex-differentiated test which combines trend tests in males and females (Table 3). (The trend test ignoring the sex of the individuals has a P value of 1.7 × 10-6.) This genotype has no effect on disease status in males, but a strong apparently additive effect in females (P value in a logistic regression model with additive log-odds is 0.68 in males and 6.8 × 10-8 in females, additive OR for females 1.32), and may represent one of the first sex-differentiated effects in human diseases. Cluster plots for this SNP seem good, but it is surrounded by recombination hotspots and has no other SNPs on the Affymetrix chip with r2 > 0.1 (Supplementary Fig. 11). Some caution is therefore required, but this represents a potentially interesting finding which warrants further investigation, particularly given the sex-related prevalence difference characteristic of this condition.
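As a rough illustration of how a 2-d.f. sex-differentiated test can be assembled, the sketch below assumes that the statistic is simply the sum of the independent 1-d.f. trend chi-squares from males and females; that construction, and both function names, are assumptions made for illustration rather than a description of the exact procedure used here.

```python
import math
from statistics import NormalDist


def chi2_from_p_1df(p: float) -> float:
    """1-d.f. chi-square statistic corresponding to a two-sided P value."""
    z = -NormalDist().inv_cdf(p / 2.0)  # lower tail keeps precision for tiny P
    return z * z


def sex_differentiated_p(chi2_males: float, chi2_females: float) -> float:
    """P value for the sum of two independent 1-d.f. chi-squares.

    The sum follows a chi-square distribution on 2 d.f. under the null,
    whose upper tail probability is exp(-x / 2).
    """
    return math.exp(-(chi2_males + chi2_females) / 2.0)


# Illustration only: back-convert the per-sex P values quoted in the text.
# Those come from logistic regression, not the trend test, so the result
# is a consistency check rather than a reproduction of the reported value.
combined_p = sex_differentiated_p(chi2_from_p_1df(0.68), chi2_from_p_1df(6.8e-8))
```

Under this construction a null result in one sex costs one degree of freedom but does not cancel a strong effect in the other, which is why the test is sensitive to effects confined to females.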

None of the 9 SNPs with nominal P values in the range 10⁻⁵ to 5 × 10⁻⁷ (Table 4) map to loci previously associated with RA. Of particular interest is the association of SNPs mapping close to both the alpha and beta chains of the IL2 receptor (rs2104286 in the case of IL2RA; rs743777 in the case of IL2RB). The IL2 receptor mediates IL2 stimulation of T lymphocytes and is thereby thought to have an important role in preventing autoimmunity. A rare 4-base-pair deletion of IL2RA has been associated with development of severe autoimmune disease90, and there is evidence (from previous data91, and from this study and its follow-up) that SNPs within the IL2RA gene region are associated with T1D (see also T1D section).

Several of the SNPs with nominal significance in the range 10-4 to 10-5 (Supplementary Table 7) map to genes with plausible biological relevance. Examples include SNPs within genes implicated in the TNF pathway (for example, rs2771369 in TNFAIP2 (tumour necrosis factor, alpha-induced protein 2)) or in the regulation of T-cell function (rs854350 in GZMB (granzyme B) and rs4750316 in PRKCQ (protein kinase C, theta)). The association with rs10786617 in KAZALD1 (Kazal-type serine protease inhibitor domain-containing protein 1 precursor), a gene whose product is known to have a role in bone regeneration after injury, may be relevant to the development of bone erosions in RA.

RA and T1D were already known to have two disease susceptibility genes in common: at the MHC, and at PTPN22. As detailed elsewhere, our study provides data indicating that this list can be extended to include variants around IL2RA (chromosome 10p15), PTPN2 (chromosome 18p11) and the chromosome 12q24 region (Supplementary Table 11), all apparently novel in RA.

Type 1 diabetes (T1D)

Type 1 diabetes is a chronic autoimmune disorder with onset usually in childhood92. The λs for T1D is 15 and twin data suggest that over 85% of the phenotypic variance is due to genetic factors93. There are six genes/regions for which there is strong pre-existing statistical support for a role in T1D-susceptibility: these are the major histocompatibility complex (MHC), the genes encoding insulin, CTLA-4 (cytotoxic T-lymphocyte associated 4) and PTPN22 (protein tyrosine phosphatase, non-receptor type 22), and the regions around the interleukin 2 receptor alpha (IL2RA/CD25) and interferon-induced helicase 1 genes (IFIH1/MDA5)94. However, these signals can explain only part of the familial aggregation of T1D. Five of these previously identified associations were detected in this scan (P ≤ 0.001) (Table 2 and Supplementary Table 10), the exception being the INS gene discussed above.

In this study, single-point analyses revealed three novel regions (on chromosomes 12q13, 12q24 and 16p13) showing strong evidence of association (P < 5 × 10-7; Table 3). Four further regions attained similar levels of significance either through multilocus analyses (chromosomes 4q27 and 12p13: Table 3, Supplementary Fig. 12), or through the combined analysis of autoimmune cases (chromosomes 18p11 and the 10p15 CD25 region: Table 3, Supplementary Fig. 13). The associations with T1D for chromosomes 12q13, 12q24, 16p13 and 18p11 have been confirmed in independent and multiple populations10.

The two signals on chromosome 12 (at 12q13 and 12q24) map to regions of extensive linkage disequilibrium covering more than ten genes (Fig. 5). Several of these represent functional candidates because of their presumed roles in immune signalling, considered to be a major feature of T1D-susceptibility. These include ERBB3 (receptor tyrosine-protein kinase erbB-3 precursor) at 12q13 and SH2B3/LNK (SH2B adaptor protein 3), TRAFD1 (TRAF-type zinc finger domain containing 1) and PTPN11 (protein tyrosine phosphatase, non-receptor type 11) at 12q24. For these signal regions in particular, extensive resequencing, further genotyping and targeted functional studies will be essential steps in identifying which gene, or genes, are causal95. Of those listed, PTPN11 is a particularly attractive candidate given a major role in insulin and immune signalling96. It is also a member of the same family of regulatory phosphatases as PTPN22, already established as an important susceptibility gene for T1D and other autoimmune diseases94,97. Indeed, the 12q24 variant most associated with T1D also features in both the CD and RA scans, generating a combined signal for all autoimmune cases of 9.3 × 10-10 (Supplementary Table 11).

In contrast, available annotations suggest that the 16p13 region contains only two genes of unknown function, KIAA0350 and dexamethasone-induced transcript (Fig. 5). Also, the region of association identified on 18p11 (Supplementary Fig. 14), which seems to confer susceptibility to all three autoimmune conditions studied (combined trend test P = 9 × 10-8, P = 4.6 × 10-8 for CD, 1.9 × 10-2 for RA, and 1.9 × 10-6 for T1D: Supplementary Table 11), maps to a single gene, PTPN2 (protein tyrosine phosphatase, non-receptor type 2), a member of the same family as PTPN22 and PTPN11 and involved in immune regulation96.

Our scan found associations with SNPs within the chromosome 10p15 region containing CD25, encoding the high-affinity receptor for IL-2. This is consistent with a previous report of associations of this region with T1D91. The CD25 region has previously been shown to be associated with Graves’ disease98 and the present study also provides evidence of association with RA (combined trend test P = 5 × 10-8, P = 7 × 10-6 for RA and T1D separately, Supplementary Table 11). This finding has clear biological connections to the evidence of association between T1D and a region of 4q27 revealed by the multilocus analysis (Supplementary Table 12, Supplementary Fig. 12). This region contains the genes encoding both IL-2 and IL-21. Together with studies in the NOD (nonobese diabetic) mouse model of T1D, which have shown that a major non-MHC locus (Idd3) reflects regulatory variation of the Il2 gene99, our results point to the primary importance of the IL-2 pathway in T1D and other autoimmune diseases.

One further region deserves comment. In the multilocus analysis, there was increased support for a region on chromosome 12p13 containing several candidate genes, including CD69 (CD69 antigen (p60, early T-cell activation antigen)) and multiple CLEC (C-type lectin domain family) genes. In contrast to the chromosome 4 region where the effect of imputation is to tip an already-strong signal (5.01 × 10-7 for typed rs17388568, trend test) over the arbitrary threshold of 5 × 10-7, the 12p13 locus involves a more marked change between imputed and actual (7.2 × 10-7 for rs11052552, general test). Replication studies of this imputed SNP to date have produced equivocal results (for details see ref. 10).

Type 2 diabetes (T2D)

Type 2 diabetes is a chronic metabolic disorder typically first diagnosed in the middle to late adult years100. Strongly associated with obesity, the condition features defects in both the secretion and peripheral actions of insulin101. The appreciable familial aggregation of T2D (an estimated λs of 3.0 in European individuals)73 reflects both shared family environment and genetic predisposition. Heritability values vary widely with most estimates between 30 and 70%101.

To date, robust, widely replicated associations in non-isolate populations are limited to variants in three genes: PPARG (encoding the peroxisomal proliferative activated receptor gamma; P12A102), KCNJ11 (the inwardly-rectifying Kir6.2 component of the pancreatic beta-cell KATP channel; E23K103) and TCF7L2 (transcription factor 7-like 2; rs7903146 (refs 104, 105)).

All three of these signals are detected here with effect-sizes consistent with previous reports (Table 2). A cluster of SNPs on chromosome 10q, within TCF7L2, represented by rs4506565 (trend test, OR 1.36, P = 5.7 × 10-13) generates the strongest association signal for T2D (Table 3, Fig. 5). Rs4506565 is in tight linkage disequilibrium (r2 of 0.92 in the CEU component of HapMap) with rs7903146, the variant with the strongest aetiological claims104,106. In fact, our imputation analysis confirms that rs7903146, though unrepresented on the chip, is responsible for the strongest association effect in this region (Fig. 5). TCF7L2 acts within the WNT-signalling pathway, and effects on diabetes risk seem to be mediated predominantly through beta-cell dysfunction107.

As expected, given existing effect-size estimates, the signals associated with variants within the other established T2D-susceptibility genes, KCNJ11 (rs5215, r2 of 0.9 with rs5219, E23K) and PPARG (rs17036328, r2 of 1 with rs1801282, P12A) are less dramatic (trend test, OR 1.15 and 1.23 respectively, both P = 0.001). These examples illustrate how genuine disease-susceptibility variants can generate association signals which would not attract immediate attention for follow-up in the genomewide context.
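Why modest odds ratios produce unremarkable signals can be sketched with an approximate power calculation for a two-sided allelic test. The normal approximation, the function names and the control allele frequency of 0.30 are assumptions for illustration; the sample sizes are the nominal 2,000 cases and 3,000 controls, and the threshold is the 5 × 10⁻⁷ level used in this paper.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def case_allele_freq(control_freq: float, odds_ratio: float) -> float:
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def allelic_test_power(control_freq: float, odds_ratio: float,
                       n_cases: int, n_controls: int, alpha: float) -> float:
    """Approximate power of a two-sided comparison of allele frequencies."""
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    # Each individual contributes two alleles.
    se = math.sqrt(p1 * (1.0 - p1) / (2 * n_cases)
                   + p0 * (1.0 - p0) / (2 * n_controls))
    shift = (p1 - p0) / se                        # expected z statistic
    z_crit = -_STD_NORMAL.inv_cdf(alpha / 2.0)    # two-sided critical value
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


# Hypothetical control allele frequency of 0.30 for both scenarios.
power_or_136 = allelic_test_power(0.30, 1.36, 2000, 3000, 5e-7)
power_or_115 = allelic_test_power(0.30, 1.15, 2000, 3000, 5e-7)
```

Under these assumptions an odds ratio of the size seen at TCF7L2 is detected with high probability, whereas one of the size seen at KCNJ11 rarely crosses the threshold, in line with the observation above.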

Apart from TCF7L2, the scan reveals two signals for T2D with P values less than 5 × 10-7 (Table 3, Fig. 5). The first of these maps within the FTO (fat-mass and obesity-associated) gene on chromosome 16q. Several adjacent SNPs (including rs9939609, rs7193144 and rs8050136) generate signals characterized by a per-allele OR for T2D of 1.25 and a risk-allele frequency of 40% in controls. As recently described in follow-up studies prompted by this finding, the effect of these variants on T2D-risk has been replicated and is mediated entirely by their marked effect on adiposity24.

The second of these signals (chromosome 6p22) features a cluster of highly associated SNPs (including rs9465871) with risk-allele frequencies between 18 and 35%, mapping to intron 5 of the CDKAL1 (CDK5 regulatory subunit associated protein 1-like 1) gene. Although the function of CDKAL1 is not known, it shares homology at the protein domain level with CDK5 regulatory subunit associated protein 1 (CDK5RAP1). CDK5RAP1 is known to inhibit the activation of CDK5, a cyclin-dependent kinase which has been implicated in the maintenance of normal beta-cell function108. Our own follow-up studies, and scans by other groups, have shown strong replication of this finding19,20,21,22. The effect of this variant on T2D-risk shows significant departures from additivity (Supplementary Table 8).

One notable inclusion amongst the variants with more modest association signals is a cluster of SNPs on chromosome 10 including rs10748582 and rs7923866, which generate trend test P values between 10-4 and 10-5. This cluster maps in the vicinity of the HHEX (homeobox, hematopoietically expressed) and IDE (insulin-degrading enzyme) genes, in a region recently highlighted in a GWA scan for T2D performed in 1363 subjects of French origin109. The SNPs showing association in our data are proxies for those reported in the French study and generate similar effect-size estimates for T2D.

Of the three other regions highlighted by the French scan109, none can be confirmed by our data. The SNP in SLC30A8 associated with T2D in the French report (rs13266634) is poorly correlated with SNPs on the Affymetrix chip (r2 < 0.01), and extensive recombination events in the region limit the value of data-imputation methods. Coverage of the LOC387761 and EXT2 signals is considerably better, but, for these, neither genotyped nor imputed SNPs show evidence for association with T2D.

WTCCC data contributed to identification of two additional robustly replicating T2D signals, mapping to the IGF2BP2 gene and CDKN2A/CDKN2B regions19,21,22, although neither generated impressive P values on the primary scan analysis (neither single-point P was <10-4). The latter signal maps to the same region as the CAD signal on chromosome 9 though different SNPs are involved. The other SNPs in Table 4 do not map to genes or regions previously implicated in T2D pathogenesis, and replication efforts to date have not identified any confirmed signals19.

Expanded reference group analyses

For a fixed number of cases, power of a case-control study can be increased by enlarging the reference group. Our main analyses used a control:case ratio of 1.5:1 for each disease. The availability of the other 6 disease data sets gave us the opportunity to expand the reference group up to a ratio of 7.5:1, with potential reciprocal benefits for the analysis of each disease. For BD and T2D the expanded reference group comprised the 58C and UKBS controls supplemented by the other 6 disease sets; for CAD and HT this expanded reference group was reduced to exclude HT and CAD respectively; for CD, RA and T1D, the reference group was augmented only by the cases from the non-autoimmune diseases.
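The reference:case ratios quoted above follow from simple arithmetic, sketched below using the nominal design figures of 2,000 cases per disease and 3,000 shared controls. The constants and function names are illustrative, and the numbers of samples actually analysed after quality control will differ from these round figures.

```python
# Nominal design sizes; post-QC sample counts will differ.
N_CASES_PER_DISEASE = 2000
N_SHARED_CONTROLS = 3000  # 58C plus UKBS
DISEASES = ("BD", "CAD", "CD", "HT", "RA", "T1D", "T2D")
AUTOIMMUNE = {"CD", "RA", "T1D"}


def reference_diseases(test_disease: str) -> list:
    """Disease collections added to the controls for a given test disease."""
    others = [d for d in DISEASES if d != test_disease]
    if test_disease in AUTOIMMUNE:
        # Augment only with cases from the non-autoimmune diseases.
        return [d for d in others if d not in AUTOIMMUNE]
    if test_disease == "CAD":
        return [d for d in others if d != "HT"]
    if test_disease == "HT":
        return [d for d in others if d != "CAD"]
    return others  # BD and T2D use all six other collections


def reference_to_case_ratio(test_disease: str) -> float:
    """Size of the expanded reference group relative to the case group."""
    n_reference = (N_SHARED_CONTROLS
                   + N_CASES_PER_DISEASE * len(reference_diseases(test_disease)))
    return n_reference / N_CASES_PER_DISEASE


primary_ratio = N_SHARED_CONTROLS / N_CASES_PER_DISEASE  # 1.5
ratios = {d: reference_to_case_ratio(d) for d in DISEASES}
```

With these nominal sizes the ratio reaches 7.5:1 only for BD and T2D; the exclusions applied to CAD, HT and the autoimmune diseases give smaller reference groups.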

The utility of the expanded reference group approach was demonstrated by increased evidence for association at most of the loci that received strongest support from our primary analysis, including many of the signals at loci known to show robust association in T1D, T2D and CD (Supplementary Table 9). Additionally, this analysis elevated several loci with modest levels of statistical significance in the primary analysis, to the top tier of statistical significance (P < 5 × 10-7).

Our data indicate that this approach may be a useful adjunct to conventional analysis and that loci identified as highly significant should be considered for follow up. There are two important caveats. First, susceptibility genes that influence both the test disease and one or more of the diseases included in the reference group will cause loss of power. Second, a ‘mirror-image’ effect could occur whereby a strong association within the expanded reference sample (for example, HLA in autoimmune diseases) causes spurious association with the opposite allele in the test disease. Thus, a positive association using an expanded reference group must be interpreted within the context of association findings in the diseases included within the reference group.

Disease models

It is of interest to consider which statistical models best describe the data at and between loci that are strongly associated with disease status. Biological interpretation of these statistical models is not straightforward but they can help in choosing more powerful statistical tools for detecting associations.

First, consider separately each of the 19 non-MHC SNPs showing strong evidence for association on either the trend or genotypic test in Table 3. For four of these 19, the P value on the 2-d.f. genotypic test was smaller than that on the 1-d.f. trend test (Table 3). When comparing disease models, these were also the four SNPs with evidence for departure from a simple model in which odds of disease increase multiplicatively with the number of copies of the risk allele (Supplementary Table 8). This supports our view that the genotypic test should be carried out in addition to the trend test, although its results should perhaps be viewed more cautiously for two reasons: it is more susceptible to genotyping errors; and (on the basis of our findings) experience does not favour strong dominance effects.
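The contrast between the two tests can be illustrated with a minimal sketch that computes a Cochran-Armitage trend test (1 d.f.) and a Pearson chi-square genotypic test (2 d.f.) from a 2 × 3 table of genotype counts. These are textbook forms of the tests, not necessarily the exact implementation used in this study, and the genotype counts are hypothetical.

```python
import math


def trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test; returns (chi-square, P) on 1 d.f."""
    totals = [r + s for r, s in zip(cases, controls)]
    n, n_cases = sum(totals), sum(cases)
    sum_x = sum(x * t for x, t in zip(scores, totals))
    sum_xx = sum(x * x * t for x, t in zip(scores, totals))
    sum_xy = sum(x * r for x, r in zip(scores, cases))
    numerator = n * sum_xy - n_cases * sum_x
    denominator = (n * sum_xx - sum_x ** 2) * n_cases * (n - n_cases)
    chi2 = n * numerator ** 2 / denominator  # N times squared correlation
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def genotypic_test(cases, controls):
    """Pearson chi-square on the 2x3 table; returns (chi-square, P) on 2 d.f."""
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    chi2 = 0.0
    for r, s in zip(cases, controls):
        column_total = r + s  # must be non-zero for every genotype
        expected_cases = column_total * n_cases / n
        expected_controls = column_total * n_controls / n
        chi2 += ((r - expected_cases) ** 2 / expected_cases
                 + (s - expected_controls) ** 2 / expected_controls)
    return chi2, math.exp(-chi2 / 2.0)


# Hypothetical counts (0, 1, 2 copies of the risk allele) with a
# dominant-like pattern: carriers of one or two copies are similarly enriched.
example_cases = (880, 920, 200)
example_controls = (1470, 1260, 270)
trend_chi2, trend_p = trend_test(example_cases, example_controls)
geno_chi2, geno_p = genotypic_test(example_cases, example_controls)
```

The genotypic chi-square is never smaller than the trend chi-square, but it is referred to a distribution with an extra degree of freedom, so its P value is smaller only when the data depart sufficiently from the multiplicative model.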

A separate question relates to the best models for the way in which different loci combine to affect susceptibility to a disease and, as a consequence, to the extent to which methods explicitly allowing interactions between loci should be employed to detect associations110. None of the analyses reported here includes such interactions, so we are not well placed to address the general question. Nonetheless, within each collection with multiple associated regions (CD, T1D and T2D) we considered all pairs of non-MHC SNPs in Table 3 and looked for a departure from the model in which the two loci combine to increase log-odds in an additive fashion. We found suggestive evidence of a departure from multilocus additivity between rs1000113 and rs10761659 in CD (unadjusted P value = 0.002) and between rs9465871 and rs4506565 in T2D (unadjusted P value = 0.004). Further investigation of this question, preferably on unbiased sets of disease loci found through the application of single locus and interaction-based approaches, would seem warranted.
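The idea of a departure from additivity on the log-odds scale can be sketched in a deliberately simplified form. The example below collapses each locus to carrier versus non-carrier status and applies a Wald test to the interaction contrast of the four log-odds; this is a simplification of the per-allele model described above, and the function name and all counts are hypothetical.

```python
import math


def interaction_test(cells):
    """Wald test for departure from additivity of log-odds at two loci.

    cells maps (carrier_at_locus_1, carrier_at_locus_2), each 0 or 1,
    to a pair (n_cases, n_controls). Returns (contrast, two-sided P).
    """
    def log_odds(key):
        n_cases, n_controls = cells[key]
        return math.log(n_cases / n_controls)

    # Zero when the joint effect equals the sum of the single-locus effects.
    contrast = (log_odds((1, 1)) - log_odds((1, 0))
                - log_odds((0, 1)) + log_odds((0, 0)))
    # Standard error from the reciprocals of all eight cell counts.
    se = math.sqrt(sum(1.0 / count for pair in cells.values() for count in pair))
    z = contrast / se
    return contrast, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical table in which odds multiply exactly (odds 1, 2, 3 and 6),
# so the contrast is zero and there is no evidence of interaction.
additive_cells = {
    (0, 0): (100, 100),
    (1, 0): (200, 100),
    (0, 1): (150, 50),
    (1, 1): (300, 50),
}
additive_contrast, additive_p = interaction_test(additive_cells)
```

A full analysis would fit a logistic regression with genotype counts at each SNP and an interaction term; the 2 × 2 version here only illustrates what is being tested.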

Discussion

We have studied seven common familial diseases by genome-wide association analysis in 16,179 individuals. Our findings inform understanding of the genetic basis of the diseases concerned and provide methodological insights relevant to the pursuit of GWA studies in general.

A simple but important observation is that GWA analysis provides a highly effective approach for exploring the genetic underpinnings of common familial diseases. Our yield of novel, highly significant association findings is comparable to, or exceeds, the number of those hitherto generated by candidate gene or positional cloning efforts. For many of the compelling signals, replication has already been obtained, including regions on chromosomes 3p21, 5q33, 10q24 and 18p11 for CD23; 12q13, 12q24, 16p13 and 18p11 in T1D10; and 6p22 and 16q12 in T2D19,20,21,22,24. For others, replication is required to establish a definitive relationship with disease. Additional findings of particular interest include the identification of several loci that seem to influence susceptibility to multiple autoimmune diseases, and the suggestion of a novel locus for RA which shows sex-specific effects.

Our study enables us to make several general recommendations relevant to GWA studies. The first relates to the importance of careful quality control. In such large data sets, small systematic differences can readily produce effects capable of obscuring the true associations being sought111,112. We implemented extensive quality control checks to minimize differences in sample DNA concentration, quality and handling procedures and combined a new genotype-calling algorithm (CHIAMO) with a set of filtering heuristics to select SNPs for further analysis. Given that infallible detection of incorrect genotype calls is not yet possible, the criteria used for SNP exclusion need to strike a compromise between stringency (which may discard true signals or generate spurious positives through differential missingness) and leniency (with the danger that true signals are swamped by spurious findings due to poor genotype calling). As such, systematic visual inspection of cluster plots for SNPs of interest remains an integral part of the quality control process.

The potential for population structure to undermine inferences in case-control association studies has long been debated113 but limited empirical data have been available to assess the issue. Our study highlighted several loci, some known and some new, which demonstrate substantial geographical variation in allele frequencies across Britain (Table 1), most probably due to natural selection in ancestral populations. Outside these loci, the effects of population structure are relatively minor, and do not represent a major source of confounding, provided that individuals with appreciable non-European ancestry are excluded. Although these conclusions may not generalize to studies in other locations, this finding reinforces the logistical and economic benefits of the case-control design over alternatives (such as family-based association studies).

Our study allowed us to address another important methodological issue: the adequacy, or otherwise, of using a common set of controls, rather than a sample recruited explicitly for use with a defined disease sample. It is often assumed that failure to match cases and controls for socio-demographic variables will lead to substantial inflation of the type I error rate. Our study demonstrates that, within the context of large-scale genetic association studies, for British populations at least, this concern has been overstated. A related argument against use of population controls relates to the perceived impact of misclassification bias when a proportion of controls meet the criteria used to define cases. However, the consequent loss of power is modest unless the trait of interest is very common6. Given the above, the present study provides a compelling case for both the suitability and efficiency of the common control design in Britain and warrants its serious consideration elsewhere. Further benefits can be expected from use of this common control genotype data set in future GWA studies in Britain. Finally, in failing to detect significant differences in performance between the epidemiological sample (58C) and that derived from blood donors (UKBS), we validate the use of the latter samples for cost-effective, large-scale control DNA provision.

In terms of general biological insights, the most profound relate to inferences about the allelic architecture of common traits. The novel variants we have uncovered are characterized by modest effect size (that is, per-allele ORs between 1.2 and 1.5) and even these estimates are likely to be inflated114. We identified no additional common variants of very large effect (akin to HLA in T1D: Supplementary Fig. 15). The observed distribution of effect sizes is consistent with models based on theoretical considerations and empirical data from animal models87,115,116 that suggest that, for any given trait, there will be few (if any) large effects, a handful of modest effects and a substantial number of genes generating small or very small increases in disease risk.

There are several important corollaries. Notwithstanding the incomplete coverage afforded by the genotyping reagents employed, most of the susceptibility effects yet to be uncovered for these diseases (at least those attributable to, or tagged by, common SNPs) are likely to have effects of similar or smaller magnitude to those we have highlighted. Beyond the signals with the strongest evidence for association, most of which are likely to be real (and many of which have already been confirmed), there will be many additional susceptibility variants for which the WTCCC provides some evidence, but for which extensive replication will be required to establish validity. PPARG and KCNJ11 provide examples of proven susceptibility genes (for T2D) that generated only modest evidence for association within the WTCCC, and which would only have been revealed by such replication efforts. Given the likely preponderance of susceptibility variants of small effect, the potential for identifying further loci is limited only by the clinical resources available for replication (assuming suitable study design, accurate genotyping and appropriate analysis and inference). Provided the attribution of a causal relationship with the trait of interest is robust, even variants of very small effect can offer fundamental biological insights.

The patterns of allelic architecture uncovered mean that replication efforts will need to feature comparably large sample sizes: even if one accepts more relaxed significance thresholds given the prior evidence, one has to consider the inflation in effect-size estimates in the primary study. Caution is required in reaching negative conclusions on the basis of a single failed attempt at replication, or any set of replication attempts that are inadequately powered.

One of our major design considerations was sample size. We set out to include samples larger than those previously examined for genome-wide association, and our results suggest that such large sample sizes were necessary. Even with 2,000 cases and 3,000 controls, adequate power is restricted to common variants of relatively large effect (see Supplementary Table 2). We carried out an experiment to see which SNPs showing strong evidence of association in the full data (that is, signals outside MHC with trend test P < 5 × 10⁻⁷), would have been detected at that same threshold in only a subset of our data (Fig. 6). Because it focuses on a particular but arbitrary P-value threshold, some care is needed in interpreting the figure. Nonetheless, for subsamples of 1,000 cases and 1,000 controls, of the 16 loci detected in the full study, we would have been certain of seeing only 2, with an expectation of about 6; for subsamples of 1,500 cases and 1,500 controls, we could expect to have seen about 9. These figures provide stark evidence that the larger the study sample, the more loci can be expected to reach threshold significance values. Indeed, given the likely distribution of effect sizes for most complex traits (see above), there are strong grounds for the prosecution of GWA studies on an even larger scale than ours, and, wherever possible, combining the results from existing GWA scans performed for the same trait. To assist such efforts, individual level data from this study will be widely available through the Consortium’s Data Access Committee (follow links from http://www.wtccc.org.uk).

Figure 6: Strong associations in subsamples of our data.

For the 16 SNPs in Table 3 (outside the MHC) with P values for the trend test below 5 × 10⁻⁷, we randomly generated 1,000 subsets of our full data set corresponding to case-control studies with different numbers of cases, and the same number of controls (x axis). The y axis gives the proportion of subsamples of a given size in which that SNP achieved a P value for the trend test below 5 × 10⁻⁷. SNPs are numbered according to the row in which they occur in Table 3 (so that, for example, the CAD hit is numbered 2, and the TCF7L2 hit on chromosome 10 for T2D is numbered 20).
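The subsampling experiment summarized in Fig. 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the consortium's code: the trend test is taken to be the standard Cochran-Armitage test with additive scores 0, 1, 2, each subset is assumed to contain equal numbers of cases and controls, and the function names (`trend_test_p`, `detection_proportion`) are hypothetical.

```python
import math
import random


def trend_test_p(case_counts, control_counts):
    """P value of the 1-d.f. Cochran-Armitage trend test.

    case_counts and control_counts are genotype counts ordered as
    (0, 1, 2) copies of the tested allele; additive scores are assumed.
    """
    scores = (0, 1, 2)
    totals = [r + s for r, s in zip(case_counts, control_counts)]
    n_cases, n_all = sum(case_counts), sum(totals)
    n_controls = n_all - n_cases
    sum_xr = sum(x * r for x, r in zip(scores, case_counts))
    sum_xn = sum(x * n for x, n in zip(scores, totals))
    sum_x2n = sum(x * x * n for x, n in zip(scores, totals))
    spread = n_all * sum_x2n - sum_xn ** 2
    if n_cases == 0 or n_controls == 0 or spread == 0:
        return 1.0  # monomorphic or empty table: no evidence either way
    chi2 = (n_all * (n_all * sum_xr - n_cases * sum_xn) ** 2
            / (n_cases * n_controls * spread))
    # Upper tail of the chi-squared distribution on 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2.0))


def genotype_counts(genotypes):
    """Count genotypes coded 0, 1, 2 in a list."""
    return [genotypes.count(g) for g in (0, 1, 2)]


def detection_proportion(cases, controls, n_per_group,
                         n_subsets=1000, threshold=5e-7, seed=0):
    """Proportion of random subsets in which the SNP passes the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_subsets):
        sub_cases = rng.sample(cases, n_per_group)
        sub_controls = rng.sample(controls, n_per_group)
        p = trend_test_p(genotype_counts(sub_cases),
                         genotype_counts(sub_controls))
        if p < threshold:
            hits += 1
    return hits / n_subsets
```

Calling `detection_proportion` for one SNP over a grid of subset sizes gives one curve of the kind plotted in the figure; the 1,000 subsets and the 5 × 10⁻⁷ threshold are the values quoted in the caption.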

In our study, T1D and CD, the conditions showing strongest familial aggregation (as quantified by their sibling relative risks, λs), generated the largest number of highly significant associations. This relationship was not sustained in comparisons between the other five diseases. It is important to recognize that the association signals so far identified account for only a small proportion of overall familiality. There is a disparity in scale between the modest locus-specific λs effects attributable to the identified associations (for instance, the prominent TCF7L2 signal for T2D translates into a λs of only 1.03) and the estimates of overall familiality that reflect the combined effects of all genes and shared family environment. These estimates demonstrate the limited potential of the variants thus far identified (singly or in combination) to provide clinically useful prediction of disease117,118.
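To make the scale of a locus-specific λs concrete, the following sketch computes it for a single biallelic locus from the standard variance-components relation λs = 1 + (VA/2 + VD/4)/K², where K is the prevalence and VA and VD are the additive and dominance variances contributed by the locus. It assumes Hardy–Weinberg proportions and random mating; the function name and the input values in the usage note are illustrative assumptions, not figures taken from this study.

```python
def locus_sibling_relative_risk(risk_allele_freq, f0, f1, f2):
    """Locus-specific sibling relative risk for one biallelic locus.

    f0, f1, f2 are the disease risks (penetrances) for carriers of
    0, 1 and 2 copies of the risk allele. Hardy-Weinberg proportions
    are assumed in the general population.
    """
    p = risk_allele_freq
    q = 1.0 - p
    prevalence = q * q * f0 + 2 * p * q * f1 + p * p * f2
    # Average effect of an allele substitution.
    alpha = p * (f2 - f1) + q * (f1 - f0)
    v_additive = 2 * p * q * alpha ** 2
    v_dominance = (p * q * (f2 - 2 * f1 + f0)) ** 2
    return 1.0 + (0.5 * v_additive + 0.25 * v_dominance) / prevalence ** 2
```

With a hypothetical risk-allele frequency of 0.3, a baseline risk of 0.05 and a multiplicative per-allele relative risk of 1.4 (so `f1 = 0.07`, `f2 = 0.098`), the function returns a value only a few per cent above 1, which illustrates why even a prominent common variant accounts for little of the overall familial aggregation.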

The identification and characterization of the aetiological variants that underlie replicated associations will necessitate extensive fine-mapping and functional validation. We view the WTCCC study and data set as an important first step towards harnessing the powerful molecular genomic tools now available to dissect the biological basis of common disease and translating those findings into improvements in human health.

Methods summary

A detailed description of materials and methods is given in Methods. The workflow and organization of the project are given in Supplementary Fig. 16. Case series came from previously established collections with nationally representative recruitment: 2,000 samples were genotyped for each. The control samples came from two sources: half from the 1958 Birth Cohort and the remainder from a new UK Blood Service sample. The latter collection was established specifically for this study and is a UK national repository of anonymized DNA samples from 3,622 consenting blood donors. The vast majority of subjects were self-reported as of European Caucasian ancestry. All DNA samples were requantified and tested for degradation and PCR amplification. Genotyping was performed using GeneChip 500K arrays at the Affymetrix Services Lab (California): arrays not passing the 93% call rate threshold at P = 0.33 with the Dynamic Model algorithm were repeated. CEL (cell intensity) files were transferred to WTCCC for quantile normalization, and genotypes called using a new genotyping algorithm, CHIAMO, developed for this project. QC/QA measures included sample call rate, overall heterozygosity and evidence of non-European ancestry (809 samples excluded; 16,179 retained for analysis). SNPs were excluded from analysis because of missing data rates, departures from Hardy–Weinberg equilibrium and other metrics (31,011 excluded; 469,557 retained). Standard 1-d.f. and 2-d.f. tests of case-control association were supplemented with bayesian approaches, multilocus methods (data imputation) and analyses with combined data sets, either as additional cases (to detect variants influencing multiple phenotypes) or as an expanded reference group (to increase power). Results for each SNP for all analyses reported will be available from http://www.wtccc.org.uk, as will details allowing other researchers to apply for access to WTCCC genotype data. 
Software packages developed within the WTCCC are available on request (see Methods for details).

Online Methods

BD phenotype description

BD cases were all over the age of 16 yr, living in mainland UK and of European descent. Recruitment was undertaken throughout the UK by teams based in Aberdeen (8% of cases), Birmingham (35% of cases), Cardiff (33% of cases), London (15% of cases) and Newcastle (9% of cases). Individuals who had been in contact with mental health services were recruited if they suffered with a major mood disorder in which clinically significant episodes of elevated mood had occurred. This was defined as a lifetime diagnosis of a bipolar mood disorder according to Research Diagnostic Criteria119 and included the bipolar subtypes that have been shown in family studies to co-aggregate (for example, ref. 29): bipolar I disorder (71% of cases), schizoaffective disorder bipolar type (15% of cases), bipolar II disorder (9% of cases) and manic disorder (5% of cases). After providing written informed consent, all subjects were interviewed by a trained psychologist or psychiatrist using a semi-structured lifetime diagnostic psychiatric interview (in most cases the Schedules for Clinical Assessment in Neuropsychiatry120), and available psychiatric medical records were reviewed. Using all available data, best-estimate ratings were made for a set of key phenotypic measures on the basis of the OPCRIT checklist (which covers both psychopathology and course of illness)121,122 and lifetime psychiatric diagnoses were assigned according to the Research Diagnostic Criteria119. The reliability of these methods has been shown to be high119,123,124. Further details of clinical methodology can be found in Green, 2005 (ref. 123) and Green, 2006 (ref. 124).

CAD phenotype description

CAD cases had a validated history of either myocardial infarction or coronary revascularization (coronary artery bypass surgery or percutaneous coronary angioplasty) before their 66th birthday. Verification of the history of CAD was required either from hospital records or the primary care physician. Recruitment was carried out on a national basis in the UK through a direct approach to the public via (1) the media and (2) mailing all general practices (family physicians) with information about the study, as previously described125. In an initial pilot phase, potential participants were also identified and approached through local CAD databases in the two lead centres (Leeds and Leicester). Although the majority of subjects had at least one further sib also affected with premature CAD, only one subject from each family was included in the present study.

CD phenotype description

CD cases were attendees at inflammatory bowel disease clinics in and around the five centres which contributed samples to the WTCCC (Cambridge, Oxford, London, Newcastle, Edinburgh). Ascertainment was based on a confirmed diagnosis of Crohn’s disease (CD) using conventional endoscopic, radiological and histopathological criteria126. We included all sub-types of CD as classified by disease extent and behaviour and the collection was not specifically enriched for family history or early age of onset. The median age of diagnosis was 26.1 yr and 62% of the collection had undergone CD-related abdominal surgery. A small proportion had previously been recruited as members of multiply affected families but only one affected individual was included per family.

HT phenotype description

HT cases comprised severely hypertensive probands ascertained from families with multiplex affected sibships or as parent–offspring trios. They were of white British ancestry (up to level of grand-parents) and were recruited from the Medical Research Council General Practice Framework and other primary care practices in the UK77. Each case had a history of hypertension diagnosed before 60 yr of age, with confirmed blood pressure recordings corresponding to seated levels >150/100 mm Hg (if based on one reading), or the mean of 3 readings greater than 145/95 mm Hg. These criteria correspond to the threshold for the uppermost 5% of blood pressure distribution in a contemporaneous health screening survey of 5,000 British men and women in 1995 (N. Wald and M. Law, personal communication). We excluded hypertensive individuals who self-reportedly consumed >21 units of alcohol per week and those with diabetes, intrinsic renal disease, a history of secondary hypertension or co-existing illness. Cases did not undergo systematic genetic screening to exclude the (rare) known monogenic causes of HT. We focused on the recruitment of hypertensive individuals with body mass indices <30 kg m⁻². The probands were extensively phenotyped by trained nurses (see http://www.brightstudy.ac.uk for standard operating procedures, additional phenotypes and study questionnaires). Sample selection for WTCCC was based on DNA availability and quality.

RA phenotype description

RA cases were recruited to studies coordinated by the ARC (Arthritis Research Campaign) Epidemiology Unit. All subjects were Caucasian over the age of 18 yr and satisfied the 1987 American College of Rheumatology Criteria for RA127 modified for genetic studies128. Of the cases, 404 were recruited as part of the ARC National Repository of Family Material129: of these, 301 were probands from affected sibling pair families and 103 were cases from trio families, having both parents or one parent and one unaffected sibling available for study. A further 109 cases were recruited from the Norfolk Arthritis Register, a primary care-based inception collection130. All other cases (n = 1,348) were recruited from NHS Rheumatology Clinics throughout the UK. Samples for WTCCC were selected from the various studies on the basis of the quality and availability of DNA.

T1D phenotype description

T1D cases were recruited from paediatric and adult diabetes clinics at 150 National Health Service hospitals across mainland UK. The total T1D case data set (n = 8,000) from which the WTCCC cases were selected, represents close to half the T1D cases seen in such clinics. Nationwide coverage was achieved through the voluntary efforts of members of the British Society for Paediatric Endocrinology and Diabetes, who recruited about half of cases, the rest coming from peripatetic nurses employed by the JDRF/WT GRID project (http://www-gene.cimr.cam.ac.uk/todd/)131. To establish a positive diagnosis of T1D (and, in particular, to distinguish it from the more common, but later onset T2D), we required all cases to have an age of diagnosis below 17 yr and insulin dependence since diagnosis (with a minimum period of at least 6 months). However, a very few subjects were subsequently discovered to be suffering from rare monogenic disorders, such as maturity onset diabetes of the young (MODY), and latterly permanent neonatal diabetes (PNDM): these were excluded.

T2D phenotype description

The T2D cases were selected from UK Caucasian subjects who form part of the Diabetes UK Warren 2 repository. In each case, the diagnosis of diabetes was based on either current prescribed treatment with sulphonylureas, biguanides, other oral agents and/or insulin or, in the case of individuals treated with diet alone, historical or contemporary laboratory evidence of hyperglycaemia (as defined by the World Health Organization). Other forms of diabetes (for example, maturity-onset diabetes of the young, mitochondrial diabetes, and type 1 diabetes) were excluded by standard clinical criteria based on personal and family history. Criteria for excluding autoimmune diabetes included absence of first-degree relatives with T1D, an interval of ≥1 yr between diagnosis and institution of regular insulin therapy and negative testing for antibodies to glutamic acid decarboxylase (anti-GAD). Cases were limited to those who reported that all four grandparents had exclusively British and/or Irish origin, by both self-reported ethnicity and place of birth. All were diagnosed between age 25 and 75. Approximately 30% were explicitly recruited as part of multiplex sibships132 and 25% were offspring in parent–offspring ‘trios’ or ‘duos’ (that is, families comprising only one parent complemented by additional sibs)133. The remainder were recruited as isolated cases but these cases were (compared to population-based cases) of relatively early onset and had a high proportion of T2D parents and/or siblings134. Cases were ascertained across the UK but were centred around the main collection centres (Exeter, London, Newcastle, Norwich, Oxford). Selection of the samples typed in WTCCC from the larger collections was based primarily on DNA availability and success in passing Diabetes and Inflammation Laboratory (DIL)/Wellcome Trust Sanger Institute (WTSI) DNA quality control.

1958 Birth Cohort Controls (58C)

The 1958 Birth Cohort (also known as the National Child Development Study) includes all births in England, Wales and Scotland, during one week in 1958. From an original sample of over 17,000 births, survivors were followed up at ages 7, 11, 16, 23, 33 and 42 yr (http://www.cls.ioe.ac.uk/studies.asp?section=000100020003)135. In a biomedical examination at 44-45 yr136 (http://www.b58cgene.sgul.ac.uk/followup.php), 9,377 cohort members were visited at home providing 7,692 blood samples with consent for future Epstein–Barr virus (EBV)-transformed cell lines. DNA samples extracted from 1,500 cell lines of self-reported white ethnicity and representative of gender and each geographical region were selected for use as controls.

UK Blood Services Controls (UKBS)

The second set of common controls was made up of 1,500 individuals selected from a sample of blood donors recruited as part of the current project. WTCCC in collaboration with the UK Blood Services (NHSBT in England, SNBTS in Scotland and WBS in Wales) set up a UK national repository of anonymized samples of DNA and viable mononuclear cells from 3,622 consenting blood donors, age range 18–69 yr (ethical approval 05/Q0106/74). A set of 1,564 samples was selected from the 3,622 samples recruited based on sex and geographical region (to reproduce the distribution of the samples of the 1958 Birth Cohort) for use as common controls in the WTCCC study. DNA was extracted as described below with a yield of 3,054 ± 1,207 µg (mean ± 1 s.d.).

Protocol for DNA extraction

White blood cells were isolated from the filters by first pushing 10 ml air through the filter in contra direction to the initial blood flow through the filter, followed by 40 ml PBS, collecting into a 50 ml centrifuge tube, and centrifugation (2,000 r.p.m., 10 min, 20 °C). Cells were lysed by adding 40 ml Lysis buffer (320 mM Sucrose, 1% Triton-X-100, 4.9 mM MgCl₂, 1 mM TRIS-HCl pH 7.4) and pelleted by centrifugation (2,500 r.p.m., 15 min, 4 °C). Pellets were frozen before extraction. Pellets were digested overnight at 37 °C with 5.25 M GuHCl, 490 mM NH₄Ac, 1.25% Na Sarcosyl and 0.125 mg ml⁻¹ Proteinase K and then mixed with 2 ml chloroform to form a white emulsion. The aqueous layer was separated by centrifugation (2,500 r.p.m., 3 min) and DNA was precipitated in ethanol overnight at −20 °C. DNA was further precipitated by rotation (40 r.p.m., 5 min) and then pelleted by centrifugation (3,000 r.p.m., 15 min). Pellets were washed twice by rinsing with 2 ml 70% ethanol, followed by centrifugation (3,000 r.p.m., 5 min). DNA pellets were air-dried before re-suspension in TE buffer (10 mM Tris, 0.1 mM EDTA).

Sample handling

Each participating sample collection was issued unique WTCCC barcode labels and a spreadsheet with unique sample identifiers for logging information on case/control status, DNA concentration (requested at 100 ng μl⁻¹), DNA extraction method, sex, broad geographical region and age at recruitment. Each collection supplied 10 μg aliquots of anonymized samples in bar-coded, deep 96-well plates. On receipt, samples had their DNA concentration measured by Picogreen (triplicate measurements), were checked for DNA degradation on a 0.75% agarose gel, and were genotyped with up to 38 SNPs arranged in two multiplex reactions using the MassExtend (hME) and/or iPLEX37 assay. These SNPs served to provide a molecular fingerprint (25 of the 38 SNPs were present on the GeneChip 500K) and to confirm experimentally the sex of each sample.

Samples with concentrations ≥50 ng μl⁻¹, showing limited or no degradation, having a minimum of 7/10 (hME reaction) and/or 14/23 (iPLEX reaction) SNPs typed, and having the sex markers in agreement or not violating the supplied information were deemed fit for whole genome genotyping. Note that the hME set was replaced with a second iPLEX reaction in the course of the project to increase marker density. We selected 2,000 and 1,500 samples from each disease and control collection respectively. Selected samples were normalized to 50 ng μl⁻¹ and re-arrayed robotically into 96-well plates so that each plate was composed of 94 samples representing at least two different collections at a ratio of 1:1. For each collection, the selected samples were balanced first for sex and then geographical region (see above).

Genotyping

SNP genotyping was performed with the commercial release of the GeneChip 500K arrays at Affymetrix Services Lab. A modified version of the genotyping assay developed for the 100K Mapping Array137 was used. In brief, two aliquots of 250 ng of DNA each are digested with NspI and StyI, respectively, an adaptor is ligated and molecules are then fragmented and labelled. At this stage each enzyme preparation is hybridized to the corresponding SNP array (262,000 and 238,000 SNPs on the NspI and StyI arrays respectively). Samples were processed in 96-well plate format up to the hybridization step, with each plate carrying a positive and a negative control. Individual arrays not passing the 93% call rate threshold at P = 0.33 with the Dynamic Model algorithm138 were repeated (fresh aliquot of initial end-labelled reaction). Samples failing twice at the hybridization stage were reprocessed using a fresh DNA aliquot. Affymetrix delivered as successful those samples with a Dynamic Model call rate of at least 93% at P = 0.33 for each array, over 90% concordance for the 50 SNPs that are common to the two arrays, agreement between the two arrays on gender, and over 70% identity to the Sequenom genotypes supplied by WTCCC.

CEL files provided the intensities of the various probes on each chip. Initially, genotypes were called with the Dynamic Model138 algorithm. Affymetrix subsequently developed an improved algorithm, BRLMM (Bayesian Robust Linear Model with Mahalanobis distance classifier139,140). This processes batches of samples and uses clustering techniques to call genotypes (the ‘mismatch’ probe intensities are not used). In Affymetrix’s standard protocol it is applied in batches of 96 samples (plates). This is, of course, a very small sample size and, for some SNPs, some clusters will contain few, if any, observations. This might be countered by combining information about cluster location over a large number of SNPs.

Throughout, physical coordinates refer to NCBI build-35 of the human genome. Alleles are expressed in the forward (+) strand of the reference human genome (NCBI build-35).

Power calculations

We assessed power of the Affymetrix 500K chip using the following simulation experiment. Separately for each SNP with MAF >5% in the 10 HapMap ENCODE regions, we assumed the SNP was causative and simulated genotype data at all SNPs in the same region as the putative disease SNP in case-control panels of 2,000 cases and 3,000 controls with linkage disequilibrium patterns that match those in HapMap. For controls, these simulations were based on the imputation algorithm described below (with all genotype data initially set to missing in the 3,000 control individuals). For cases, the assumed effect size was first used to calculate genotype frequencies in cases (via Bayes’ theorem), and genotypes in cases at the putative SNP were then simulated independently from these calculated frequencies. Genotypes at all other SNPs in the region in cases were then simulated using the imputation algorithm described below (with all data other than the genotypes at the causative SNP initially set to missing in the cases). For each such simulated case-control panel, trend tests were performed at each of the SNPs in the region that are actually on the Affymetrix chip, and if any of these reached the stated P-value threshold the putative disease SNP was deemed to be detected, and otherwise to be undetected. Power estimates are then calculated as the proportion of putative disease SNPs with MAFs >5% across the HapMap ENCODE regions that are detected at the given P-value threshold. There are various approximations here. Actual numbers of cases and controls for each disease are slightly smaller than the 2,000:3,000 values used in the simulations, but in the other direction, our simulations ignore the possibility that a disease SNP might be detected by a genotyped SNP outside its ENCODE region. The accuracy reported below of the imputation algorithm in imputing genotypes leads us to believe these simulations should be a reasonable proxy for real data. Some such simulation is needed if power calculations are to take account of the fact that any given putative disease SNP could typically be detected by several SNPs on the chip. Exploitation of this simulation approach to assess power across different platforms and SNP chips and for different experimental designs will be reported elsewhere.
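The Bayes' theorem step for the cases can be illustrated as follows. This minimal sketch assumes Hardy–Weinberg genotype frequencies in the general population and expresses the effect size as genotype relative risks; both choices, and the function names, are assumptions made for illustration and are not taken from the study's own software. It covers only the putative disease SNP, not the imputation-based simulation of neighbouring SNPs.

```python
import random


def case_genotype_frequencies(risk_allele_freq, rr_het, rr_hom):
    """Genotype frequencies among cases, P(genotype | case).

    By Bayes' theorem this is proportional to
    P(genotype) * P(case | genotype); the common baseline risk cancels,
    so relative risks (1, rr_het, rr_hom) are sufficient.
    """
    p = risk_allele_freq
    q = 1.0 - p
    population = (q * q, 2 * p * q, p * p)  # Hardy-Weinberg assumption
    relative_risks = (1.0, rr_het, rr_hom)
    weights = [f * r for f, r in zip(population, relative_risks)]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_case_genotypes(frequencies, n_cases, seed=0):
    """Draw independent case genotypes (0, 1, 2 risk alleles)."""
    rng = random.Random(seed)
    return rng.choices((0, 1, 2), weights=frequencies, k=n_cases)
```

For example, `simulate_case_genotypes(case_genotype_frequencies(0.2, 1.3, 1.69), 2000)` draws genotypes for 2,000 cases at a hypothetical SNP with a multiplicative per-allele relative risk of 1.3.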

CHIAMO

We developed a new genotype calling algorithm, CHIAMO, which is applied after quantile normalization of the data from each sample. A complete description is given in Supplementary Information. We briefly summarize some features here. Normalized intensities for each genotype were mapped to a two-dimensional intensity vector and then we applied CHIAMO, which uses a bayesian hierarchical 4-class mixture model to call genotypes for the whole project. We used optimization based on 12 random starts to find the set of parameters that maximize the posterior distribution of the model. This parameter set was used to calculate the maximum a posteriori estimates of the probabilities of each genotype call $Z_{ij}$, where $Z_{ij} \in \{0, 1, 2, 3\} \equiv \{\mathrm{AA}, \mathrm{AB}, \mathrm{BB}, \mathrm{null}\}$ is the genotype call for individual j in collection i. All CHIAMO genotype calls analysed in this paper were based on an a posteriori probability threshold of 0.9 for making a call, following our analysis of the relationship between concordance and missing data rates (data not shown). CHIAMO differs from BRLMM in several respects: (1) it uses a different transformation of the CEL files to give the two-dimensional summary for each individual at an SNP, leading to better defined clusters; (2) it makes use of mismatch probe signals; (3) it uses a different method for fitting the clusters; and (4) it allows the data for all samples to be called simultaneously, thus allowing better estimation of cluster location and shape parameters, while making allowance for possible differences in these parameter values between case/control groups that could arise as a result of differences in DNA quality. This is achieved using a hierarchical statistical model that specifies the joint distribution of the three cluster centres, their spread, and likely allele frequencies (using HapMap) and genotype frequencies (centred on Hardy–Weinberg proportions but allowing some variation).
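The final calling step, in which a genotype is reported only when its posterior probability reaches 0.9, can be sketched as below. This shows the thresholding rule alone and says nothing about how CHIAMO fits its mixture model; the function name, the return convention (`None` for a missing call) and the handling of the null class as an ordinary class are assumptions.

```python
GENOTYPE_CLASSES = ("AA", "AB", "BB", "null")


def call_genotype(posteriors, threshold=0.9):
    """Return the most probable class if its posterior meets the threshold.

    posteriors holds the four class probabilities in the order
    AA, AB, BB, null. A return value of None marks a missing call.
    """
    if len(posteriors) != len(GENOTYPE_CLASSES):
        raise ValueError("expected one probability per genotype class")
    best = max(range(len(posteriors)), key=lambda k: posteriors[k])
    if posteriors[best] >= threshold:
        return GENOTYPE_CLASSES[best]
    return None
```

A SNP with overlapping clusters yields many posterior vectors with no class above the threshold, which is consistent with the use of per-SNP missingness as a quality indicator later in the Methods.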

CHIAMO improved both call rate and accuracy in comparison to BRLMM, the current standard Affymetrix calling algorithm (Supplementary Table 3)—it roughly halved missing data rates and discordance rates with another platform. See Supplementary Information for full details, discussion of some challenges for genotype calling, and example cluster plots (Supplementary Figs 10 and 17).

Quantile-quantile plots

Quantile-quantile (Q-Q) plots are constructed by ranking a set of values of a statistic from smallest to largest (the ‘order statistics’) and plotting them against their expected values, given the assumption that the values have been sampled from a distribution of known theoretical form (in our case, the chi-squared distribution, usually on one degree of freedom; for example, the distribution of our trend tests under the null hypothesis). Deviations from the line of equality indicate either that the theoretical distribution is incorrect, or that the sample is contaminated with values generated in some other manner (for example, by a true association). To aid interpretation of such plots we have also calculated 95% ‘concentration bands’ (shaded grey in all Q-Q plots). These are formed by calculating, for each order statistic, the 2.5th and 97.5th centiles of the distribution of the order statistic under random sampling and the null hypothesis (for details see ref. 141). We should add two notes of caution. First, concentration bands are calculated point by point and, although there are very strong correlations between nearby order statistics, the probability that a real quantile-quantile plot will stray outside the concentration band at some point is somewhat larger than 5%. Second, the theoretical chi-squared distribution is an approximation, valid for large samples; it is not clear whether this approximation continues to hold into the extreme right hand tail of the distribution explored in a GWA study (although the indications are that it is probably not far wrong for a study as large as ours).
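A pointwise concentration band of this kind can be computed with the Python standard library alone, as in the sketch below. It relies on two standard facts: the i-th smallest of n independent uniform P values is at most x exactly when at least i of them fall below x (a binomial tail), and a chi-squared quantile on one degree of freedom is the square of a normal quantile. The plotting position (i − 0.5)/n used for the expected value is an assumption; the method of ref. 141 may differ in detail.

```python
import math
from statistics import NormalDist


def order_statistic_cdf(x, i, n):
    """P(i-th smallest of n uniform(0, 1) draws <= x)."""
    return sum(math.comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(i, n + 1))


def order_statistic_quantile(prob, i, n, iterations=60):
    """Invert order_statistic_cdf by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if order_statistic_cdf(mid, i, n) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def chi2_1df_quantile(u):
    """Quantile of the chi-squared distribution on 1 d.f. at probability u."""
    return NormalDist().inv_cdf((1 + u) / 2) ** 2


def concentration_band(n, level=0.95):
    """List of (expected, lower, upper) for each of n order statistics."""
    tail = (1 - level) / 2
    band = []
    for i in range(1, n + 1):
        expected = chi2_1df_quantile((i - 0.5) / n)
        lower = chi2_1df_quantile(order_statistic_quantile(tail, i, n))
        upper = chi2_1df_quantile(order_statistic_quantile(1 - tail, i, n))
        band.append((expected, lower, upper))
    return band
```

The direct binomial sum is adequate for a few hundred points; for the hundreds of thousands of SNPs in a genome-wide scan a numerical beta-quantile routine would be the practical choice.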

Data quality control

Of samples for which Affymetrix returned CEL files, a total of 809 were excluded from the analysis. A complete breakdown by collection is given in Supplementary Table 4. Missing data rate per sample acts as an indicator of low DNA quality. Most samples had very low rates of missing data (study-wide average 0.00925, standard deviation 0.0187) and we chose to exclude 250 samples with >3% missing data across all SNPs (Supplementary Fig. 18, and Supplementary Tables 4 and 13). We also set empirical thresholds on genome-wide heterozygosity (excess heterozygosity in particular may indicate contamination). Six samples with >30% heterozygosity and a further three with <23% heterozygosity were excluded (see Supplementary Fig. 18). We excluded 16 samples with discrepancies between WTCCC information and external identifying information (such as genotypes from another experiment, blood type or incorrect disease status). We sought to detect individuals with non-Caucasian ancestry using multi-dimensional scaling to provide a two-dimensional projection of the data whose axes represent geographic genetic variation. In the interest of computational efficiency and to avoid confounding of the multi-dimensional scaling by extended linkage disequilibrium we thinned the data to a set of 71,458 SNPs, within which no pair was correlated with r² > 0.2. For this set of nearly independent SNPs we computed genome-wide average identity by state (sum of the number of identical-by-state alleles at each locus divided by twice the number of loci) between each pair of individuals in each sample collection along with the 270 HapMap samples. We converted these identity-by-state relationships to distances by subtracting them from 1, and the matrix of pairwise identity-by-state values was used as input to multi-dimensional scaling. The projection onto the two multi-dimensional scaling axes is shown in Supplementary Fig. 5. We excluded 153 samples that were clearly separate from the main cluster of WTCCC individuals. Exclusion of these individuals resulted in a substantial reduction in estimates of over-dispersion in test statistic distributions (data not shown). We also excluded 295 duplicated (>99% identity) and 86 related (86–98% identity) samples from the analysis.
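The identity-by-state calculation described above can be sketched as follows. This is an illustrative Python fragment, not the consortium's code: genotypes are assumed to be coded 0, 1 or 2 (count of one allele), missing calls are assumed to be coded `None` and skipped, and the function names are hypothetical.

```python
def ibs_similarity(g1, g2):
    """Average identity by state: shared alleles summed over loci / (2 * number of loci)."""
    shared = 0
    loci = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:  # assumption: loci with a missing call are skipped
            continue
        shared += 2 - abs(a - b)  # 2, 1 or 0 alleles identical by state at this locus
        loci += 1
    if loci == 0:
        raise ValueError("no loci with both genotypes observed")
    return shared / (2 * loci)


def ibs_distance_matrix(genotypes):
    """Pairwise distances (1 - IBS) between individuals, as a list of lists."""
    n = len(genotypes)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - ibs_similarity(genotypes[i], genotypes[j])
            dist[i][j] = dist[j][i] = d
    return dist
```

A matrix of this kind would then be passed to a multi-dimensional scaling routine; duplicated samples show up as pairs with similarity close to 1.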

Filtering out suboptimal markers depends on both the platform and the genotype calling algorithm. We experimented with various quality metrics for CHIAMO calls, for example, based on the location and/or separation of the clusters, but found that the best indicator of a SNP being difficult to call was the amount of missing data in its calls: CHIAMO consistently marked many individuals missing for SNPs with poorly defined or overlapping clusters, whereas it successfully called genotypes for nearly all individuals on high-quality SNPs (data not shown). We excluded 26,567 SNPs with a study-wide missing data rate >5% (Supplementary Fig. 19), or >1% for SNPs with a study-wide MAF < 5%. We additionally excluded 4,351 SNPs with Hardy–Weinberg exact P value < 5.7 × 10⁻⁷ in the combined set of 2,938 controls, and 93 SNPs with P value < 5.7 × 10⁻⁷ for either a one- or two-degree of freedom test of association between the two control groups (corresponding to a 1 d.f. chi-squared statistic of about 25). See Supplementary Fig. 20 and Fig. 1 respectively for the empirical distributions of these statistics used to motivate the thresholds above.
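As a small illustration of the thresholds above, the following Python sketch encodes the missing-data rule and checks that a 1 d.f. chi-squared statistic of 25 corresponds to a P value of about 5.7 × 10⁻⁷. The function names are hypothetical; the Hardy–Weinberg exact test itself is not reproduced here.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a 1 d.f. chi-squared statistic."""
    return math.erfc(math.sqrt(x / 2.0))


def snp_passes_missingness(missing_rate, maf):
    """Keep a SNP unless missing data exceed 5%, or 1% when the MAF is below 5%."""
    limit = 0.01 if maf < 0.05 else 0.05
    return missing_rate <= limit


# A statistic of 25 on 1 d.f. gives roughly the stated P-value threshold.
threshold_p = chi2_sf_1df(25.0)
```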

Overall, we found that the 809 excluded individuals (which represent 4.8% of the study samples) accounted for 35.6% of the missing data at non-excluded SNPs. In total, 469,557 SNPs passed the quality control filters.

Supplementary Fig. 20 shows the effect of quality control filters, and visual inspection of the cluster plots of SNPs showing apparently strong association, on quantile-quantile plots for one disease (T2D, others are similar), and the success of these filters in excluding poorly performing SNPs. The figure (panel d) also shows the marked effect on the tails of the distribution of test statistics of regions of genuine association (for this disease the three regions removed because of strong evidence of association have all been independently replicated, see main text). The aim in filtering is to exclude poor SNPs but without removing genuine associations. No single criterion will do this. In order not to exclude possible genuine associations, we chose to apply relatively light quality control filters but then to subject all apparently associated SNPs to visual inspection of cluster plots (see Supplementary Information). Around 100 cluster plots were assessed per disease.

We used X-chromosome SNPs to check for sex discrepancies with the sample files (Supplementary Fig. 21). These were fed back to disease groups for amendment and verification. The 80 samples where it was not possible to discern the source of the discrepancy were left in the study for analysis, on the grounds that mishandling was considered unlikely to have introduced samples with altogether different phenotypes.

Differences in DNA quality between cases and controls could result in false-positive associations through differential effects on genotype calling111. DNAs in our study came from various sources between, and in some cases within, case and control series, but with the combination of centralized sample quality control, simultaneous genotype calling with CHIAMO (which explicitly allows for differences between collections), and inspection of cluster plots for SNPs with very small P values, our study did not experience such difficulties.

Comparing linkage disequilibrium

Two questions which have been raised about the HapMap data are how well it describes linkage disequilibrium in populations other than the ones that were sampled, and whether the sample sizes in HapMap (60 Caucasian individuals, for example) are adequate to describe patterns of linkage disequilibrium. With data on 2,938 controls and 16,179 individuals in total at around 400,000 polymorphic SNPs, we are well placed to address these questions for the British population. Initial analyses suggest that patterns of linkage disequilibrium in our samples are very similar to those in HapMap. As an example, Supplementary Fig. 3 compares patterns of linkage disequilibrium in HapMap CEU individuals and our 58C sample at SNPs on the Affymetrix chip across 22 × 1 Mb regions of the genome, and they seem almost identical. We calculated r² values directly from the phased haplotypes available in HapMap, but using unphased genotype data from our study. Note that visual representations of linkage disequilibrium in this form can be very sensitive to SNP density, so comparisons across regions are difficult without correction for SNP density, and direct comparison of linkage disequilibrium patterns at all HapMap SNPs with those at the subset of SNPs on the Affymetrix 500K chip is not straightforward.

Geographical variation and population structure

Principal component analysis was performed as a two-stage process: we formed a matrix of estimated correlations (formally, the inner product measure of similarity) between all pairs of individuals, and then computed the eigenvectors and eigenvalues of that matrix. We estimated the correlation between two individuals as described in ref. 14. We identified components that reflected genome-wide structure in two ways. First, we created two subsets of the data containing SNPs from the odd- and even-numbered chromosomes, repeated the PCA on each of these, and inspected scatter plots of pairs of components between the two subsets of the data. A component which is due to a region of linkage disequilibrium on a chromosome (as opposed to genome-wide structure) will appear only when analysing the data set containing SNPs from that chromosome. Second, we computed the score of every SNP on the components. For a component that is due to a region of linkage disequilibrium, there will be a spike of high SNP scores only in that region. To minimize the contribution from regions of extensive strong linkage disequilibrium, the correlation estimates were based on a subset of 197,175 SNPs that were spaced at least 0.001 cM apart (HapMap estimates) and specifically excluded the MHC region.

To assess the level of over-dispersion in each collection we first created a very clean set of data to ameliorate the effects of over-dispersion due to calling problems and missing data. In addition to the main filters described above, we filtered out all SNPs that had a clear genotype-calling problem revealed by visual inspection, SNPs with a study-wide missing data rate >1% and SNPs with study-wide minor allele frequency <1%. Around 360,000 SNPs passed these filters. Estimates of λ were calculated using an estimator based on the median test statistic15. Estimates of λ were also calculated from tests that conditioned on the scores for each individual along the two estimated principal components described above. The tests (1 d.f. and 2 d.f.) were carried out by including the scores as additional covariates in a logistic regression model fit.
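A median-based estimate of λ can be sketched as below. This is an assumed, commonly used form of such an estimator (the median observed statistic divided by the median of the null 1 d.f. chi-squared distribution) and may differ in detail from the estimator of ref. 15; the function name is hypothetical.

```python
from statistics import NormalDist, median


def genomic_inflation(test_statistics):
    """Estimate lambda as median(observed 1 d.f. statistics) / median(null chi-squared)."""
    # The median of a 1 d.f. chi-squared variate is the squared 75th centile
    # of a standard normal, approximately 0.455.
    null_median = NormalDist().inv_cdf(0.75) ** 2
    return median(test_statistics) / null_median
```

Values of λ near 1 indicate little over-dispersion; values above 1 indicate inflation of the test statistics, for example from population structure or genotype-calling artefacts.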

Bayes factors

The box in the main text makes the point that understanding the strength of evidence conveyed by a particular P value also requires knowledge of power. In contrast, the Bayes factor (BF) provides a single measure of the strength of the evidence for an association, and we report these in addition to P values (Supplementary Table 14). As for power, calculation of Bayes factors requires assumptions about effect sizes. The assumptions underlying our calculations are given below and in Supplementary Information.

There is broad agreement between the way in which P values and our Bayes factors rank SNPs, except for SNPs with low MAFs (Supplementary Fig. 22). This is intuitive: unless one believed, a priori, that rare causative SNPs have substantially larger effect sizes, there will be reduced power for these SNPs and hence weaker evidence for association than for common SNPs with the same P value.

One perspective on GWAs is that in practice they will be used to prioritize SNPs for further study or additional typing. In addition to BFs providing a single quantity that can be directly compared between SNPs, it is also straightforward for investigators to give different a priori weights to different classes of SNPs, such as non-synonymous (ns)SNPs, genic SNPs, SNPs in highly conserved regions, or SNPs in linkage disequilibrium with many (or few) other SNPs.

We now describe calculation of the Bayes factors. We use M0 to denote a model of no association, M1 for a model with an additive effect on the log-odds scale and M2 for a general 3 parameter model of association. At each SNP we calculate two Bayes factors: one for the additive model versus the null model, BF1, and one for the general model versus the null model, BF2. That is,

$$\mathrm{BF}_1 = \frac{P(\text{Data} \mid M_1)}{P(\text{Data} \mid M_0)}, \qquad \mathrm{BF}_2 = \frac{P(\text{Data} \mid M_2)}{P(\text{Data} \mid M_0)},$$

where $P(\text{Data} \mid M_k) = \int P(\text{Data} \mid \theta, M_k)\, P(\theta \mid M_k)\, d\theta$, where θ denotes the parameters for the model. For all 3 models we use a logistic regression model for the likelihood where the log-odds for individual i is equal to μ for model M0, $\mu + \gamma G_i$ for model M1 and $\mu + \gamma G_i + \varphi I(G_i = 2)$ for model M2. $G_i$ is the genotype (coded 0, 1 and 2) for individual i and $I(G_i = m)$ is the indicator function that individual i has the genotype coded as m. For each model we choose the priors on the parameters, $P(\theta \mid M_k)$, to reflect our belief about the likely effect sizes underlying complex trait loci.

The parameter γ in models M1 and M2 is the increase in log-odds of disease for every copy of the allele coded as 1, and $e^{\gamma}$ is the additive model odds ratio. For both models we use a N(0, 0.2) prior on γ. This prior puts probability 0.31 on odds ratios above 1.2 or below 0.8, and probability 0.02 on odds ratios above 1.5 or below 0.5. The parameter μ in all three models represents the baseline log-odds of disease. In a case-control design the numbers of cases in the sample have been elevated artificially, which will have a large effect on likely values of μ. Our prior beliefs about the baseline risk of disease must take this into account. For all three models we have used a N(0, 1) prior for μ and have found that the resulting Bayes factors are relatively insensitive to choice of priors for this parameter as long as the same prior is used for the two models being compared. The parameter φ in model M2 represents a recessive effect over and above an additive effect. We use a N(1, 1) prior for φ. Combined with the prior on γ, this results in a prior probability of 0.25 on the odds ratios above 1.5 and below 0.5 for the genotype coded as 2. In addition, we note that the evaluation of the Bayes factors will depend on the way the alleles at the SNP have been coded 0 and 1. To account for this we average over the two possible codings of each SNP with equal weight. A fuller description of the priors used can be found in Supplementary Information.
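The prior probabilities quoted for γ can be checked with a short Python calculation. The sketch assumes that the second argument of N(0, 0.2) is a standard deviation rather than a variance (an interpretation under which the quoted figures of 0.31 and 0.02 are reproduced); the function name is hypothetical.

```python
from math import log
from statistics import NormalDist


def prior_prob_outside(or_low, or_high, sd, mean=0.0):
    """Prior probability that the odds ratio exp(gamma) lies below or_low or above or_high."""
    prior = NormalDist(mean, sd)  # prior on the log-odds-ratio gamma
    return prior.cdf(log(or_low)) + (1.0 - prior.cdf(log(or_high)))


p_moderate = prior_prob_outside(0.8, 1.2, sd=0.2)  # about 0.31
p_large = prior_prob_outside(0.5, 1.5, sd=0.2)     # about 0.02
```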

Sex-differentiated tests

We examined the possibility of differential genetic effects in males and females by reapplying the two single-locus analyses (trend test and genotypic test) separately in males and females and combining the results (simply adding the chi-squared statistics for the male and female analyses, and comparing with the 2 d.f. or 4 d.f. null hypothesis; results are shown in Supplementary Table 15). We refer to this as a sex-differentiated test. This test is sensitive to association that is of a different magnitude and/or direction in the two sexes, although it is less powerful than the simple test when the effect size does not vary with sex.
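The combination step of the sex-differentiated test can be sketched as follows, assuming the sex-specific chi-squared statistics have already been computed. The closed-form tail probabilities used here hold for 2 and 4 degrees of freedom only; the function names are hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-squared statistic on 2 or 4 d.f."""
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 4:
        return math.exp(-x / 2.0) * (1.0 + x / 2.0)
    raise ValueError("only 2 or 4 degrees of freedom are supported in this sketch")


def sex_differentiated_test(stat_male, stat_female, df_per_sex):
    """Add the male and female statistics and refer the sum to 2 * df_per_sex d.f."""
    combined = stat_male + stat_female
    return combined, chi2_sf_even_df(combined, 2 * df_per_sex)
```

With `df_per_sex=1` this corresponds to combining two trend tests (2 d.f. overall), and with `df_per_sex=2` to combining two genotypic tests (4 d.f. overall).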

X Chromosome analysis

For several reasons the X chromosome needs to be treated differently from the autosomes (note that the Affymetrix chip used does not assay the Y chromosome). First, sample sizes and hence power are different from the autosomes (only one copy of X in males). Also, because the effective population size on the X chromosome is smaller than that of the autosomes, linkage disequilibrium extends further. And unlike the autosomes, there are choices in how to implement even single locus analyses: these relate to the relative weight to be given to males and females in comparisons between cases and controls.

For autosomal SNPs, the 1 d.f. trend test statistic is calculated by dividing the square of the difference between means of the SNP genotypes (scored 0, 1, 2) between cases and controls by an estimate of its variance. The variance estimate used is an empirical estimate that does not assume Hardy–Weinberg equilibrium. The numerator can also be represented as the squared difference in allele frequencies between cases and controls, as in the allele counting test. At first sight, a natural generalization of this test to deal with SNPs on the X chromosome would involve comparing allele frequencies, by allele counting, but using a variance estimate which does not assume Hardy–Weinberg equilibrium in females. However, we took the view that, because most loci on the X chromosome are subject to X chromosome inactivation, it is more logical to treat males as if they were homozygous females. Thus we score female genotypes 0, 1 or 2 and male genotypes 0 or 2, comparing mean scores of cases and controls as before. The variance estimate allows for the different variance of male and female contributions and does not assume Hardy–Weinberg equilibrium in females.

A stratified version of the test is constructed using the same principles by which the trend test is extended to the Mantel extension test; a score that contrasts cases and controls is computed for each stratum together with its variance; these are then summed over strata. The final test is the squared total score divided by the total variance. To extend these tests to a 2 d.f. test, we add a score that compares heterozygosity between cases and controls. Clearly, only females contribute to this component. Results of these analyses of X chromosome SNPs are shown in Supplementary Table 16.
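The stratified 1 d.f. statistic described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in the study: each stratum's variance is taken to be the permutation (finite-population) variance of the case score sum, which does not assume Hardy–Weinberg equilibrium, and treating the sexes as separate strata for X chromosome SNPs is one possible choice. Function names are hypothetical.

```python
def x_chromosome_score(allele_count, is_male):
    """Score females 0, 1 or 2; score males (allele count 0 or 1) as 0 or 2."""
    return 2 * allele_count if is_male else allele_count


def stratum_score_and_variance(scores, is_case):
    """Score contrasting cases with controls in one stratum, and its null variance."""
    n = len(scores)
    n_cases = sum(1 for c in is_case if c)
    mean = sum(scores) / n
    # Sum over cases of the deviation of the genotype score from the stratum mean.
    score = sum(s - mean for s, c in zip(scores, is_case) if c)
    sum_sq = sum((s - mean) ** 2 for s in scores)
    # Variance of the score when case labels are permuted within the stratum.
    variance = n_cases * (n - n_cases) * sum_sq / (n * (n - 1))
    return score, variance


def stratified_trend_statistic(strata):
    """Squared total score divided by total variance, summed over strata (1 d.f.)."""
    total_score = 0.0
    total_variance = 0.0
    for scores, is_case in strata:
        score, variance = stratum_score_and_variance(scores, is_case)
        total_score += score
        total_variance += variance
    return total_score**2 / total_variance
```

With a single stratum the statistic reduces to an unstratified trend test; passing one `(scores, is_case)` pair per stratum gives the stratified version.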

Multilocus analysis

We use (1) the genotype data of this study, (2) the HapMap data, and (3) a population genetics model, to simulate genotypes at the HapMap SNPs that are not on the Affymetrix 500K chip. Informally, we determine which haplotypes are present in each individual in a region, and then use HapMap to ‘fill in’ these haplotypes at untyped SNPs (see below for details). These ‘in silico’ genotypes are then tested for association with the disease as before. This powerful multilocus tool for association studies143 has the advantage of using information from all markers in linkage disequilibrium with an untyped SNP, but in a way that decreases with genetic distance. Our imputation method was applied to individuals passing project filters, and used markers which passed the project filters and in addition had MAF > 1%. As a validation we compared our imputed genotypes for 58C individuals with genotypes obtained on an Illumina platform for 10,180 SNPs that are polymorphic in CEU HapMap samples. At these SNPs, for imputed genotypes with posterior call probabilities above 0.95, there was 98.4% agreement with the Illumina genotypes.

In our association analyses we imputed genotypes at 2,139,483 HapMap SNPs, and tested these for association with each disease using the trend test or the genotypic test. We included the results from imputed SNPs in the signal plots (Fig. 5) because they are useful in (1) assessing signal strength within a region; (2) providing a wider range of SNPs for follow up; and (3) indicating possible locations for the causal variant. For example in the case of TCF7L2 in T2D, there is a substantially stronger signal from rs7903146 than for any of the typed SNPs (see also Supplementary Fig. 12).

To be conservative, stringent quality control filters were applied to genomic regions where imputed SNPs (but not genotyped SNPs) were responsible for a strong signal for association. These were as follows: (1) any such region was required to contain more than one imputed SNP showing the required level of association with a MAF > 2% and posterior probability for imputed genotypes averaged across the SNP >0.95 (empirical studies showed imputation at low MAF SNPs more prone to error); (2) all cluster plots for genotyped SNPs within 0.3 cM (from HapMap Phase II estimated recombination rates) were checked and where there was evidence of any mis-calling the region was rejected (the major problem with imputation arises around SNPs with genotype calling errors); and (3) if there was no genotyped SNP with a P value < 10⁻⁴ for association on either trend or genotypic test, the region was rejected. Note that accuracy of imputation with these filters applied will be higher than the figure of 98.4% reported above.

We use H = {H1,…, HN} to denote a set of N known haplotypes where Hi = {Hi1,…, HiL} is an individual haplotype and L is the number of SNP loci. In practice, we set H to be the 120 CEU haplotypes estimated as part of the HapMap project owing to the expected similarity in haplotype structure between the CEU and UK populations. We let G = {G1,…, GK} denote the genotype data on the K individuals in the study where Gi = {Gi1,…, GiL} and Gij ∈ {0, 1, 2, missing}. In this setting, the majority of SNPs will have entirely missing genotypes, because the Affymetrix 500K chip has approximately 1/6th of the number of SNPs in the Phase II HapMap. The missing genotypes are imputed by modelling the distribution of each individual’s genotype vector Gi conditional on the known set of haplotypes H, $P(G_i \mid H)$. Our model for each individual’s genotype vector is a Hidden Markov Model in which the hidden states are a sequence of pairs of the N known haplotypes in the set H. That is,

$$P(G_i \mid H) = \sum_{Z_i^{(1)},\, Z_i^{(2)}} P(G_i \mid Z_i^{(1)}, Z_i^{(2)}, H)\, P(Z_i^{(1)}, Z_i^{(2)} \mid H),$$

where $Z_i^{(1)} = \{Z_{i1}^{(1)}, \ldots, Z_{iL}^{(1)}\}$ and $Z_i^{(2)} = \{Z_{i1}^{(2)}, \ldots, Z_{iL}^{(2)}\}$ are the two sequences of copying states at the L sites and $Z_{ij}^{(k)} \in \{1, \ldots, N\}$. Here, $P(Z_i^{(1)}, Z_i^{(2)} \mid H)$ defines our prior probability on how the sequences of copying states change along the sequence and $P(G_i \mid Z_i^{(1)}, Z_i^{(2)}, H)$ models how the observed genotypes will be close to but not exactly the same as the haplotypes being copied. The precise form of these terms (described in ref. 142) is based on an approximate population genetics model that makes direct use of the recently estimated fine-scale recombination map across the genome142,143. At each of the missing genotypes in the study, we use this model to calculate probabilities for the three possible genotypes. At each imputed SNP, we used these probabilities to calculate the 2×3 table of expected genotype counts for cases and controls and used these counts to carry out a standard test of association.
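The final step, turning per-individual genotype probabilities into a 2×3 table of expected counts and testing it, can be sketched as below. The Hidden Markov Model itself is not reproduced. The sketch assumes each individual at a SNP is represented by a tuple of three probabilities (for genotypes 0, 1 and 2) and uses a Pearson chi-squared statistic on 2 d.f. as one example of a standard test; the function names are hypothetical.

```python
import math


def expected_genotype_counts(genotype_probs):
    """Sum per-individual probabilities (p0, p1, p2) into expected genotype counts."""
    counts = [0.0, 0.0, 0.0]
    for probs in genotype_probs:
        for g in range(3):
            counts[g] += probs[g]
    return counts


def genotypic_test(case_probs, control_probs):
    """Pearson chi-squared statistic and 2 d.f. P value for the 2x3 expected-count table."""
    table = [expected_genotype_counts(case_probs),
             expected_genotype_counts(control_probs)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][g] + table[1][g] for g in range(3)]
    total = sum(row_totals)
    stat = 0.0
    for r in range(2):
        for g in range(3):
            fitted = row_totals[r] * col_totals[g] / total
            if fitted > 0:  # skip genotype classes that are empty in both groups
                stat += (table[r][g] - fitted) ** 2 / fitted
    p_value = math.exp(-stat / 2.0)  # chi-squared upper tail on 2 d.f.
    return stat, p_value
```

Using expected counts in this way carries some of the imputation uncertainty into the test, because individuals with ambiguous imputed genotypes contribute fractionally to more than one genotype class.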

Disease models

To test for deviations from additivity (in log-odds) at a locus we fit a logistic regression model using the function glm in the statistical software R (http://www.r-project.org/). For each region we considered the most significant SNP and compared an additive model to a general 2-d.f. model by fitting a model with an additive sub-model nested in a general model. The additive effect was modelled by a variable encoded 0, 1, or 2 for the effect at the three genotypes and a second term for a general model was included by a variable encoded 1 for heterozygotes and 0 otherwise. We rejected an additive model if the second term was significant and then compared a dominant or recessive model to a general model. For the pairwise interaction analysis, we fixed the marginal model at each locus on the basis of the single locus analysis. We compared the two locus model with these marginals and no interaction terms with a larger model including interactions. This larger interaction model has 1, 2, or 4 additional parameters depending on whether both marginal models are additive, one is additive and one general, or both general.
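The comparison of an additive model with a general 2 d.f. model can be illustrated with the following Python sketch, written from genotype counts rather than with R's glm, which is what the analysis above used. It assumes all six cells of the 2×3 table are non-zero, fits the additive model by Newton-Raphson, uses the fact that the general model is saturated in genotype (so its fitted risks equal the observed case proportions), and uses a likelihood ratio test where the text refers to significance of the second term. Function names are hypothetical.

```python
import math


def log_likelihood(cases, controls, risks):
    """Binomial log-likelihood of case/control counts given per-genotype risks."""
    return sum(a * math.log(p) + b * math.log(1.0 - p)
               for a, b, p in zip(cases, controls, risks))


def fit_additive(cases, controls, iterations=50):
    """Newton-Raphson fit of logit(risk) = mu + gamma * g for g = 0, 1, 2."""
    mu, gamma = 0.0, 0.0
    totals = [a + b for a, b in zip(cases, controls)]
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(mu + gamma * g))) for g in range(3)]
        resid = [cases[g] - totals[g] * p[g] for g in range(3)]
        w = [totals[g] * p[g] * (1.0 - p[g]) for g in range(3)]
        grad_mu = sum(resid)
        grad_gamma = sum(g * resid[g] for g in range(3))
        i00 = sum(w)
        i01 = sum(g * w[g] for g in range(3))
        i11 = sum(g * g * w[g] for g in range(3))
        det = i00 * i11 - i01 * i01
        mu += (i11 * grad_mu - i01 * grad_gamma) / det
        gamma += (i00 * grad_gamma - i01 * grad_mu) / det
    return mu, gamma


def additivity_test(cases, controls):
    """Likelihood ratio statistic (1 d.f.) and P value for departure from additivity."""
    mu, gamma = fit_additive(cases, controls)
    additive_risks = [1.0 / (1.0 + math.exp(-(mu + gamma * g))) for g in range(3)]
    general_risks = [a / (a + b) for a, b in zip(cases, controls)]
    lrt = 2.0 * (log_likelihood(cases, controls, general_risks)
                 - log_likelihood(cases, controls, additive_risks))
    lrt = max(lrt, 0.0)  # guard against tiny negative values from rounding
    return lrt, math.erfc(math.sqrt(lrt / 2.0))
```

For example, `additivity_test([100, 200, 400], [100, 100, 100])` describes odds that double with each allele copy, so the additive model fits exactly and the statistic is essentially zero.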

Software

Several software packages were developed within the WTCCC for data analysis, data management and simulation studies. We found it necessary to normalize the Affymetrix probe intensity data to minimize chip-to-chip variability. A C++ program was written to carry out this normalization efficiently. To obtain a copy of the software please email Hin-Tak Leung at hin-tak.leung@cimr.cam.ac.uk.

We developed a new genotype calling algorithm, CHIAMO, implemented in C++. CHIAMO uses a hierarchical statistical model, which allows it to simultaneously call genotypes at all data samples. To obtain a copy of the software please email J. L. Marchini at marchini@stats.ox.ac.uk.

To perform genome-wide association analysis we developed two software packages: snpMatrix and SNPTEST. snpMatrix is an R package and is freely available from http://www-gene.cimr.cam.ac.uk/clayton/software/. Both quantitative and qualitative phenotypes can be analysed using snpMatrix, and flexible association testing functions are provided that control for potential confounding by quantitative and qualitative covariates. SNPTEST is a standalone C++ program that implements both frequentist tests and Bayesian analysis of association and allows the user to include quantitative or qualitative covariates. This program works directly with the output of CHIAMO and IMPUTE (see below). To obtain a copy of the software please email J. L. Marchini at marchini@stats.ox.ac.uk.

Genotypes at SNPs that are in HapMap but not on the Affymetrix 500K chip were imputed using the C++ program IMPUTE, which makes use of genotype information at neighbouring SNPs. To obtain a copy of the software please email J. L. Marchini at marchini@stats.ox.ac.uk.