Identification of new susceptibility loci for IgA nephropathy in Han Chinese

IgA nephropathy (IgAN) is one of the most common primary glomerulonephritis. Previously identified genome-wide association study (GWAS) loci explain only a fraction of disease risk. To identify novel susceptibility loci in Han Chinese, we conduct a four-stage GWAS comprising 8,313 cases and 19,680 controls. Here, we show novel associations at ST6GAL1 on 3q27.3 (rs7634389, odds ratio (OR)=1.13, P=7.27 × 10−10), ACCS on 11p11.2 (rs2074038, OR=1.14, P=3.93 × 10−9) and ODF1-KLF10 on 8q22.3 (rs2033562, OR=1.13, P=1.41 × 10−9), validate a recently reported association at ITGAX-ITGAM on 16p11.2 (rs7190997, OR=1.22, P=2.26 × 10−19), and identify three independent signals within the DEFA locus (rs2738058, P=1.15 × 10−19; rs12716641, P=9.53 × 10−9; rs9314614, P=4.25 × 10−9, multivariate association). The risk variants on 3q27.3 and 11p11.2 show strong association with mRNA expression levels in blood cells while allele frequencies of the risk variants within ST6GAL1, ACCS and DEFA correlate with geographical variation in IgAN prevalence. Our findings expand our understanding on IgAN genetic susceptibility and provide novel biological insights into molecular mechanisms underlying IgAN.

I gA nephropathy (IgAN) is the most common primary glomerulonephritis and a major cause of end-stage renal disease in the Chinese population. It is characterized by the deposition of IgA in the mesangial area of glomeruli, and up to 40% of the cases progress to end-stage renal diseases within 20 years of disease onset 1,2 . There is a marked regional difference in frequency of IgAN. It occurs with highest frequency in Asian populations, accounting for 45-58.2% of primary glomerular disease, with modest frequency in Caucasians (USA and Europe) and with lower frequency in the African population 2,3 . These differences, together with evidence of familial clustering, strongly suggest the presence of a substantial genetic contribution to the disease. Previous genome-wide association studies (GWASs) have identified five loci significantly associated with IgAN, namely chromosome 1q32 (CFHR3-CFHR1 genes), 6p21 (MHC), 8p23 (DEFA gene cluster), 17p13.1 (TNFSF13) and 22q12 (HORMAD2), providing the first valuable insights into genetic risk factors underlying disease mechanisms 4,5 . However, these explain only a fraction of the disease risk and it is clear that more genes and loci remain to be discovered.

Results
Genome-wide discovery analysis. We performed a new GWAS analysis (stage 1) by combining our published GWAS data set 5 (consisting 1,434 cases and 4,270 controls) with an additional 6,511 control subjects of Chinese Han ethnicity, including 981 subjects from Guangdong, 523 subjects from Shandong and 5,007 subjects from Singapore (Supplementary Table 1). In each data set, we imputed untyped single-nucleotide polymorphisms (SNPs) using the 1,000-genome multi-ethnic reference panel (Feb 2012, IMPUTE v2) (refs 6-8). After merging shared SNPs across the data sets and stringent quality control filtering (see Methods), we analysed a total of 3,792,949 autosomal SNPs in 1,434 cases and 10,661 controls. With the expanded sample size and deep imputation of untyped variants, this new GWAS analysis has improved statistical power over the previous GWAS to detect risk variants in the range of odds ratio (OR) o1.3 and minor allele frequency (MAF) o10% (Supplementary Table 2) 9 . We analysed genotype dosages taking into account imputation uncertainties 10 , using a logistic regression model assuming an additive effect for allelic risk and adjusting for population stratification using the first five principal components (PCs) as covariates . The genomic inflation factor after PC correction was low (l GC ¼ 1.087; 1.073 for SNPs with MAF Z5%, l 1,000 ¼ 1.034), suggesting minimal effects of population stratification in our discovery GWAS. In addition, we examined genotypes of 3,731,832 SNPs imputed with high confidence (info score 40.8, with call rates 495% for genotype with probabilities 40.9) using a multivariate linear mixed model implemented in GEMMA 11,12 (l GC ¼ 1.039; 1.015 for SNPs with MAF Z5%, l 1,000 ¼ 1.015) ( Supplementary Figs 2 and 3), which gave results that were largely consistent with the logistic regression analysis with PC correction (Tables 1 and 2, Supplementary Data 1).
A similar trend of association was also observed at 1q32, even though the result was not statistically significant (Supplementary  Table 3). We also observed strong and consistent evidence for the three independent association signals within the major histocompatibility complex (MHC) region 6p21 that we previously reported (Supplementary Table 3). In addition, we identified a novel independent signal at rs2295119 in the MHC region that remained genome-wide significant after conditioning on all previously reported human leukocyte antigen (HLA) SNPs (OR ¼ 1.359, logistic regression P unconditioned ¼ 3.24 Â 10 À 11 , P conditioned ¼ 7.52 Â 10 À 10 ). This SNP is in linkage disequilibrium (LD) with rs9277554 (r 2 ¼ 0.556, D 0 ¼ 0.993) which was found to be independently associated but did not reach genome-wide significance in our previous study 5 .
Our analysis also suggested three independent association signals within the DEFA gene cluster on 8p23 (each Po10 À 4 , r 2 o0.1 between each pair of SNPs). After excluding the SNPs within the five previously identified regions, a notable excess of extremely small P values was observed on the quantile-quantile plot compared with the expected null distribution ( Supplementary Fig. 9), which suggests the existence of additional associations beyond the ones already identified.
Validation analysis. We first selected the top 136 independent SNPs exceeding P o1 Â 10 À 4 in either the PC-adjusted logistic regression or GEMMA analysis that are not within the known loci (hypothesis free). Finally, by including the three independent SNPs within the 8p23 locus, a total of 139 SNPs were selected for the first validation study, but only the assays for 122 were successfully designed for multiplex genotyping analysis by Sequenom.
As the initial validation, 115 SNPs were successfully genotyped in 2,651 IgAN cases and 2,907 controls of Han Chinese ethnicity recruited from the Southern region of China (stage 2). We performed logistic regression analysis of the validation samples and combined the association statistics across the discovery (PC-adjusted or GEMMA) and validation samples by a fixedeffects meta-analysis (Supplementary Data 1). We then took forward the top 22 SNPs with Po1 Â 10 À 4 in the meta-analysis of the combined stages 1 and 2 samples for further validation analysis in an independent set of samples (stage 3) consisting of 2,428 IgAN cases and 4,202 controls recruited from the Northern (1,463 cases and 1,683 controls) and Southern (965 cases and 2,519 controls) regions of China (Supplementary Table 5). Finally, the top eight SNPs with Po5 Â 10 À 7 in the combined samples of stages 1 þ 2 þ 3 and showing consistent associations across all four sample collections were analysed in an additional independent set of 1,800 IgAN cases and 1,910 controls (stage 4) recruited from Northern (704 cases and 805 controls) and Southern (1,096 cases and 1,105 controls) China (Table 1). We then conducted a full meta-analysis of all the stages 1-4 samples ( Supplementary Fig. 1) by analysing the Northern and Southern samples of each of the four stages as independent sample collections (six independent samples in total) to minimize bias resulting from population stratification.
To further ensure that none of the novel associations were influenced by population stratification and/or batch effects among the expanded control samples of stage 1 (genotyped in different arrays), we re-examined the imputation info scores and allele frequencies of all the validated SNPs across the different control data sets and found them to be high quality and very consistent across genotyping arrays and sample collections (Supplementary Table 7). We re-examined our reported SNPs in a subset of samples (1,434 cases and 4,270 controls) that were analysed in our previous GWAS 5 , on which the imputation was performed as a batch from 444,882 overlapping autosomal SNPs that were genotyped on different Illumina chips and found that the strengths of these associations (ORs) were consistent with the results from the full data set, indicating that the new evidences of these associations were driven by improved statistical power and genetic variation coverage rather than systematic bias due to the  (Tables 1 and 2). Next, we re-ran the full meta-analysis by dividing the discovery data set into Northern (414 cases and 2,306 controls) and Southern (1,020 cases and 8,355 controls) clusters with adjustment of the top five principal components re-calculated within each cluster (see Methods) ( Supplementary Fig. 5), and the full meta-analysis results were consistent at the top loci (Supplementary Table 8). The association effects were not significantly different between the combined (discovery and validation) Northern and Southern samples (P het 40.05) (Supplementary Table 9). In addition, we did another full metaanalysis with lambda GC correction of the PC-adjusted results from the discovery data set. All the novel loci remained genomewide significant (Supplementary Table 9). Finally, the associations were not influenced by age and gender, and similar effects were observed in males and females (Supplementary Table 10).
While this manuscript was under preparation, an independent GWAS on IgAN conducted in Europeans and Asians was published and reported three novel loci ITGAX, VAV3 and CARD9 (ref. 16). Of the two previously reported independent associations within ITGAX locus, our top SNP (rs7190997) at ITGAX shows a high LD with the reported SNP rs11150612 (r 2 ¼ 0.877, D 0 ¼ 0.988). The other reported SNP rs11574637 as well as SNPs in LD with it are either very rare (MAFo1%) or absent in both our samples and HapMap Asians (http://hapmap.ncbi.nlm.nih.gov/). At the other two novel loci VAV3 and CARD9, we observed a similar direction of association at the reported SNPs 16 but the results did not reach statistical significance in our discovery samples (P40.05; Supplementary Table 3, Supplementary Fig. 8).
We have also done further investigation of the reported association at the CFH locus 4,16 by genotyping rs6677604 in our validation 2 and 3 samples and analysing the Northern and Southern samples of each collection separately in the metaanalysis of the combined discovery and validation samples (Supplementary Table 11). We observed significant association of the CFH locus (OR ¼ 1.19 (95% confidence interval Only rs2738058 is in LD (r 2 ¼ 0.71) with the reported SNP rs2738048.For the GWAS samples, only the result from logistic regression analyses (GWAS logistic) was used in the fixed-effects metaanalysis with the validation samples. *Pairwise LD (r 2 ) between rs2738058 and rs9314614 ¼ 0.001, between rs12716641 and rs2738058 ¼ 0.074. There is no LD between rs9314614 and rs12716641 (r 2 ¼ 0).
(CI) ¼ 1.07-1.31), meta-analysis P ¼ 0.0011) without evidence of heterogeneity across all the six sample collections (I 2 ¼ 0%, Cochrane's Q test P heterogeneity ¼ 0.70). The effect size in our samples was, however, smaller than what was previously reported, and a slightly larger effect was observed in our Northern (OR ¼ 1.26) than Southern (OR ¼ 1.12) samples (Supplementary Table 11).
Independent associations within the DEFA locus. We discovered three independent signals at the DEFA locus (Table 2,  Supplementary Table 12). Of these, only rs2738058 is in LD with the previously reported SNP rs2738048 (r 2 ¼ 0.71) (ref. 5), and the other two signals at rs12716641 and rs9314614 were located 76-77 kb and 124-125 kb away from rs2738058 and rs2738048, respectively (Fig. 2). While rs12716641 is located within the DEFA gene cluster, rs9314614 is located in the intron of the longcoding RNA GS1-24F4.2 and separated from the DEFA gene cluster by two recombination hotspots, likely representing an independent novel locus. Furthermore, all three are poorly correlated with rs10086568, an independent association within the DEFA locus reported in ref. 16 (r 2 o0.1 in our samples and 1,000 genomes Asians, although r 2 ¼ 0.17 with rs12716641 in Europeans). Multiple copy-number variants (CNVs) of DEFA1-A3 exist within this region 17 and were previously found to be associated with Crohn's disease 18 and severe sepsis 19 .
Recently, rs4300027 has been reported to tag the CNVs in Europeans (r 2 ¼ 0.35) (ref. 20). rs4300027 is in moderate LD with rs2738048 (r 2 ¼ 0.15), rs2738058 (r 2 ¼ 0.12) and rs12716641 (r 2 ¼ 0.28), but not rs9314614 (r 2 ¼ 0.002). This suggests that the associations at rs2738048/rs2738058 and rs12716641 may   implicate the role of DEFA1-A3 CNVs in IgAN. Further fine mapping and functional study will be needed to investigate the association of the DEFA1-A3 CNVs with IgAN and its relationship with the complex association patterns at these SNPs in Europeans and Asians.
Functional investigation through eQTL (expression quantitative trait loci) and GRAIL (genetic relationships across implicated loci) analyses. We then investigated potential biological effects of the novel associated SNPs and loci by looking for effects on mRNA expression levels (eQTLs) 21,22 , ENCODE annotations of associated variants 23,24 , documented associations with other diseases and known biological functions of nearby genes (Table 3,  Supplementary Tables 13-15, Supplementary Data 2 and 3). Three out of the four novel loci showed strong associations with the expression levels of the genes ITGAX, ITGAM, ACCS, EXT2 and ST6GAL1 in peripheral blood cells and others, suggesting an important role of regulatory variants of these genes in IgAN risk. All five SNPs also tag variants that lie in predicted protein binding sites. Although no strong eQTL effects were observed for rs2033562 near ODF1-KLF10 in blood cells, ENCODE annotations suggest it may play a role in the expression of the nearby UBR5 gene 24 . Furthermore, although there was no evidence for eQTL effects of rs2738048/rs2738058 and rs12716641, a moderate association with expression of the gene DEFB1 was observed at rs9314614 in monocytes (linear model P ¼ 4.22 Â 10 À 4 ; Supplementary Tables 13 and 15) 22 , providing further support that this SNP could tag a novel independent locus and pathway despite its close proximity to the DEFA cluster. These variants are also not in strong LD with those previously associated with other traits and diseases in Europeans or Asians (Table 3, Supplementary Table 13) 25 , and a GRAIL (https://www.broadinstitute.org/mpg/grail/) 26 analysis of previously reported and novel loci did not identify any major pathways that can account for their associations with IgAN. We anticipate that these loci highlight multiple different pathways including mucosal immune response 16 that may jointly influence IgAN pathogenesis through changes in gene expression levels. Since most of the existing Risk allele is strongly associated with increased ACCS gene expression levels in peripheral blood cells (Po9.81 Â 10 À 198 ), monocytes (P ¼ 1.07 Â 10 À 36 ), and B-cells (P ¼ 1.46 Â 10 À 29 ). It is also associated with increased EXT2 expression levels in peripheral blood cells (P ¼ 2.63 Â 10 À 53 ). This SNP and others in LD are also located at sites predicted with high likelihood to affect chromatin structure and protein binding.
Not previously associated with any other trait.
ACCS encodes a1-aminocyclopropane-1-carboxylate synthase homologue that belongs to the class-I pyridoxalphosphate-dependent aminotransferase family. Shown to interact with the protein encoded by FBF1 (Fas (TNFRSF6) binding factor 1), a keratin-binding protein that is required for epithelial cell polarization, apical junction complex assembly andciliogenesis 37,38 EXT2 is a glycosyltransferase, which plays a role in heparan sulfate biosynthesis. Heparan sulfate in turn influences angiogenesis and cell proliferation; mutations in this gene cause multiple exotoses 39,40 . rs7634389 3q27.3 SNP location: Lies within an intron of the ST6GAL1.
Risk allele is in strong LD (r 2 40.9 in our samples, HapMap Europeans and Asians) with variants associated with decreased expression levels of ST6GAL1 in peripheral blood cells (rs3821819; P ¼ 5.96 Â 10 À 20 ) and B-cells (rs17776120; P ¼ 1.61 Â 10 À 7 ). This SNP and others in LD may affect chromatin structure and protein binding.
ST6GAL1 encodes ST6 betagalactosamide alpha-2,6sialyltranferase, a member of glycosyltransferase family involved in the generation of the cell-surface carbohydrate determinants and differentiation antigens. Regulates macrophage apoptosis via alpha2-6 sialylation of the TNFR1 death receptor. May play a regulatory role in innate immune response. Up-regulated in human cancers 41,42 . rs2033562 8q22.3 SNP location: Located in an intergenic region closest to the genes ODF1, KLF10 and UBR5.
No evidence for effects on gene expression levels in peripheral blood cells, monocytes, B-cells. In LD with SNPs with potential regulatory effects on UBR5 expression, chromatin structure and protein binding.
Chronic lymphocytic leukaemia (CLL) Not in LD with CLLassociated SNP.
KLF10 encodes a transcriptional repressor that acts as an effector of transforming growth factor beta signalling and activity of this protein may inhibit the growth of cancers. UBR5 encodes a E3 ubiquitin ligase which interacts with the deubiquitinase DUBA which in turns plays a role in IL-17 production in T-cells and inflammatory response in the small intestine 43 . SNP, single-nucleotide polymorphism. Other recently and previously reported loci are described in Supplementary Table 13. ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms8270 expression data sets were generated in healthy individuals that were mostly of European descent 21,22 , a more careful analysis will be needed to examine the association of these genotypes on the expression levels of these genes in Asian subjects, and to evaluate their biological and clinical relevance in IgAN patients and controls.
At the ST6GAL1 locus, we observed a trend of increasing allele frequencies of risk allele C at rs7634389 from Africans (16%), Europeans (37.5%) to Asians (40%) in HapMap populations (Supplementary Table 16). Similarly, the ACCS SNP rs2074038 risk allele T is absent in Africans, present at moderate frequency in Europe (12%) and highest frequency in Asia (33%). Similar observations were made at two independent variants in the DEFA cluster (Supplementary Table 16). Consistent with previous reports on the geographical distribution of risk variants within 1q32, 22q12 (ref. 1), and more recently, 16p11.2 (ref. 16), these trends of increasing risk allele frequencies suggest that pooled differences in risk allele frequencies of these loci combined may contribute to differences in disease prevalence across different world populations 1,4,16 .

Discussion
This study has several advantages over our previous study. Firstly, we have performed deep genome-wide imputation in the discovery samples, leading to the discovery of loci tagged by SNPs that were not previously represented on SNP arrays. Secondly, the addition of 6,391 controls led to a moderate gain in statistical power to detect risk alleles with small effect sizes (ORo1.2). Thirdly, we have expanded our validation data sets to include both Northern and Southern Chinese samples that have provided firm independent replication of the association signals. We used geographic matching as a proxy for genetic matching to control the effects of population stratification in our validation data sets, as has been done in previous studies on the Chinese population 27 , although ancestry informative markers may also be helpful. We also acknowledge a number of limitations. Despite deep imputation, we expect that the coverage of low frequency and rare variants (minor allele frequencies o5%) is limited given that most of the imputed rare variants in our study had a low information score that did not pass our strict quality control thresholds and were therefore excluded from further analysis. Also, with the current sample size, we had limited power to analyse lower-frequency variants with minor allele frequencies below 5% (Supplementary Table 2) 9 . Further studies will be needed to directly sequence or genotype rare variants on a larger number of cases and controls to evaluate their role in IgAN risk.
We have conducted the largest study on IgAN in the Han Chinese population to date by analysing a total of 8,313 IgAN cases and 19,680 controls. We have discovered three new loci at 11p11.2, 8q22.3 and 3q27.3, two novel independent associations within the DEFA-DEFB gene cluster and validated the recently reported locus at 16p11.2 (ref. 16). We estimate that these novel association signals explain about 1.7% of the disease variance and 5.5% of the variance in combination with the previously published loci 4,5 . Some risk variants show significant difference in frequency and may contribute to IgAN prevalence differences across world populations. Our study has significantly expanded our understanding on the genetic basis of IgAN susceptibility and provided novel insight into the mechanisms underlying the development of IgAN.

Methods
Study subjects. The original genome-wide discovery analysis involved 1,523 cases (from Southern China) and 4,276 controls (972 controls from southern China, 1,228 controls from Northern China and 2,076 Chinese controls from Singapore who share the same ancestral origin as the other Chinese controls) 5 . To boost the statistical power of the current study, we included 6,511 control subjects of Chinese Han ethnicity from several of our previous GWAS including 981 subjects from Guangdong, 523 subjects from Shandong and 5,007 subjects from Singapore. For the validation study, three independent case-control samples were recruited from China as replication 1 (2,651 IgAN cases and 2,907 controls), replication 2 (2,428 cases and 4,202 controls) and replication 3 (1,800 IgAN cases and1, 910 controls) (Supplementary Note 1).
All the cases were histopathologically diagnosed by biopsy according to the following criteria: (i) immunofluorescence showing at least 2 þ (scale 0 to 3 þ ) mesangial deposition of IgA, with IgA comprising the dominant immunoglobulin deposited in the glomeruli and (ii) excluding individuals with cirrhosis, Henoch-Schönlein purpura nephritis, hepatitis B-associated glomerulonephritis, HIV infection and systemic lupus erythematosus 5 . In accordance to the Oxford Classification of IgAN, our samples were graded by the four pathological features (mesangial hypercellularity M, endocapillary hypercellularity E, segmental glomerulosclerosis S and tubular atrophy/interstitial fibrosis T, resulting in a MEST score; Supplementary Note 1) 28,29 .
The study was approved by the Institutional Review Board at The First Affiliated Hospital of Sun Yat-sen University and at the National University of Singapore. Written informed consent was obtained from all of the participants.
Sample genotyping and quality control. Genomic DNA was isolated from whole blood using a Qiagen DNA extraction kit and quantified using a Picogreen assay (Invitrogen). Genotyping analysis of the discovery samples was conducted using Human660-Quad (1,523 cases), Human610-Quad (1,953 southern and 1,228 Northern Chinese controls and the 3,998 Singaporean Chinese controls, 523 Shandong controls), Human 550K (1,022 Singaporean Chinese controls), Human 1M-Duo (930 Singaporean Chinese controls) and Human OmniExpress (1,133 Singaporean Chinese controls) BeadChips (Illumina).We excluded SNPs from the X, Y and mitochondrial chromosomes and focused all further analyses on autosomal SNPs. We performed identity by descent analysis using PLINK 30 and 104 first degree relative pairs were identified; the relative with a lower sample call rate was excluded. Principal components analysis 31 (Eigensoft v3.0: http:// genepath.med.harvard.edu/Breich/Software.htm) was done on using a set on 47,462 common SNPs (MAF41%) that were derived from 250,201 genotyped SNPs overlapping across all arrays. These SNPs were pruned to remove SNPs in LD (r 2 40.1, using PLINK --indep-pairwise 50 5 0.1) after exclusion of SNPs in the five conserved long-range LD regions in Chinese, namely the HLA region on chromosome 6, inversions on chromosomes 8 and 5 and two regions on chromosome 11 (refs. 5,27). After principal components analysis, 16 outliers were identified based on principal components (PCs) 1-5 and excluded such that 12,095 samples (1,434 cases and 10,661 controls) were left for the final analysis.
HLA imputation and analysis. Imputation of two-and four-digit classical HLA alleles was performed using the SNP2HLA tool (https://www.broadinstitute.org/ mpg/snp2hla/) and the Pan-Asian reference panel [13][14][15] using the same set of directly genotyped SNPs within chr6:20-40 Mb (Hg19, build 37) that passed quality control thresholds in each of the data sets as described above. We analysed genotype dosages of all imputed HLA alleles, with most of the reported alleles having high imputation confidence (r 2 40.9; Supplementary Table 4).
Genotyping and quality controls in the validation study. Genotyping analysis of the SNPs selected for validation was performed using the MassArray system from Sequenom. Locus-specific PCR and detection primers were designed using the MassArray Assay Design 3.0 software (Sequenom). SNPs that failed Sequenom design or genotyping in validation 2 and 3 samples were genotyped using Taqman assays (Life Technologies). TaqMan reactions were carried out in 5-ml volumes containing 10-20 ng DNA according to the manufacturer's protocols. Fluorescence data were obtained in the ABI PRISM 7900HT and SDS 2.4 software (Life Technologies) was used to call genotypes. For all SNPs, we examined the clustering patterns of genotypes and selected mass peaks and confirmed that the genotype NATURE COMMUNICATIONS | DOI: 10.1038/ncomms8270 ARTICLE calls were of good quality. All SNPs with call rates o95% and/or HWE Po1 Â 10 À 6 in controls and all samples with call rates of o90% were removed from further analysis from each batch (validations 1-3). After quality control, 130 SNPs from validation 1, 24 SNPs from validation 2 and 10 SNPs from validation 3 (including CFH SNP rs6677604) were left for further analysis.
Association tests. Imputed dosage data were analysed using SNPTEST (https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html) 10 . Genomewide case-control analysis was performed using frequentist tests under a missing data logistic regression model, as implemented in SNPTEST. We assumed an additive model for allelic risk, with the first five principal components as covariates to control for population stratification. To evaluate the effects of mixing Northern and Southern Chinese samples in the discovery analysis, we also split the discovery cohort into two clusters based on the first two principal components 30 , one of which is predominantly Northern Chinese (414 cases and 2,306 controls), and the other predominantly Southern Chinese (1,020 cases and 8,355 controls). In each cluster, we re-ran the principal components analysis and performed the association analysis adjusted for PCs1-5 in that cluster, then meta-analysed the PC-adjusted results. The results were found to be similar and hence we kept the results from the first analysis in which we combined Northern and Southern samples. Finally, we ran Wald tests using a multilvariate linear mixed model to correct for population stratification as implemented in the package GEMMA [6][7][8] . SNPs with P values differing by more than three log 10 between the two methods were excluded from further validation. Lambda 1,000 was calculated as a standardized estimate of the genomic inflation regardless of the sample size of the study 33,34 , using the following formula: l 1,000 ¼ 1 þ (1 À l obs ) Â (1/n cases þ 1/n controls )/(1/1,000 cases þ 1/1,000 controls ).
For the validation studies, we performed the trend test in a logistic regression model, analysing samples from Northern and Southern regions of China in validation 2 as separate sample collections to control for potential confounding by population stratification. To combine the association statistics from the GWAS and the three replication samples, we conducted a fixed-effects inverse variance metaanalysis using PLINK (http://pngu.mgh.harvard.edu/Bpurcell/plink/) 30 . Tests for independent association signals were carried out using conditional logistic regression analyses implemented in PLINK 30 . Haplotype analyses were conducted by phasing genotypes of interest using PHASE (http://stephenslab.uchicago.edu/ software.html#phase) 35 and analysing phased haplotypes using logistic regression analyses on PLINK 30 . Regional association plots, centred on the top SNP, were generated using LocusZoom (http://csg.sph.umich.edu/locuszoom/) 36 to represent results from the logistic regression and fixed-effects meta-analysis at each stage of the study.
Furthermore, to evaluate the effects of age and gender on the association results, we entered age and gender as covariates into the logistic regression model and compared the results without adjustment within the same subset of samples with age and gender information available (6,323 cases and 17,349 controls; 84.6%). Each of the four stages were analysed separately and the results combined in a meta-analysis as before. Stratified analysis was performed by analysing males (3,067 cases and 9,993 controls) and females (3,324 cases and 7,371 controls) separately, among 6,391 cases and 17,364 controls with gender information available. Association results were also based on the combined meta-analysis across all four stages. Cochrane's Q test was used to test for heterogeneity in effect sizes between males and females, and also between Northern and Southern samples.
Fraction of variance explained by loci. The percentage of the total variance explained was estimated by calculating Nagelkerke's pseudo R 2 using the fmsb package (http://cran.r-project.org/web/packages/fmsb/index.html), from the result of entering SNP genotypes and affection status into the glm function in R (v 2.15.1).