Introduction

Ulcerative colitis (UC), a subtype of inflammatory bowel disease (IBD), is a complex autoimmune disorder with severe medical consequences. Multiple genetic, environmental and immunological factors and their interactions contribute to susceptibility to the disease.1 The condition is emerging as an important health problem in India, with an incidence rate of 6.02 per 100 000 persons per year and a crude prevalence rate of 44.3 per 100 000 individuals, which is comparable to the west, where incidence is 3–15 per 100 000 per year and prevalence is 50–80 per 100 000. These figures are, however, much higher than those of other Asian countries such as Japan and Korea, with incidence rates of 1.95 and 1.23 per 100 000 per year, respectively, and prevalence rates of 5.5–18.12 and 7.57 per 100 000, respectively.2

Over the preceding years, several potential UC-associated loci were identified, initially via genome-wide linkage scans and thereafter by genome-wide association studies (GWASs) and their meta-analyses, revealing new insights into UC pathogenesis.3, 4, 5, 6, 7, 8, 9 However, most of these studies were carried out in European populations. Recently, the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC) conducted a trans-ancestry study using a new genotyping array called Immunochip, which was designed to densely genotype risk loci that overlap among common immune-mediated diseases. This study substantially increased the number of known genetic risk loci for IBD to 200.10 The non-European UC GWASs performed to date also identified novel susceptibility loci11 and revealed UC risk loci shared between European and non-European cohorts.12 These UC-specific studies also confirmed the long-established association between UC and the classical human leukocyte antigen (HLA) locus, which contains genes encoding antigen-presenting proteins and plays a crucial role in the regulation of the adaptive immune system.

Our GWAS on UC, the first in the genetically distinct north Indian (NI) population, identified seven novel susceptibility genes, namely CFB, SLC44A4, 3.8-1/HCG26, MSH5, NOTCH4, HSPA1L and BAT2, from the extended HLA region; these were shown to be HLA independent by conditional regression analysis.13 Of these seven genes, the two most significant hits, namely CFB (rs4151657; P=5.10 × 10−14) and SLC44A4 (rs2736428; P=4.86 × 10−11), were selected for further analysis.

Complement activation can occur via three pathways: the classical, the alternative and the lectin pathway. CFB (complement factor B; 6141 bp) encodes a secreted protein that is involved in the alternative pathway of complement activation and is expressed mainly by the liver and mononuclear phagocytes.14, 15 The complement system is involved in lysis of pathogens, opsonization, inflammation and immune clearance,16 and therefore requires tight regulation. Improper regulation of the complement system has been implicated in a number of autoimmune and inflammatory disorders.17 Variations within CFB have previously been associated with age-related macular degeneration18 and atypical hemolytic uremic syndrome,19 suggesting a potential role in inflammatory disorders. A recent study20 showed overexpression of CFB mRNA in inflamed versus normal colonic mucosa of IBD patients, suggesting that inappropriate activation of the complement system contributes to chronic inflammation, one of the hallmarks of UC. This supports a role for CFB in UC etiology and is consistent with our GWAS findings. On this basis, complete exon resequencing of CFB was carried out in 50 NI UC cases to identify novel UC-associated variant(s). It revealed five previously reported SNPs: one non-synonymous SNP in exon 1 (rs4151667 T>A), two adjacent non-synonymous SNPs in exon 2 (rs12614 C>T, rs641153 G>A) and two synonymous SNPs (rs1048709 G>A in exon 3 and rs4151669 G>A in exon 4), all of which were in the same haplotype block (D′=1) as the GWAS index SNP rs4151657 in intron 10. Of these, rs12614 was predicted to be the most damaging by in silico analysis and was taken forward for functional analysis.
The percentage alternative pathway activity, assessed in 52 UC case serum samples comprising 29 wild-type homozygous (CC) and 23 heterozygous or homozygous variant (CT+TT) genotypes of rs12614, was significantly (P=0.01) lower in the latter group.13 These findings point to lower hemolytic activity of variant CFB, which is consistent with the autoimmune nature of the disease: the resultant lower efficiency of pathogen clearance would increase susceptibility to infections and, consequently, to disease development.

Next, an extensive investigation of structural and regulatory variants within SLC44A4 (solute carrier family 44, member 4; 15855 bp) was undertaken,21 which revealed possible functional relevance of this gene in UC biology. The protein encoded by this gene, also named TPPT (thiamine pyrophosphate transporter), is a transmembrane thiamine pyrophosphate transporter expressed mainly in the colon. It has been suggested that TPPT plays an important role in the uptake of thiamine pyrophosphate generated in the colon by gut microbiota, thus contributing to thiamine nutrition, especially of the colonocytes.22 It has been observed that chronic fatigue in IBD is a consequence of mild thiamine deficiency.23

Given the biological relevance of CFB and SLC44A4 in UC pathogenesis as exemplified by our work, the present study evaluated allelic heterogeneity in these two genes across three genetically divergent populations namely NI, Japanese and Dutch to (a) corroborate our GWAS findings and (b) identify population-specific signals by utilizing high-density ImmunoChip genotype data generated as a part of the IIBDGC project.

Subjects and methods

ImmunoChip genotype data and quality control

Genotype data for a total of 28 SNPs within CFB (~6 kb) and 22 SNPs within SLC44A4 (~16 kb) were retrieved from the genotype data generated on the Illumina Infinium ImmunoChip platform, a custom-made chip with 196 524 markers used in a recently completed trans-ethnic ImmunoChip study.10 Sample quality control (QC) for the Indian and Japanese study samples was done using PLINK v1.07 (http://pngu.mgh.harvard.edu/purcell/plink/).24 Samples with ambiguous sex, a missing genotype rate ≥0.02 or an outlying heterozygosity rate (threshold=mean±4 SD) were removed. Sample QC for the Dutch study samples is detailed elsewhere.10
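The three sample-level filters can be illustrated with a minimal Python sketch. This is not the PLINK procedure used in the study: the record layout (id, sex_ambiguous, missing_rate, het_rate) and the function name sample_qc are hypothetical, and the heterozygosity bounds are assumed to be computed over all samples in the batch before any exclusion.

```python
from statistics import mean, stdev


def sample_qc(samples, max_missing=0.02, n_sd=4.0):
    """Return the IDs of samples that pass the sample-level filters.

    `samples` is a list of dicts with hypothetical keys:
    id, sex_ambiguous (bool), missing_rate (float), het_rate (float).
    """
    # Heterozygosity bounds: mean +/- n_sd standard deviations (assumed batch-wide).
    het = [s["het_rate"] for s in samples]
    mu, sd = mean(het), stdev(het)
    lower, upper = mu - n_sd * sd, mu + n_sd * sd

    kept = []
    for s in samples:
        if s["sex_ambiguous"]:
            continue  # ambiguous sex
        if s["missing_rate"] >= max_missing:
            continue  # missing genotype rate >= 0.02
        if not (lower <= s["het_rate"] <= upper):
            continue  # outlying heterozygosity rate
        kept.append(s["id"])
    return kept
```

With a threshold as wide as 4 SD, a single extreme sample can only be flagged when the batch is reasonably large, because the outlier itself inflates the standard deviation.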

Study participants

Indian UC patients and controls were self-reported north Indians, recruited from Dayanand Medical College and Hospital, Ludhiana, Punjab state. These were a subset of the larger cohort previously used for the GWAS, as detailed elsewhere.13 Japanese UC patients were recruited from the Kyushu University with 25 affiliated hospitals. Controls were collected from the Midosuji and other related Rotary Clubs and the BioBank Japan project. All these samples were used in previous studies.11, 25 Dutch UC patients were recruited from the outpatient IBD clinic at the Department of Gastroenterology and Hepatology, University of Groningen and University Medical Center Groningen, the Netherlands. Control DNA samples were derived from healthy blood donors. All these samples were used in previous studies.9 All three sample sets were included in the recent ImmunoChip analysis.10 Briefly, UC subjects were diagnosed according to standard clinical diagnostic criteria. The controls were age-, sex- and ethnicity-matched healthy unrelated blood donors with no history of chronic inflammatory, autoimmune or infectious diseases. Informed consent was obtained from each participant, and approval for the study was obtained from the ethical committees of the respective institutions.

Statistical analyses

Firstly, LD was estimated in each of the three populations using Haploview 4.2 (http://www.broadinstitute.org/haploview/haploview).26 We next performed single SNP and haplotypic association analyses using PLINK v1.07 (http://pngu.mgh.harvard.edu/purcell/plink/).24 Sliding window haplotypes were generated using UNPHASED 3.1.5.27 P-values for individual marker and sliding window haplotypes were represented graphically using Graphical Assessment of Sliding P-values (GrASP v0.82 beta) (http://research.nhgri.nih.gov/GrASP/)28 to present and assess P-values from multiple tests.
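As an illustration of the two LD measures reported in this study (D′ and r2), the following Python sketch computes both for a pair of biallelic SNPs. It assumes phased haplotype counts are already available; Haploview instead estimates haplotype frequencies from unphased genotypes, so this is a simplified stand-in rather than the method used here. The function name ld_stats and the counts in the usage note are hypothetical.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D', r^2) from counts of the four two-SNP haplotypes.

    The first letter is the allele at SNP 1 (A/a), the second the allele
    at SNP 2 (B/b). Counts are assumed to come from phased data.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n  # frequency of allele A at SNP 1
    p_B = (n_AB + n_aB) / n  # frequency of allele B at SNP 2

    # Raw disequilibrium coefficient.
    d = p_AB - p_A * p_B

    # D' rescales D by its maximum attainable value given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the two loci.
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2
```

For example, ld_stats(20, 0, 30, 50) gives D′ of 1 but r2 of only 0.25: one of the four haplotypes is absent, yet the allele frequencies at the two SNPs differ. This is why two SNPs can share a haplotype block (D′=1) without being good proxies for each other, and why proxy selection is better guided by r2.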

In silico analysis of SNPs

SIFT (http://sift.jcvi.org/),29 PolyPhen2 (http://genetics.bwh.harvard.edu/pph2/),30 PolymiRTS (http://compbio.uthsc.edu/miRSNP/)31, 32 and RegulomeDB (http://regulomedb.org/)33 were used for in silico characterization of the SNPs analyzed in this study.

The association data for NI, Japanese and Dutch populations have been submitted to GWAS central database (Submission ID: HGVST 1840) available at the URL http://www.gwascentral.org/study/HGVST1840.

Results

CFB

ImmunoChip genotype data for 28 CFB SNPs (Table 1) obtained for NI (897 cases and 896 controls), Japanese (719 cases and 3263 controls) and Dutch (1729 cases and 1350 controls) UC case–control cohorts were tested for allelic and haplotypic association separately and population-wise results are presented below.

Table 1 Association status of CFB SNPs with UC in north Indian, Japanese and Dutch populations

NI UC cohort

CFB coverage on ImmunoChip, QC and LD profile

Of the 28 SNPs, 13 were monomorphic and one deviated from Hardy–Weinberg equilibrium (HWE) (P=2.08 × 10−7). The 14 remaining SNPs, each with HWE P>10−3, MAF >0.001 and ≥99.7% genotyping efficiency (Table 1), were taken forward for analysis. Among these, three exonic SNPs, namely rs4151667, rs4151669 and rs4151672, were in LD (r2>0.9) with each other, and an intronic SNP, rs541862, was in LD (r2=0.78) with the exonic SNP rs2072634 (Figure 1).
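The marker-level filters on HWE and MAF can be sketched in Python as follows. The sketch uses a one-degree-of-freedom chi-square goodness-of-fit test as a simple stand-in; the test actually applied by PLINK may differ, and the text does not state in which samples HWE was assessed. The function names and the genotype counts in the usage note are hypothetical.

```python
import math


def hwe_chi2_p(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele a
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_snp_qc(n_aa, n_ab, n_bb, hwe_min=1e-3, maf_min=0.001):
    """Apply the HWE P>10^-3 and MAF>0.001 filters to one SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    maf = min(p, 1 - p)
    return maf > maf_min and hwe_chi2_p(n_aa, n_ab, n_bb) > hwe_min
```

A SNP with genotype counts 100/200/100 passes both filters, one with counts 200/0/200 fails HWE, and a monomorphic SNP (400/0/0) fails the MAF filter.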

Figure 1

LD plots of CFB SNPs analyzed in NI, Japanese and Dutch populations. LD plot of CFB in NI displaying (a) D′ values and (b) r2 values, LD plot of CFB in Japanese displaying (c) D′ values and (d) r2 values, LD plot of CFB in Dutch displaying (e) D′ values and (f) r2 values.

Allelic association

Of the 14 SNPs, the Indian GWAS index SNP rs4151657 (intron 10) was the most significant (unadjusted P=1.73 × 10−10), and five others namely rs12614 (exon 2), rs13194698 (intron 2), rs1048709 (exon 3), rs4151670 (exon 5) and rs17201431 (intron 6) were nominally associated at P≤0.05 (Table 1).
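A single-SNP allelic test of the kind reported here can be illustrated with a short Python sketch. It assumes a basic one-degree-of-freedom chi-square test on a 2 × 2 table of allele counts; the text does not specify which PLINK test was run, so this is an assumption, and the function name and the counts in the usage note are hypothetical rather than taken from Table 1.

```python
import math


def allelic_test(case_minor, case_major, ctrl_minor, ctrl_major):
    """Return (odds ratio, chi-square, P) for a 2x2 table of allele counts."""
    a, b, c, d = case_minor, case_major, ctrl_minor, ctrl_major
    n = a + b + c + d
    # Pearson chi-square for a 2x2 table, without continuity correction.
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return odds_ratio, chi2, p_value
```

For instance, allelic_test(60, 40, 40, 60) gives an odds ratio of 2.25 and a chi-square of 8. Allele counts are twice the number of genotyped individuals, so a cohort of 897 cases contributes at most 1794 alleles per SNP.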

Haplotypic association

Of the three exonic SNPs in LD, namely rs4151667, rs4151669 and rs4151672, only rs4151667 was used as a proxy, as it was non-synonymous and predicted to be damaging in silico; of rs541862 and rs2072634, which were also in LD, the exonic SNP rs2072634 was retained. Using these 11 markers, 1–11 marker sliding window haplotypes constructed in PLINK yielded 66 sliding windows and a total of 280 haplotypes with minimum frequency ≥0.01 (Supplementary Table S1). The threshold P-value was set at <1.8 × 10−4 after Bonferroni correction. A number of haplotypes were significantly associated. A four-marker haplotype (rs17201431–rs537160–rs2072634–rs4151657; A–G–G–G) was the smallest haplotype, encompassing the GWAS index SNP rs4151657, that was most significantly associated (P=4.4 × 10−11). Of the 11-marker haplotypes generated, one predisposing haplotype (T–G–G–G–G–G–A–G–G–G–G), with frequency 0.42 in cases and 0.31 in controls (377 cases and 278 controls), was significantly associated (P=2.7 × 10−11); it contained the predisposing alleles of the GWAS index SNP rs4151657 and of all SNPs showing allelic association except rs17201431. Global P-values of 1–11 marker sliding window haplotypes, generated using UNPHASED 3.1.5 with a minimum haplotype frequency threshold of 0.001 and graphed using GrASP v0.82 beta, are presented in Supplementary Table S2 and Figure 2. Of the 11-marker haplotypes generated in this analysis, the same haplotype (T–G–G–G–G–G–A–G–G–G–G) was significantly associated (P=2.4 × 10−10). As can be seen from Figure 2, rs12614, rs13194698 and rs4151657 seem to be the main contributors within CFB.
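The window counts and Bonferroni thresholds quoted for each cohort follow from simple arithmetic, sketched below in Python. The sketch assumes that every window size from 1 to n markers is used and that the correction divides a significance level of 0.05 by the number of haplotypes tested; the text does not state the denominator explicitly, but the reported thresholds are consistent with this reading. The function names are hypothetical.

```python
def sliding_windows(markers):
    """Yield every contiguous window of markers, from size 1 to len(markers)."""
    for size in range(1, len(markers) + 1):
        for start in range(len(markers) - size + 1):
            yield tuple(markers[start:start + size])


def n_sliding_windows(n_markers):
    """Closed form for the number of windows: n(n + 1) / 2."""
    return n_markers * (n_markers + 1) // 2


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test P-value threshold under Bonferroni correction (alpha assumed)."""
    return alpha / n_tests


# NI CFB analysis: 11 markers and 280 haplotypes tested.
print(n_sliding_windows(11))                  # 66
print(f"{bonferroni_threshold(280):.1e}")     # 1.8e-04
```

The same two functions give 55 windows and a threshold of 2.7 × 10−4 for 10 markers and 187 haplotypes, and 120 windows and 7.2 × 10−5 for 15 markers and 695 haplotypes. Because overlapping windows are strongly correlated, a correction of this kind is conservative.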

Figure 2

CFB haplotype associations illustrated using graphical assessment of sliding P-values (GrASP v0.82 beta) in (a) NI, (b) Japanese and (c) Dutch populations.

Japanese UC cohort

CFB coverage on ImmunoChip, QC and LD profile

Of the 28 SNPs in CFB on ImmunoChip, 2 were not called and 13 were monomorphic/uninformative in the Japanese UC cohort. Of the 13 remaining SNPs, each with MAF >0.001, HWE P>10−3 and ≥99.7% genotyping efficiency (Table 1), the two exonic SNPs rs4151667 and rs4151672 were in LD (r2=1) with each other, an exonic SNP rs1048709 was in LD (r2=0.78) with an intronic SNP rs537160, and an intronic SNP rs541862 was in LD (r2>0.9) with an exonic SNP rs2072634, which is slightly different from the NI pattern (Figure 1). These 13 SNPs were taken forward for analysis.

Allelic association

The NI UC GWAS index SNP rs4151657 was significantly associated (P=2.02 × 10−12), and eight other SNPs, namely rs4151667 (exon 1), rs12614 (exon 2), rs13194698 (intron 2), rs1048709 (exon 3), rs4151670 (exon 5), rs537160 (intron 7), rs2072633 (intron 17) and rs4151672 (3′ UTR), showed nominal association (P≤0.05) (Table 1).

Haplotypic association

Of the two exonic SNPs namely rs4151667 and rs4151672 in LD, rs4151667 was retained as it is non-synonymous and damaging on in silico predictions; of rs1048709 and rs537160 in LD, rs1048709 was retained as it is exonic and damaging on in silico predictions; and of rs541862 and rs2072634 in LD, rs2072634 was retained as it is exonic. 1–10 marker sliding window haplotypes constructed on PLINK generated 55 sliding windows and a total of 187 haplotypes with minimum frequency ≥0.01 (Supplementary Table S3). The threshold P-value of <2.7 × 10−4 was set after Bonferroni correction was applied. A number of haplotypes were found significantly associated. A three-marker haplotype (rs17201431–rs2072634–rs4151657) was the smallest haplotype (A–G–G), encompassing the GWAS index SNP rs4151657 that showed most significant association (P=9.6 × 10−13). Of the 10 marker haplotypes that were generated, the same predisposing haplotype (T–C–C–G–G–C–T–C–C–G) that was found associated in NI was found associated in Japanese population (P=2.6 × 10−11) as well, with frequency 0.53 in cases and 0.43 in controls (384 cases and 1407 controls). 1–10 marker sliding window haplotypes generated using UNPHASED 3.1.5 and GrASP v0.82 beta, keeping a minimum haplotype frequency threshold of 0.001, revealed the same pattern of association (Supplementary Table S2 and Figure 2). From Figure 2, it is apparent that the three SNPs namely rs1048709, rs4151657 and rs2072633 are the main drivers for association of this region to UC.

Dutch UC cohort

CFB coverage on ImmunoChip, QC and LD profile

Of the 28 CFB SNPs on ImmunoChip, three were not in 1000 Genomes, one failed HWE, one failed heterogeneity, three were monomorphic and three had MAF<0.001 in the Dutch UC cohort. The remaining 17 SNPs, each with MAF >0.001, HWE P>10−3 and ≥99.8% genotyping efficiency (Table 1), were analyzed further. Of these, three exonic SNPs, namely rs4151667, rs4151669 and rs4151672, were in LD (r2=0.99) with each other (Figure 1).

Allelic association

The NI GWAS index SNP rs4151657 showed only nominal association (P=0.002), along with two other intronic SNPs, namely rs537160 (P=4.29 × 10−5) and rs2072633 (P=0.003) (Table 1).

Haplotypic association

Of the three exonic SNPs namely rs4151667, rs4151669 and rs4151672 which were in LD, only rs4151667 was retained as it was non-synonymous and damaging on in silico predictions. 1–15 marker sliding window haplotypes constructed on PLINK generated 120 sliding windows and a total of 695 haplotypes with minimum frequency ≥0.01 (Supplementary Table S4). Some haplotypic combinations withstood the Bonferroni corrected P-value threshold of <7.2 × 10−5. A five-marker haplotype (rs4151651–rs4151652–rs17201431–rs512559–rs537160) was the smallest haplotype (G–G–A–A–A) showing most significant association (P=2.07 × 10−6). It was a protective haplotype with a frequency of 0.3 in cases and 0.35 in controls. 1–15 marker sliding window haplotypes generated using UNPHASED 3.1.5 and GrASP v0.82 beta, keeping a minimum haplotype frequency threshold of 0.001 revealed similar pattern of association (Supplementary Table S2 and Figure 2). As can be seen from Figure 2, rs537160 seems to be the only contributor within this region to UC.

In silico analysis of CFB SNPs

SIFT and PolyPhen2 predictions for the four missense SNPs, namely rs4151667 (exon 1), rs12614 (exon 2), rs4151651 (exon 5) and rs4151659 (exon 13), showed the first two to be damaging (Supplementary Table S5). The 3′ UTR SNP rs4151672 was checked in PolymiRTS Database 3.0: the reference allele C was found to disrupt two conserved miRNA sites and the variant allele T to create a new miRNA site, making this SNP possibly functional. RegulomeDB predicted most SNPs to lie near DNA features or regulatory elements such as transcription factor-binding sites and to affect protein binding. Of note, three SNPs, namely rs1048709 (exon 3), rs17201431 (intron 6) and rs2072633 (intron 17), which showed allelic association in at least one of the three populations (Table 1), were predicted to have cis-eQTL effects on a number of HLA genes in the vicinity of CFB (Supplementary Table S6), which may suggest that CFB acts via HLA genes.

SLC44A4

ImmunoChip genotype data for 22 SLC44A4 SNPs (Table 2) obtained for NI (897 cases and 896 controls), Japanese (724 cases and 3271 controls) and Dutch (1729 cases and 1350 controls) UC case–control cohorts were tested for allelic and haplotypic association separately and population-wise results are presented below.

Table 2 Association status of SLC44A4 SNPs with UC in north Indian, Japanese and Dutch populations

NI UC cohort

SLC44A4 coverage on ImmunoChip, QC and LD profile

Of the 22 SNPs, one was monomorphic and one deviated from Hardy–Weinberg equilibrium (HWE) (P=7 × 10−4). The 20 remaining SNPs, each with HWE P>10−3, MAF>0.001 and ≥99.9% genotyping efficiency (Table 2), were taken forward for analysis. Nine SNPs namely rs660594, rs577272, rs644827, rs644774, rs2242665, rs2242664, rs3132442, rs3130481 and rs3130482 were in LD (r2≥0.88) with each other; rs494620 was in LD (r2=0.83) with rs614549 and rs521977 and rs9267659 were also in LD (r2=0.82) with each other (Figure 3).21

Figure 3

LD plots of SLC44A4 SNPs analyzed in NI,21 Japanese and Dutch populations. LD plot of SLC44A4 in NI displaying (a) D′ values and (b) r2 values, LD plot of SLC44A4 in Japanese displaying (c) D′ values and (d) r2 values, LD plot of SLC44A4 in Dutch displaying (e) D′ values and (f) r2 values.

Allelic association

The NI GWAS index SNP rs2736428 (intron 2) was the most significantly associated (P=4.94 × 10−10), while 13 others were nominally associated at P≤0.05, namely rs4947332 (intron 13), rs660594 (intron 12), rs577272 (intron 11), rs644827 (exon 11), rs644774 (intron 10), rs494620 (exon 10), rs12661281 (exon 6), rs2242665 and rs2242664 (exon 8), rs3132442, rs3130481, rs3130482 and rs614549 (intron 7) (Table 2).

Haplotypic association

Of the nine SNPs in LD as mentioned above, rs644827 was selected as proxy as it was an exonic missense variant and seemed more damaging than others on in silico predictions. Of rs494620 and rs614549 in LD, rs494620 was retained as it was exonic; and of the two intronic SNPs rs521977 and rs9267659 in LD, rs9267659 was retained as it seemed more likely to have regulatory effects as predicted on RegulomeDB. 1–10 marker sliding window haplotypes constructed on PLINK generated 55 sliding windows and a total of 308 haplotypes with minimum frequency ≥0.01 (Supplementary Table S7). The threshold P-value of <1.6 × 10−4 was set after Bonferroni correction was applied. A number of haplotypes were found significantly associated. A six marker haplotype (rs9461727–rs4947332–rs693906–rs11965547–rs644827–rs494620) was the smallest haplotype (C–G–G–G–G–A) showing most significant association (P=5.97 × 10−11) with a frequency of 0.42 in cases and 0.32 in controls. 1–10 marker sliding window haplotypes generated using UNPHASED 3.1.5 and GrASP v0.82 beta, keeping a minimum haplotype frequency threshold of 0.001 revealed rs4947332, rs494620, rs12661281 and rs2736428 to be the main drivers for association (Supplementary Table S8 and Figure 4).

Figure 4

SLC44A4 haplotype associations illustrated using graphical assessment of sliding P-values (GrASP v0.82 beta) in (a) NI, (b) Japanese and (c) Dutch populations.

Japanese UC cohort

SLC44A4 coverage on ImmunoChip, QC and LD profile

Of the 22 SNPs, one was not called and one had MAF<0.001. The 20 remaining SNPs, each with HWE P>10−3, MAF >0.001 and 100% genotyping efficiency (Table 2) were taken forward for analysis. Eight SNPs namely rs577272, rs644827, rs644774, rs2242665, rs2242664, rs3132442, rs3130481 and rs3130482 were in LD (r2≥0.99) with each other; NI GWAS index SNP rs2736428 was in LD (r2=0.92) with rs614549 and rs521977 and rs9267659 were also in LD (r2=0.94) with each other (Figure 3).

Allelic association

The NI GWAS index SNP rs2736428 (intron 2) was the most significantly associated (P=3.37 × 10−9), while 16 others showed nominal (P≤0.05) to moderate association (P≤10−5) (Table 2).

Haplotypic association

Of the eight SNPs in LD as mentioned above, rs644827 was selected as proxy as it was an exonic missense variant and seemed more damaging than others on in silico predictions. Of rs2736428 and rs614549 in LD, rs2736428 was retained as it showed more significant association; of the two intronic SNPs rs521977 and rs9267659 in LD, rs9267659 was retained as it seemed more likely to have regulatory effects as predicted on RegulomeDB. 1–11 marker sliding window haplotypes constructed on PLINK generated 66 sliding windows and a total of 316 haplotypes with minimum frequency ≥0.01 (Supplementary Table S9). The threshold P-value of <1.6 × 10−4 was set after Bonferroni correction was applied. A number of haplotypes were found significantly associated. A five-marker haplotype (rs11965547–rs644827–rs494620–rs12661281–rs2736428) was the smallest haplotype (G–G–A–A–A) showing most significant association (P=9.91 × 10−20) with a frequency of 0.47 in cases and 0.35 in controls. 1–11 marker sliding window haplotypes generated using UNPHASED 3.1.5 and GrASP v0.82 beta, keeping a minimum haplotype frequency threshold of 0.001 revealed rs644827, rs494620, rs12661281 and rs2736428 to be the main drivers for association (Supplementary Table S8 and Figure 4).

Dutch UC cohort

SLC44A4 coverage on ImmunoChip, QC and LD profile

Of the 22 SNPs, only one was monomorphic. The 21 remaining SNPs, each with HWE P>10−3, MAF >0.001 and ≥99.8% genotyping efficiency (Table 2), were taken forward for analysis. Nine SNPs namely rs660594, rs577272, rs644827, rs644774, rs2242665, rs2242664, rs3132442, rs3130481 and rs3130482 were in LD (r2≥0.9) with each other; NI GWAS index SNP rs2736428 and rs494620 were in LD (r2=0.78) with rs614549 (Figure 3).

Allelic association

Unlike in the other two populations detailed above, the NI GWAS index SNP rs2736428 (intron 2) showed only nominal association (P≤0.05) along with 15 other SNPs (Table 2).

Haplotypic association

Of the nine SNPs in LD, rs644827, which is non-synonymous, was selected as a proxy; of rs494620 and rs614549 in LD, rs494620, which is exonic, was retained. 1–12 marker sliding window haplotypes constructed on PLINK generated 78 sliding windows and a total of 491 haplotypes with minimum frequency ≥0.01 (Supplementary Table S10). Only three haplotypes crossed the Bonferroni corrected P-value threshold (P≤10−4), namely rs11965547–rs521977 (G–C), rs4947332–rs693906–rs11965547–rs521977 (G–G–G–C) and rs9461727–rs4947332–rs693906–rs11965547–rs521977 (C–G–G–G–C) (each P≈10−5); all three were common haplotypes with a frequency of ~0.6 in cases and ~0.5 in controls. 1–12 marker sliding window haplotypes were also generated using UNPHASED 3.1.5 and GrASP v0.82 beta (Supplementary Table S8 and Figure 4).

In silico analysis of SLC44A4 SNPs

The three missense SNPs, namely rs12661281, rs2242665 and rs644827, were predicted to be benign by both SIFT and PolyPhen2.21 RegulomeDB predicted most of the SNPs to be within transcription factor-binding motifs and to affect protein binding. Some SNPs were also found to have cis-eQTL effects on a number of HLA genes (Supplementary Table S11).21

Discussion

CFB, a component of the alternative pathway of the complement system, emerged as a novel susceptibility gene in the first ever GWAS on UC among NI.13 There is evidence for circulating immune complexes and enhanced production of components of the complement system in IBD in Caucasian populations, suggesting increased complement activation in such patients.34, 35, 36, 37 SLC44A4, a thiamine pyrophosphate transporter, was another of our NI UC GWAS top hits.13 However, neither CFB nor SLC44A4 has been identified in any of the larger Caucasian GWASs,3, 4, 5, 6, 7 their meta-analyses,8, 9 the more recent ImmunoChip analysis10 or the non-European UC cohorts studied to date.11, 12 Such a striking difference across ethnic groups may be due to the inherent statistical limitations of GWAS, which relies mainly on single SNP analysis; incomplete coverage of functional common or rare variants; and poor representation of appropriate proxies on commercial genotyping arrays due to population-specific LD patterns. Other factors include allelic/genetic heterogeneity and varying environmental components, such as the gut microbiome, which is influenced by geographical location and by lifestyle factors such as diet and smoking. Together, these leave much of the disease heritability unexplained. Keeping in view the biological significance of CFB and SLC44A4, we attempted to identify allelic heterogeneity in these two genes by comparing three populations of different ethnic origin, namely NI, Japanese and Dutch.

Of the 28 CFB SNPs present on the ImmunoChip, 14, 13 and 17 were retained after stringent QC in NI, Japanese and Dutch, respectively, while approximately 40% of the CFB SNPs present on the ImmunoChip were monomorphic/uninformative (Table 1) in all three study populations. This reiterates the need for population-specific commercial arrays, which could help to address the black box of missing disease heritability and partially explain the non-replication of European findings in other ethnically distinct populations. The reported NI UC GWAS index SNP rs4151657 within CFB also showed strong allelic association in the Japanese (P=2.02 × 10−8), but only nominal association in the Dutch (P=0.002, Table 1). Notably, a long-range haplotype in the MHC region (25–35 Mb) that includes CFB showed strong association with UC in the Japanese and was considered a single susceptibility locus.25 On the other hand, rs537160 was suggestively significant (P=4.29 × 10−5) in the Dutch cohort, which is indicative of allelic heterogeneity at CFB. Of the remaining SNPs nominally associated (P≤0.05) in any of the three populations, (a) none was common to NI and Dutch; (b) three exonic SNPs and one intronic SNP were shared between NI and Japanese; and (c) two intronic SNPs were shared between the Japanese and Dutch cohorts (Table 1). These findings suggest that trans-ethnic fine-mapping efforts using high-density genotyping/sequencing can aid causal variant identification in complex disease research and may identify population-specific determinants. Genuine contribution of these alleles to UC may derive further support from the observed absence of LD between these markers in these two populations (Figure 1). Such allelic heterogeneity across distinct ethnic populations is not unexpected and has, for example, already been demonstrated for NOD2 in our previous study on UC patients from north India.38

Considering the associated SNP may not be the only or predominant determinant of the respective gene function and other SNPs in the gene, singly or in haplotypic combinations may contribute to the phenotype, we next estimated haplotypic diversity across the three populations. It may be mentioned that previous association studies have demonstrated high-risk haplotypes for various complex disorders, for example, a rare haplotype within CFH was found associated with age-related macular degeneration39 and haplotypes within STAT4 were found associated with systemic lupus erythematosus.40 Our haplotypic association results further reaffirm these findings. A minimal three-marker haplotype within CFB namely rs17201431–rs2072634–rs4151657 was shared across NI and Japanese (P<10−8), but a different five-marker haplotype namely rs4151651–rs4151652–rs17201431–rs512559–rs537160 was significantly associated (P=2.07 × 10−6) in the Dutch population after Bonferroni corrections. However, in NI and Japanese populations, the association seems to be driven mainly by rs4151657, the India GWAS index SNP and in the Dutch population, it is rs537160 (Figure 2), also identified in allelic association (Table 1). It is also noteworthy that the haplotypes associated in NI (0.41 in cases and 0.31 in controls) or Japanese (0.52 in cases and 0.42 in controls) and Dutch (0.3 in cases and 0.35 in controls) cohorts are rather common. As for the likely role of these two driver SNPs namely rs4151657 and rs537160, they may be involved in regulation of gene expression through transcription factor binding, as predicted by various in silico tools (Supplementary Table S6).

The NI UC GWAS index SNP rs2736428 within SLC44A4 was significantly associated in the Japanese (P=3.37 × 10−9) but only nominally associated (P=0.002) in the Dutch cohort. In addition, 11 of the 22 SNPs within SLC44A4 showed nominal association in all three ethnic groups (Table 2), and most of these were predicted to have regulatory effects (Supplementary Table S11). Allelic as well as haplotype associations revealed similar patterns in Indians and Japanese, but a different pattern in the Dutch (Supplementary Table S8 and Figure 4), suggesting genetic heterogeneity between these populations.

Taken together, our findings provide evidence of allelic heterogeneity in CFB and genetic heterogeneity in SLC44A4, both biologically relevant genes for UC, and illustrate the utility of trans-ethnic studies. These observations reiterate the contemporary need for fine mapping of known loci and trans-ethnic comparisons for identification of common and unique risk variants. This in turn would have implications for predictive medicine and for further understanding of disease biology.