Introduction

Celiac disease (CD, MIM: 212750) is a chronic autoimmune disorder caused by intolerance to dietary gluten that develops in genetically susceptible individuals. It is a common disease (around 1% of the population) characterized by the presence of autoantibodies against tissue transglutaminase and by villous atrophy, crypt hyperplasia and lymphocytic infiltration of the small intestinal mucosa. The major histocompatibility complex region on 6p21 harbors the major contributors to CD risk: in Caucasians, HLA-DQ2/-DQ8 heterodimers are present in >90% of CD patients, but also in around 30% of the general population, so HLA alone cannot explain the entire genetic component.1 Genome-wide association studies (GWAS) and the Immunochip project identified 57 association signals from 39 loci that together contribute 5–7% to the genetic risk.1, 2 More recently, new association signals have been detected in the major histocompatibility complex region, increasing the proportion of heritability explained so far to up to 48%.3 However, the effect of rare, coding variants within the Immunochip genes is minimal,4 and thus the remaining genetic component related to CD should still reside, in part, in common and known variants.

Despite the progress made, it has proven difficult to reconcile the results from association analyses across different populations, and to square SNP association results with expression levels of cis-located genes in patient tissues.5, 6 This limited success could be partly owing to a degree of genetic heterogeneity within CD, so that not every associated SNP is relevant to all CD cases. Random effects modeling has recently shown that SNPs reported to be associated with the disease (rs1050976C>T in IRF4, hg38 chr6:g.408079C>T and rs11851414:C>T in ZFP36L1, hg38 chr14:g.68792785T>C) would not have reached the significance threshold if heterogeneity among the different collections analyzed in the Immunochip had been accounted for.7 In the original analysis, a covariate was introduced to indicate collection membership, but possible heterogeneity within each collection was not modeled. We believe that heterogeneity within each of the Immunochip cohorts could be stronger than has been assumed. In the present work, we propose taking into consideration the (immuno)genomic background of each individual (revealed by the Immunochip itself) rather than geographical origin, as an alternative strategy for disease association analysis of the Immunochip data.

Subjects and Methods

We reanalyzed the 139 553 SNPs from the Immunochip in 12 041 CD patients and 12 228 non-celiac controls. To stratify individuals according to their genetic background, we first detected 8537 conserved LD blocks of SNPs using Plink8 and selected one random SNP from each block. These 8537 SNPs were used to calculate the possible number of ancestries using Admixture,9 and the optimal number was set to 30 because it was the first K with a lower cross-validation error than the next K (Supplementary Figure S1). We then assigned each individual to 1 of the 30 immunogroups (named this way because they are based on the Immunochip SNPs), according to their major ancestry component (Supplementary Figure S2). Immunogroup sizes ranged from 19 to 4178 individuals (Supplementary Table S1), and the groups contained celiac and control individuals from different geographical origins, except for one in which all the samples of Indian origin clustered (Supplementary Figure S3), stressing the limitations of the Immunochip for the genetic analysis of non-European populations.10 Finally, we performed an association analysis, correcting for stratification across the 30 immunogroups, using the Cochran–Mantel–Haenszel test implemented in Plink.8 We set the significance cutoff to P<7.02 × 10−07 (0.05/71 208), as there were 71 208 independent tests (8537 LD blocks plus 62 671 SNPs outside them), as calculated previously.10 The results of the association study are available at GWAS Central (http://www.gwascentral.org/study/HGVST1839).
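The stratification and testing steps described above can be illustrated with a minimal Python sketch. It is not the Plink or Admixture implementation: the function names are hypothetical, the Cochran–Mantel–Haenszel statistic is assumed to be computed on 2×2 allele-count tables (cases/controls by minor/major allele) without continuity correction, and the cross-validation values in the usage note are invented for illustration only.

```python
import math


def choose_k(cv_error_by_k):
    """Return the first K whose cross-validation error is lower than that of the next K."""
    ks = sorted(cv_error_by_k)
    for k, next_k in zip(ks, ks[1:]):
        if cv_error_by_k[k] < cv_error_by_k[next_k]:
            return k
    return ks[-1]  # no turning point found: fall back to the largest K tested


def assign_immunogroup(ancestry_fractions):
    """Index of the major ancestry component in one individual's row of ancestry fractions."""
    return max(range(len(ancestry_fractions)), key=lambda k: ancestry_fractions[k])


def cmh_test(tables):
    """Cochran-Mantel-Haenszel chi-square (1 df) over per-stratum 2x2 tables.

    Each table is (a, b, c, d): case minor, case major, control minor,
    control major allele counts within one immunogroup (assumed layout).
    """
    deviation = 0.0  # sum over strata of observed minus expected for cell a
    variance = 0.0   # sum over strata of the hypergeometric variance of cell a
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        deviation += a - row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = deviation * deviation / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return chi2, p_value


def bonferroni_cutoff(n_tests, alpha=0.05):
    """Significance cutoff for n_tests independent tests."""
    return alpha / n_tests
```

For example, `bonferroni_cutoff(8537 + 62671)` gives approximately 7.02 × 10−07, the cutoff used in this study, and `choose_k({28: 0.52, 29: 0.51, 30: 0.50, 31: 0.505})` returns 30 for those made-up error values.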

The expression of 14 protein-coding genes was measured in intestinal biopsies from 15 CD patients at the time of diagnosis and after >2 years on a gluten-free diet (GFD), and from 15 non-celiac controls (Supplementary Table S3). CD was diagnosed according to the ESPGHAN criteria. The study was approved by the Cruces University Hospital and Basque Clinical Trials and Ethics Committees (CEIC-E09/10 and PI2013072), and biopsies of distal duodenum were obtained by endoscopy after informed consent from all subjects or their parents. Total RNA was extracted using the NucleoSpin microRNA kit (Macherey-Nagel, Düren, Germany) and converted to cDNA using the AffinityScript cDNA Synthesis kit (Agilent Technologies, Santa Clara, CA, USA). Gene expression was analyzed using Fluidigm Biomark 48.48 dynamic arrays (Fluidigm Corp., South San Francisco, CA, USA) and commercially available TaqMan Gene Expression assays, including RPLP0 as an endogenous control of input RNA (Thermo Fisher Scientific Inc., Waltham, MA, USA). Relative expression was calculated using the accurate ΔΔCt method and normalized to the average expression value of the control samples. Differences between conditions were tested using nonparametric tests, paired in the case of the comparison between active and treated CD. Gene expression data are available at Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) with accession number GSE84729.
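As an illustration of the relative expression calculation, the following Python sketch applies the standard ΔΔCt arithmetic and normalizes to the control average. It assumes perfect amplification efficiency (a doubling per cycle), which may differ from the exact variant used here; the function names and Ct values are hypothetical.

```python
def relative_quantity(ct_target, ct_reference):
    """2^-(delta Ct): target expression relative to the endogenous control (e.g. RPLP0)."""
    return 2.0 ** -(ct_target - ct_reference)


def normalize_to_controls(sample_cts, control_cts):
    """Express each sample relative to the average expression of the control samples.

    Both arguments are lists of (ct_target, ct_reference) pairs, one pair per biopsy.
    """
    control_values = [relative_quantity(t, r) for t, r in control_cts]
    control_mean = sum(control_values) / len(control_values)
    return [relative_quantity(t, r) / control_mean for t, r in sample_cts]
```

With two hypothetical controls at (25, 20) and a patient sample at (26, 20), `normalize_to_controls([(26, 20)], [(25, 20), (25, 20)])` returns `[0.5]`: one extra cycle to reach threshold corresponds to half the control expression.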

Results and Discussion

There were 4881 SNPs significantly associated with CD (Supplementary Figure S4), of which 636 had not been detected in the original analysis (Supplementary Figure S5). Only one previously associated region was not detected: hg38 chr14:68,792,689–68,805,252 (P=3.146 × 10−06), where ZFP36L1 had been proposed as the putative candidate gene, a region that was also ‘missed’ by the Random Effects study.7 A total of 500 novel SNPs were located in five previously unidentified genomic regions or close to or within four previously known loci, extending two of them; there were also several isolated signals defined by one or a few significant SNPs (Table 1, Supplementary Table S2). The most strongly associated novel region (hg38 chr2:134,533,564–136,169,524) contains markers and genes that have been associated with type 2 diabetes11 (Figure 1a).

Table 1 Summary of relevant CD-associated regions identified or extended in this study (excluding the MHC)
Figure 1 The novel hg38 chr2:134,533,564–136,169,524 region that is associated with celiac disease. (a) Graphical representation of the region made using the Locuszoom online tool (http://locuszoom.sph.umich.edu/locuszoom/). (b) Expression analysis of selected genes from the region. REAC: expression relative to control average; below each gene, Mann–Whitney test P-values, when significant. (c) Spearman correlation analysis of expression levels.

In this region, CCNT2 and R3HDM1 showed decreased expression in CD patients compared with controls, both at diagnosis and on GFD (Figure 1b), pointing to a constitutive defect. CCNT2 is a cyclin that is involved in the cell cycle and RNA transcription. R3HDM1 is a poorly characterized gene that could have a poly(A) RNA-binding function. The expression of the aspartate-tRNA ligase gene DARS was also reduced in patients, although the difference was not significant in active disease. In addition, the expression of genes within the region was strongly correlated among CD patients but not in controls (Figure 1c), suggesting common, disease-dependent regulatory mechanisms in the region, as has been previously shown.12, 13 The lactase gene LCT showed a pronounced decrease in expression in active CD that recovered after GFD treatment (Supplementary Figure S6), consistent with the lactose intolerance observed in active CD. The other novel regions identified (Table 1) contain genes relevant to the immune response associated with allergy (IL21R),14 Crohn’s disease15 and psoriasis16 (IL23R and IL12RB2).
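The coexpression analysis in Figure 1c relies on Spearman correlation, which can be sketched in Python as a Pearson correlation computed on ranks. This is a minimal illustration with hypothetical function names, not the software used for the figure; ties receive average ranks.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation between two equally long lists of expression values."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0.0 or sd_y == 0.0:
        return float("nan")  # correlation is undefined when one variable is constant
    return cov / (sd_x * sd_y)
```

Applied separately to patients and controls for each pair of genes in the region, a function of this kind yields the two correlation patterns that are contrasted in the text.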

Our analysis also identified novel SNPs in previously known regions, extending two of them (Supplementary Table S2). In the hg38 chr1:200,901,626–201,054,931 region (Supplementary Figure S7A), C1orf106 is the proposed candidate gene for CD,2 but there were several associated SNPs that extend the region 3′-wards up to CACNA1S, including KIF21B, a kinesin related to immune-mediated chronic diseases like multiple sclerosis,17 whose expression was significantly increased in active CD (Supplementary Figure S8A).

In the hg38 chr2:60,850,682–61,644,518 region (Supplementary Figure S7B), PUS10 was the proposed candidate gene,2 but our results extend the region 5′-wards to REL and up to XPO1 on the 3′ side. Both genes participate in the NFκB pathway, which is known to be altered in CD.12, 18 The expression of REL (Supplementary Figure S8B), a gene associated with CD,19 was reduced in active CD patients.

There were also seven genes with only one significant SNP (Supplementary Table S2), in some cases because those regions have a very low SNP density. Among them, we analyzed the expression of SORD, a gene involved in the interconversion of polyols that has been related to type 2 diabetic retinopathy,20 and found that its expression was significantly lower in CD individuals (Supplementary Figure S8C). Finally, a number of novel SNPs were located in intergenic regions, but functional analyses will be necessary to determine their possible role in disease susceptibility.

In conclusion, the immunoancestry-based analysis of the Immunochip data has allowed us to discover novel regions associated with CD that harbor genes that are functionally altered in patient intestinal mucosa. We believe that this type of stratified analysis is applicable to other large-scale genotype data from complex disease association studies and will help to find novel susceptibility genes and to reconcile genotype and expression data.