Introduction

Celiac disease (CD) is a prevalent inflammatory disorder of the small intestine with a multifactorial etiology. Patients suffering from the disease are intolerant to wheat gluten and related cereal proteins of rye and barley. There is a clear correlation between the disease and the human leukocyte antigen (HLA)-DQ status; 90–95% of all patients carry the gene pair encoding the DQ2 heterodimer (DQA1*05/DQB1*02), whereas most of the remaining patients are positive for DQ8 (DQA1*03/DQB1*0302).1 The DQ2 and DQ8 molecules confer susceptibility by binding and presenting gluten-derived peptides to CD4+ T cells in the small intestine. The chronic inflammation is accompanied by villous atrophy and crypt hyperplasia. Although the association with HLA is strong in CD, there are strong indications that non-HLA genes also contribute to disease susceptibility. One indication is the high frequency of the DQ2 heterodimer among healthy individuals (20–30%), but the chief argument is the large difference in concordance rates between monozygotic twins and HLA-identical siblings.2

Several genome-wide linkage screens have pointed out non-HLA candidate regions; some regions have shown linkage in several studies whereas others have not. The chromosomal region 5q31–33 was initially pointed out in Italian cohorts,3, 4 and thereafter in three subsequent linkage studies.5, 6, 7 In addition, a pooled analysis using raw data from four independent genome scans, including the Swedish/Norwegian data, further confirmed linkage to this region.8 Apart from the HLA region, 5q31–33 was the only region in this study that reached the genome-wide level of significance as defined by Lander and Kruglyak,9 with maximum linkage at marker D5S640 (Zlr=4.39, P-value=6 × 10⁻⁶). Hence, 5q31–33 can be considered a confirmed linked region. A substantial increase of the linkage signal, with a maximum Zlr score of 4.6 at marker rs1972644 (P-value=2 × 10⁻⁶), was evident when linkage analysis in our Scandinavian cohort was refined with more densely spaced markers (both microsatellites and single nucleotide polymorphisms (SNPs)) (Adamovic et al, unpublished data). Marker rs1972644 is located approximately 1.8 Mb centromeric of D5S640. In the same study, several of the included markers demonstrated nominally significant genetic association with CD. However, no susceptibility gene(s) could be identified convincingly.

To identify the CD susceptibility gene(s) located at 5q31–33, we have performed an extensive SNP association screen of the 16 Mb region which defines the 95% confidence interval (CI) (position 131.895.324–148.053.211 bp) of the linkage peak found in our Swedish/Norwegian CD cohort. From our available multiplex cohort, we included only families who had previously shown genetic linkage to the region, that is 97 families. A total of 1404 SNPs were successfully genotyped, and the genotypes of these SNPs were used for single- and multipoint association analysis.

Materials and methods

Subjects

From the previously described 106 unrelated Swedish/Norwegian CD multiplex families (families with two or more affected children) used in the initial genome-wide linkage screen,6 we included 97 families in the current SNP screen. A more detailed description of these 106 multiplex families is available elsewhere.10 The 97 families were selected based on the identical-by-descent (IBD) status for each individual sib-pair. The IBD status for the entire 5q31–33 region was defined using genotype information available from the previously genotyped markers used in the genome-wide linkage screen6 and the association study (Adamovic et al, unpublished data). Allegro v2 software11 was run to extract phased haplotypes that were used to determine the IBD status for each family. All families in which the sib-pairs did not share at least 80% of the 5q31–33 region IBD for at least one of the chromosomes were defined as 0 IBD families and were excluded from this study. All members of the 97 selected families were genotyped in our screen, except the affected siblings of the probands from the 21 families that showed two IBD over the whole region (based on the assumption that genotyping both sibs would not provide additional information since they share identical genotypes). In total, 372 individuals were genotyped. All patients fulfilled the European Society for Paediatric Gastroenterology and Nutrition diagnostic criteria.12
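The family selection rule can be illustrated with a short sketch. It is not the procedure run on the Allegro output: the input layout (one sib-pair per family, summarised as the fraction of the region shared IBD on the paternal and on the maternal chromosome), the family identifiers and the function name are hypothetical, and applying the same 80% cut-off to call a family "two IBD" is an assumption made only for illustration.

```python
def count_shared_chromosomes(frac_paternal, frac_maternal, threshold=0.8):
    """Number of parental chromosomes (0, 1 or 2) that a sib-pair shares IBD
    over at least `threshold` of the region. The 0.8 default mirrors the 80%
    rule in the text; using it for the 2-IBD call is an assumption."""
    return int(frac_paternal >= threshold) + int(frac_maternal >= threshold)

# Hypothetical families: (fraction shared on paternal, on maternal chromosome)
families = {
    "fam01": (0.95, 0.10),  # shares one chromosome -> kept
    "fam02": (0.30, 0.55),  # shares neither -> "0 IBD", excluded
    "fam03": (1.00, 0.92),  # shares both -> kept, only the proband genotyped
}

ibd_class = {fam: count_shared_chromosomes(p, m) for fam, (p, m) in families.items()}
included = sorted(fam for fam, k in ibd_class.items() if k >= 1)
proband_only = sorted(fam for fam, k in ibd_class.items() if k == 2)
```

In this toy input, `fam02` is excluded as a 0 IBD family, and `fam03` falls into the group where only one affected sibling would be genotyped.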

Definition of the chromosomal region and the SNP selection

The region of interest was limited to the region representing the 95% CI of the linkage peak at marker rs1972644. The 95% CI for the linkage peak was calculated by bootstrapping from the family scores (calculated using the Allegro v2 software11) and was found to cover a 16 Mb interval between chromosomal position 131.895.324 and 148.053.211 bp (data not shown). In the present study, we selected 1536 tag SNPs distributed within this 95% CI, of which 1372 SNPs were successfully genotyped. Tag SNP selection was performed by first obtaining SNP genotype information for Centre d'Etude du Polymorphisme Humain individuals (Utah residents with ancestry from northern and western Europe) from the HapMap phase I database release #16b (http://www.hapmap.org/). Of the 36551 SNPs reported by Illumina to be located within the interval of interest, HapMap phase I provided genotype information for 5516 SNPs. An assay score value for each of these SNPs was provided by the Illumina SNP service. This score indicates the likelihood that an individual SNP will be successfully genotyped in the Illumina assay plexing system. Therefore, following the Illumina service recommendation, only SNPs with an assay score >0.6 were considered for the final tag SNP selection (4733 of the 5516 SNPs with genotype information fulfilled the assay score criterion). The genotype information for these 4733 SNPs was used for further tag SNP selection with the Tagger function implemented in the Haploview version 3.32 software13 (using r²>0.9 and minor allele frequency (MAF) >0.05) and the pairwise tag SNP selection method.14 In total, 1536 tag SNPs were identified using these criteria.
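The idea behind pairwise tagging can be sketched as a greedy set cover. This is a simplified illustration, not Haploview's Tagger implementation: it assumes phased haplotypes coded 0/1 per SNP, and the SNP names and toy data are hypothetical. Only the thresholds (r²>0.9, MAF>0.05) are taken from the text.

```python
from itertools import combinations

def maf(alleles):
    """Minor allele frequency of a 0/1-coded list of alleles."""
    p = sum(alleles) / len(alleles)
    return min(p, 1 - p)

def r_squared(a, b):
    """Pairwise r^2 between two SNPs from phased 0/1 haplotypes."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x & y for x, y in zip(a, b)) / n   # frequency of the 1-1 haplotype
    d = pab - pa * pb                            # LD coefficient D
    denom = pa * (1 - pa) * pb * (1 - pb)
    return d * d / denom if denom > 0 else 0.0

def greedy_pairwise_tags(haps, r2_min=0.9, maf_min=0.05):
    """Pick tags until every common SNP is itself a tag or has r^2 > r2_min
    with a tag. Greedy choice: the SNP capturing most untagged SNPs."""
    snps = [s for s, alleles in haps.items() if maf(alleles) > maf_min]
    captures = {s: {s} for s in snps}
    for s, t in combinations(snps, 2):
        if r_squared(haps[s], haps[t]) > r2_min:
            captures[s].add(t)
            captures[t].add(s)
    untagged, tags = set(snps), []
    while untagged:
        best = max(snps, key=lambda s: len(captures[s] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags

# Hypothetical toy data: eight phased haplotypes, four SNPs
haps = {
    "snpA": [0, 0, 0, 0, 1, 1, 1, 1],
    "snpB": [0, 0, 0, 0, 1, 1, 1, 1],  # in perfect LD with snpA
    "snpC": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of snpA
    "snpD": [0, 0, 0, 0, 0, 0, 0, 0],  # monomorphic, removed by the MAF filter
}
tags = greedy_pairwise_tags(haps)
```

Here `snpA` captures `snpB`, so two tags suffice for the three polymorphic SNPs. A stricter r² threshold, as used in this study, yields more tags per region than the commonly used r²=0.8.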

SNP genotyping and quality control

Illumina bead array SNP genotyping (Illumina, San Diego, CA, USA) was performed at the Wallenberg Consortium North SNP platform (University of Uppsala, Uppsala, Sweden). Recommended control samples were included in each run. Pedcheck15 was run to reveal Mendelian misinheritance, and the SNPs were examined for deviation from Hardy–Weinberg equilibrium. In the statistical analysis we included 30 previously genotyped SNPs (Adamovic et al, unpublished data), giving a total of 1404 SNPs. The SNPs were distributed throughout the region with an average spacing of approximately 11 kb. Five gaps of 110 kb were seen; 66% of the SNPs were located less than 10 kb apart, 31% within the range of 10–50 kb and 2% within the range of 50–100 kb apart.
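The text does not state which test was used for Hardy–Weinberg equilibrium. As an illustration only, a common choice is the one-degree-of-freedom chi-square goodness-of-fit test on genotype counts, sketched below with hypothetical counts. In family data such a test would normally be restricted to unrelated individuals (for example the parents), which is also an assumption here.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) for deviation from Hardy-Weinberg proportions.
    Returns (chi2, p_value) for the observed genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: exact HWE proportions, and a complete heterozygote deficit
chi2_ok, p_ok = hwe_chi_square(25, 50, 25)
chi2_bad, p_bad = hwe_chi_square(50, 0, 50)
```

The first example gives a chi-square of zero, while the second, with no heterozygotes at an allele frequency of 0.5, deviates strongly and would be flagged.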

Statistics

Based on the linkage signal seen in our population, we performed statistical calculations to predict the size of the association signals expected in our material under the assumption of one gene effect and one founder. Using the estimated IBD sharing probabilities (z0, z1, z2) generated by Genehunter v2.116 at the linkage peak, and assuming a single disease allele frequency (P), the penetrances (f0, f1, f2) were derived by numerically solving the nonlinear equations that relate these quantities. The expected genotype and transmission configurations were calculated by application of Bayes' theorem when the disease model was fully specified through P, f0, f1 and f2. Assuming the existence of one close SNP at linkage disequilibrium (LD)-distance D′=1 from the disease locus (ie no recombination observed between them), we varied the disease allele frequency (0.02–0.3) and the associated marker allele frequency by varying r² values (0.4–1).
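The nonlinear equations are not written out in the text. One standard formulation, assumed here purely for illustration, is Risch's single-locus relations between affected sib-pair IBD probabilities and recurrence risk ratios: z0 = 1/(4λs), z1 = λo/(2λs) and z2 = λm/(4λs), with λs = 1/4 + λo/2 + λm/4. The sketch below gives the forward map from a model (P, f0, f1, f2) to (z0, z1, z2), assuming Hardy–Weinberg proportions, random mating and no recombination with the disease locus; inverting it numerically would recover penetrances compatible with observed sharing. The function names are hypothetical.

```python
def asp_ibd_probabilities(p, f0, f1, f2):
    """Expected IBD sharing (z0, z1, z2) of affected sib-pairs at a single
    disease locus with risk allele frequency p and penetrances f0, f1, f2
    (indexed by the number of risk alleles)."""
    q = 1.0 - p
    k = q * q * f0 + 2 * p * q * f1 + p * p * f2            # prevalence
    va = 2 * p * q * (p * (f2 - f1) + q * (f1 - f0)) ** 2   # additive variance
    vd = (p * q * (f2 - 2 * f1 + f0)) ** 2                  # dominance variance
    lam_o = 1 + va / (2 * k * k)                            # parent-offspring
    lam_m = 1 + (va + vd) / (k * k)                         # monozygotic twins
    lam_s = 0.25 + 0.5 * lam_o + 0.25 * lam_m               # siblings
    return 0.25 / lam_s, 0.5 * lam_o / lam_s, 0.25 * lam_m / lam_s

def relative_risks_from_ibd(z0, z1, z2):
    """Invert Risch's relations: (lambda_s, lambda_o, lambda_m) from sharing."""
    lam_s = 0.25 / z0
    return lam_s, 2 * z1 * lam_s, 4 * z2 * lam_s

# No genetic effect gives the null sharing (0.25, 0.5, 0.25)
null_z = asp_ibd_probabilities(0.1, 0.01, 0.01, 0.01)

# Locus-specific risk ratios implied by sharing of (0.12, 0.47, 0.41)
lam_s, lam_o, lam_m = relative_risks_from_ibd(0.12, 0.47, 0.41)
```

Under these assumptions, sharing of (0.12, 0.47, 0.41) corresponds to a locus-specific sibling recurrence risk ratio of roughly 2.1.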

Single-point association analysis was performed using the family-based association test, FBAT v5.5, assuming an additive risk model.17 Haplotypes comprising all SNPs across the entire region were constructed using the Allegro v2 software.11 Parental haplotypes transmitted IBD to the affected individuals were defined as case haplotypes, whereas the complementary never-transmitted parental haplotypes were defined as control haplotypes. Thereafter, these case and control haplotypes were explored with the HapMiner software v1.1.18, 19 HapMiner is a direct haplotype-mining program that uses a density-based clustering algorithm to assess association. In our analysis we fixed weight 1 (‘the counting measure’, which represents the total number of matching alleles within a given window size) and allowed for clustering of haplotypes with respect to weight 2 only (‘the length measure’, which represents the length of the longest continuous interval of matching alleles around a locus). In addition, we generated haplotypes of different lengths: 7, 15, 19 and 27 SNPs. Depending on whether the haplotype is under- or overrepresented in cases versus controls, either a negative or positive Z-score value is assigned (for simplicity, we refer to the absolute Z-score value (Z-score) in the text).
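The two haplotype similarity measures described above can be sketched directly from their verbal definitions. This is not HapMiner code: the function names, the symmetric window around the locus, and the convention that a mismatch at the locus itself gives a length of zero are assumptions made for illustration.

```python
def window_bounds(n_markers, locus, half_window):
    """Half-open index range [lo, hi) of the window centred on `locus`."""
    return max(0, locus - half_window), min(n_markers, locus + half_window + 1)

def counting_measure(h1, h2, locus, half_window):
    """Total number of matching alleles within the window."""
    lo, hi = window_bounds(len(h1), locus, half_window)
    return sum(a == b for a, b in zip(h1[lo:hi], h2[lo:hi]))

def length_measure(h1, h2, locus, half_window):
    """Length of the longest continuous run of matching alleles that
    contains the locus (0 if the alleles at the locus differ)."""
    lo, hi = window_bounds(len(h1), locus, half_window)
    if h1[locus] != h2[locus]:
        return 0
    left = right = locus
    while left - 1 >= lo and h1[left - 1] == h2[left - 1]:
        left -= 1
    while right + 1 < hi and h1[right + 1] == h2[right + 1]:
        right += 1
    return right - left + 1

# Two hypothetical 7-SNP haplotypes (alleles coded 1/2), compared at the middle SNP
h1, h2 = "1221121", "1221211"
count = counting_measure(h1, h2, locus=3, half_window=3)   # 5 matching alleles
length = length_measure(h1, h2, locus=3, half_window=3)    # run covers SNPs 0-3
```

The example shows why the two measures differ: five of the seven alleles match, but only four of them form an unbroken stretch around the locus, which is the quantity expected to reflect sharing of an ancestral segment.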

Results

Genotype quality control

Only two Mendelian inconsistencies were revealed in the data set, which indicates highly accurate genotyping. The genotypes in the involved families were removed. SNPs that appeared nonpolymorphic (n=108) or that did not reach the genotype success score threshold of 90% (n=56) were removed from the data set before analysis. After exclusion of such SNPs, 1372 of the 1536 tag SNPs remained.

Prediction of association signal strength based on linkage results

The IBD sharing probabilities (z0, z1, z2) were estimated to be (0.12, 0.47, 0.41). Using these probabilities, we estimated the strength of the association signal we would expect if the linkage peak were caused by one gene effect, assuming one founder. In the examined models compatible with the IBD sharing probabilities, the P-values for the expected outcomes in single SNP analysis varied between 10⁻¹⁷ and 10⁻⁶ (data not shown). The positive single- and multipoint association signals we obtained in this study were less significant (uncorrected P-value (Pnc-value) in the range of 0.05–0.001). Therefore, the observed strength of the association signals seen in the SNP screen was not within the range of what we could expect to observe if the linkage signal were due to one gene effect, one founder and r²>0.4 between the test marker and the disease locus.

Single-point association analysis

The single-point association analysis revealed 59 SNPs that showed association with a Pnc-value below 0.05 (Supplementary Table 1). In total, eight SNPs displayed a significant Pnc-value<0.01; of these, the two most strongly associated SNPs were located within the Jak and microtubule-interacting protein 2 (JAKMIP2) gene (rs12653715, observed/expected number of alleles (S/E(S))=5/16.5; Pnc=0.0019 and rs12655012, S/E(S)=5/15.5; Pnc=0.0032; Supplementary Table 1). Except for two SNPs, all the associated SNPs (Pnc<0.05) were located within two broad regions: region 1, 133.913.277–136.848.899 (3 Mb) and region 2, 141.997.428–147.856.177 (6 Mb) (Figure 1).
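The observed and expected allele counts, S and E(S), can be illustrated with a minimal family-based statistic under additive coding. The sketch assumes independent trios with both parents genotyped, in which case the squared Z-score equals the classical transmission disequilibrium test statistic. It is not the FBAT program, which also handles sibships, missing parents and an empirical variance under linkage; the function name and toy data are hypothetical.

```python
import math

def additive_trio_statistic(trios):
    """trios: iterable of (father, mother, affected_child) genotypes, each coded
    as the number of copies (0, 1 or 2) of the allele being tested.
    Returns (S, E(S), Z, two-sided P)."""
    s = e = v = 0.0
    for father, mother, child in trios:
        s += child                       # observed allele count in the child
        for parent in (father, mother):
            t = parent / 2.0             # Mendelian transmission probability
            e += t                       # expected contribution to S
            v += t * (1.0 - t)           # variance (non-zero for heterozygotes only)
    z = (s - e) / math.sqrt(v) if v > 0 else 0.0
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return s, e, z, p

# Hypothetical data: ten heterozygous fathers, mothers lacking the allele;
# the allele is transmitted to 2 affected children and not transmitted to 8
trios = [(1, 0, 0)] * 8 + [(1, 0, 1)] * 2
s, e_s, z, p = additive_trio_statistic(trios)
```

In this toy example S=2 and E(S)=5, so the allele is under-transmitted and Z is negative, the same direction as the S/E(S) values reported for the two JAKMIP2 SNPs.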

Figure 1

Single-point association for all tested single-nucleotide polymorphisms (SNPs). Each dot represents the −log10 of the P-value generated with FBAT statistics for a single SNP at its chromosomal location. The chromosomal positions correspond to the human July 2003 (hg16) assembly of the University of California Santa Cruz database (http://genome.ucsc.edu/).

Haplotype association analysis

We performed haplotype association analysis with HapMiner using four different haplotype lengths (7, 15, 19 and 27 SNPs). In all four analyses, numerous haplotypes displayed moderately significant association, with Z-scores between 2.0 and 3.0 (corresponding to a Pnc-value within the range of 0.05–0.003) (Supplementary Table 2). In fact, only 18 haplotypes, located within seven regions, demonstrated stronger association than that seen for the single SNPs, displaying a Z-score>3.5 (corresponding to Pnc<0.0005) with any haplotype length (Figure 2). A detailed description of these 18 haplotypes is given in Table 1. Overall, the analyses of haplotypes of various lengths reflected the same associated haplotypes within each of the seven candidate regions. All 59 single-point-associated SNPs were present on associated haplotypes, and 15 of these SNPs colocalized to the seven candidate regions. Consequently, most of the associated single SNPs did not display stronger association as part of a haplotype. For instance, the two most strongly associated SNPs, located within the JAKMIP2 gene, did not display stronger association as a haplotype. Only one of the seven regions did not display association at the single SNP level (region 2 in Table 1). The strongest haplotype associations were seen within three chromosomal regions. The associated haplotypes with the highest Z-scores (between 3.92 and 4.04, corresponding to a Pnc-value of 8.8 × 10⁻⁵ to 5.2 × 10⁻⁵) include region 1 covering the hypothetical protein FLJ23312 (AK26965), region 3 harboring the Rho GTPase-activating protein 26 (ARHGAP26) gene and region 4 with the minor histocompatibility antigen HB-1 (Table 1).
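The correspondences between Z-scores and Pnc-values quoted above are consistent with a two-sided tail of the standard normal distribution. The sketch below assumes that conversion; the text does not state how HapMiner itself derives P-values, and the function name is hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided P-value for a standard normal Z-score: P(|N(0,1)| > |z|)."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Z-score cut-offs mentioned in the text
p_values = {z: two_sided_p(z) for z in (2.0, 3.0, 3.5, 3.92, 4.04)}
```

Under this assumption, Z=3.0 maps to approximately 0.003 and Z=3.5 to just under 0.0005, matching the ranges given for the haplotype results.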

Figure 2

Associations of haplotypes of various lengths (7, 15, 19 and 27 single-nucleotide polymorphisms (SNPs)) generated by HapMiner. Each dot represents the absolute Z-score (Z-score) value of one haplotype. The chromosomal positions correspond to the human July 2003 (hg16) assembly of the University of California Santa Cruz database (http://genome.ucsc.edu/).

Table 1 The strongest associated haplotypes within the seven candidate regions obtained by HapMiner (Z-score>3.5)

Discussion

In this study we have performed an extensive SNP association screen testing 1404 SNPs within the 95% CI of the linkage peak at chromosome 5q31–33 previously obtained in our Swedish/Norwegian CD family cohort. We were unable to identify an association signal for any of the markers that alone could explain the linkage signal observed in the same patient cohort.

Under the assumption of a single gene effect and one founder mutation, it is possible to predict the strength of the association signal that would explain the observed magnitude of the linkage peak in our material. This prediction indicates a single-point association signal with a P-value less than 10⁻⁶. It should, however, be noted that this prediction has some limitations. The calculation is based on one causal gene variant located at the maximum of the linkage peak, and assumes complete LD between the marker and the disease variant (D′=1) with varying r² (0.4–1). Therefore, if the disease gene is located further from the peak, or if the risk variant is less correlated with the tested markers, the predicted signal would be weaker. Moreover, our prediction does not take into account the uncertainty in the IBD sharing probabilities, as this would require much more extensive calculations. Linkage between chromosomal region 5q and CD has been confirmed in several studies.3, 4, 5, 6, 7, 8 This speaks against a false-positive linkage signal in our cohort. It is possible, however, that the linkage signal in our cohort is overestimated, which would lead to a falsely inflated magnitude of the predicted association signals. Despite these constraints in our predictions, there appears to be a striking discrepancy between the predicted association signal and the signals we observe. One obvious explanation for this discrepancy could be that this region contains several susceptibility genes that collectively contribute moderate risk to CD susceptibility; such effects would be difficult to detect with our limited sample size.

Inability to detect ‘the true association signal’ could be another reason for not achieving a strong association signal in this study. The HapMap project provides a description of the LD structure of the human genome so as to minimize the number of tag SNPs needed for association studies. It is important to note that HapMap phase I provides genotype data mostly for common SNPs with MAF>0.05, selected in accordance with the ‘common disease/common variant’ hypothesis. Owing to this underlying selection bias toward high-frequency SNPs, tag SNPs selected from HapMap data have been shown to have limited capacity to tag rare variants (variants with MAF<5%).20, 21 Possible risk factors caused by low-frequency variants would thus easily be overlooked. However, the most important parameters that influence how well tag SNPs cover genetic variation are SNP density and the pattern of LD. Tag SNPs selected in high-LD regions are robust to variable marker density whereas those selected in low-LD regions are not.20, 22 In the phase I project, genotype information for SNPs at a density of one SNP per 5 kb is available. This spacing has been estimated, and later shown, to enable sufficient coverage of approximately 75–80% of the genome if tag SNPs are selected using the pairwise algorithm and r²=0.8.20, 23 Because of marker density and local LD pattern, transferability is therefore not straightforward, and tagging performance will vary between chromosomal regions. In our study, tag SNPs were selected from the phase I HapMap data available at that time (release #16b) by applying the pairwise algorithm, r²>0.9 and MAF>0.05. By only including relatively common SNPs, we ensured sufficient power in the study. It is also worth noting that HapMap provided genotypes for about 15% (ie 5516 SNPs) of the total number of SNPs reported within the 95% CI of interest. The HapMap data therefore most likely do not provide coverage of all the genetic variation within our region of interest.

The apparent discrepancy between linkage and association signals that we experienced in this study could also be related to DNA copy number variation (CNV). CNVs represent a huge source of genetic diversity that in many cases will have functional implications.24 CNVs, however, are often located in regions of complex genomic structure that are poorly covered by genotyped SNPs.24, 25 LD between many CNVs and SNPs is often low, which reduces the likelihood of detecting a disease-associated CNV in an association study. Furthermore, SNPs located within CNVs may yield incorrect genotypes, thereby perturbing proper analysis.26 It is interesting that one of the strongest haplotype associations (the haplotypes in region 2 shown in Table 1) is located in the 5 Mb gap where no single SNP associations were seen. This 5 Mb region contains several annotated segmental duplications. Whether the CNVs located within this region have had any influence on our SNP association analysis is, however, unknown.

In common diseases where several genes are thought to contribute modest risk to disease susceptibility, the use of overly conservative correction procedures would lead to increased type II errors rather than facilitate identification of the real disease variant (especially in the moderate-sized sample sets used in most studies).27 As the single-point association signals we obtained were predominantly of moderate significance, correction for multiple testing would clearly not help to discriminate between true- and false-positive association signals. The number of associated SNPs did not exceed the number of false positives expected by chance for this number of tested SNPs (at both the 5 and 1% significance levels), which in theory means that all associations could be false-positive findings. Single-point association studies are statistically weak and have clear limitations. Haplotype association analyses are more powerful as they can tag the founder chromosome more efficiently. This type of analysis is often done to provide additional support for single-marker associations. Our haplotype association analysis revealed seven regions with increased association signals. Except for one region, they all harbored markers with single-point association signals. Notably, however, none of the haplotype association signals would have remained significant after correction for multiple testing. Hence, this analysis did not clarify the identity of the underlying causal variant(s). Whether one or more of these regions are involved in CD genetics should be scrutinized by replication attempts.
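The comparison with chance expectation can be reproduced with simple arithmetic. The sketch below uses the numbers reported in this study (1404 tested SNPs; 59 with Pnc<0.05 and 8 with Pnc<0.01; smallest single-point Pnc of 0.0019). The Bonferroni threshold is shown only as one familiar example of a conservative correction and is not a procedure applied in the study; the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests with P < alpha when no test is truly associated."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test P-value threshold under a Bonferroni correction."""
    return family_alpha / n_tests

n_snps = 1404
observed_hits = {0.05: 59, 0.01: 8}          # counts reported in the Results
expected_hits = {alpha: expected_false_positives(n_snps, alpha)
                 for alpha in observed_hits}  # about 70.2 and 14.0

threshold = bonferroni_threshold(n_snps)      # about 3.6e-5
best_single_point_p = 0.0019                  # rs12653715
survives_correction = best_single_point_p < threshold
```

At both significance levels the observed count is below the expected number of chance findings, and the strongest single-point signal is far from a Bonferroni-type threshold. Because the tag SNPs retain some LD with one another, such a threshold is conservative here.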

In conclusion, our comprehensive association analysis of a region with one of the strongest linkage signals seen in CD has failed to establish markers that demonstrate convincing association with the disease. Collective effects of multiple risk genes within the region, incomplete genetic coverage or effects related to CNV appear to be possible explanations for our findings.