Introduction

Inflammatory bowel disease (IBD) comprises different entities characterized by the presence of chronic inflammatory and relapsing damages in the gastrointestinal tract, especially in the small intestine and in the colon. Its most important subtypes are Crohn’s Disease (CD) and Ulcerative Colitis (UC). The former can be located in any part of the gastrointestinal tract and it is characterized by transmural inflammation; while the latter is usually located in the colon and it is confined to the mucosa. The most common symptoms developed by IBD patients include diarrhoea, anaemia, abdominal pain and weight loss1.

Although its aetiology remains unknown, epidemiological and genetic data suggest that IBD is triggered by environmental factors in genetically predisposed individuals. As a consequence of those factors, there is an excessive inflammatory response that causes the symptomatology. Among the environmental factors, infections and tobacco consumption have been proposed, but considerable uncertainty remains in this area1. The genetic component of IBD has been analysed using genome-wide association studies (GWAS). More than 200 risk loci have been identified in patients of European ancestry and of other ethnicities. In addition, the majority of those risk loci are shared by CD and UC, with similar effects; and, among other signals, some independent signals in the human leukocyte antigen (HLA) region have been previously described2,3,4.

However, those risk loci explain only a minor proportion of the observed heritability of IBD and, as happens in other complex diseases, the prevalence of IBD and the genetic risk variants associated with it vary across populations5,6. For example, the NOD2 gene has been associated with CD in some European populations, but the evidence for association in a Scottish population was lower6. Known biological sources of heterogeneity between populations include differences due to variation in allele frequency (for example, in the NOD2 gene), effect size (for example, the TNFSF15 and ATG16L1 genes) or a combination of both (for example, the IL23R and IRGM genes)3.

The availability of genetic information makes it possible to develop Polygenic Risk Scores (PRS) for IBD. The promise of PRS is the stratification of patients according to their genetic variants and their risk of developing a complex disease. Based on the carriership of risk alleles, an individual can be identified as more prone to develop the disease, with the entailed potential to translate the genetic knowledge into clinical practice7. However, a general theme across complex diseases is that the performance of PRS depends on the population in which they are applied, even among populations of the same ethnicity8,9,10.

The Basque population shows some genetic differences compared with other European populations, probably due to its isolation and the effect of genetic drift. As a consequence of that particular genetic history, the Basque population has retained more of the genetic makeup of populations that lived in Europe in the Neolithic11 or Iron Age12, with less impact from later migrations associated with Steppe pastoralism. For example, the Basque population shows a slightly different frequency of the haplotypes of the HLA region13 which, as mentioned above, is associated with IBD2. Of note, according to the Basque Statistic Institute (https://en.eustat.eus), between 2016 and 2019, in the Basque Autonomous Community (Northern Spain) there were 2804 hospitalizations involving 27,789 days of hospital stays due to IBD.

Our aim with this study is two-fold. First, to characterize for the first time the genetic architecture of IBD in the Basque region, a population that presents genetic particularities within the general European genetic background that has been extensively studied in GWAS for IBD. Second, in order to explore the transferability of genetic risk estimators across populations, we study the performance of European-based polygenic risk scores in the Basque population and thereby assess the utility of genetic information for IBD in clinical practice across different populations.

Results

In the present study we have analysed 498 IBD cases, of which 284 were CD cases and 208 UC cases, and 935 healthy controls (Table 1). We found that the patients with IBD were younger than the controls (41.46 years ± 11.85 vs 51.42 years ± 13.97, respectively; t-test p = 9.11 × 10−38). In addition, the proportion of females was higher in patients with IBD (48.59%) than in controls (32.83%). Regarding the clinical features of the disease, the majority of CD cases had ileal (46.5%) or ileocolonic location (40.8%); and for UC, more than half of the cases had left-sided extension (50.5%, Table 1).

Table 1 Demographics and features of the Basque cohort analysed in the present study.

We first established the genetic background of our cohort and its placement in the context of European populations (Fig. 1A). The genetic background of our cohort overlapped with the Iberian population of the 1000 Genomes Project, although some of the analysed individuals lay at some distance from the core of the Iberian population (Fig. 1A). In more detail, we analysed the first two principal components of the genetic distance between individuals and we did not detect any particular clustering (Fig. 1B). Due to the particular genetic history of the Basque population, we analysed the admixture of our cohort, where two ancestral groups had the lowest cross-validation results. The first two principal components reflected the ancestry component of each individual, placing them along a continuum of mixture between the two inferred ancestral populations (Fig. 1B), and we used that information as a covariate in the GWAS analysis.

Figure 1

Genetic background of the Basque cohort analysed in the present study. (A) Relationship of the Basque cohort within 1000 genomes project European populations, according to Principal Component Analysis. (B) Principal Component Analysis of the Basque cohort, coloured by their ancestry according to Admixture analysis. Graphics were depicted using R language 4.0.5 (https://www.r-project.org) and ggplot2 3.3.5 (https://ggplot2.tidyverse.org).

Genome-Wide association study

In the GWAS we evaluated 5,411,568 SNPs to find differences in allele frequency between patients with IBD (cases) and healthy controls. We found that 41 SNPs had suggestive significance (p < 5 × 10−6) when all IBD cases were analysed, 25 SNPs when only CD cases were analysed and 49 SNPs when only UC cases were analysed. Those SNPs were located in 12, 14 and 12 suggestive loci, respectively (Table 2), for a total of 33 unique loci study-wide. From those signals, we found one genome-wide significant signal in UC (Table 2), in the HLA region (rs41291790, p = 2.9 × 10−8, OR = 5.3). That association, as well as another 3 loci, had been previously associated with IBD or its subtypes (Table 2), according to the PheWAS analysis. Among the genes mapped in the suggestive loci, we found genes previously linked to IBD and its subtypes (such as IL23R, JAK2 or genes located in the HLA region), as well as genes not previously associated with IBD or its subtypes, including, among others, the AGT, BZW2 and FSTL1 genes, located in loci where the lead SNP had an OR of 2.0 (95% confidence interval of 1.5–2.7), 3.2 (2.1–5.1) and 1.5 (1.3–1.8), respectively (Table 2). On the whole, regardless of their significance, the direction of the effect of those suggestive signals was concordant in CD and UC for all the lead SNPs except one (Table 2).

Table 2 Basque IBD GWAS association results and annotation, suggestive loci.

We observed further association in some of those signals with location or extent of disease (Table 3). In the case of CD, 5 loci were more significantly associated with ileal CD than with ileocolonic CD, for example, rs1826333 (ileal CD p = 1.7E−07, ileocolonic CD p = 0.084); while 7 loci were more significantly associated with ileocolonic CD than with ileal CD, for example, rs11129387 (ileal CD p = 0.034, ileocolonic CD p = 7.7E−06). In the case of UC, 8 loci were more significantly associated with left-sided UC than with extensive UC, for example, rs871822 (left-sided UC p = 2.7E−05, extensive UC p = 0.006); while 8 loci were more significantly associated with extensive UC than with left-sided UC, for example, rs17231595 (left-sided UC p = 0.020, extensive UC p = 4.3E−07).

Table 3 Basque IBD GWAS association results in each subtype, suggestive loci.

We further characterized the results through gene-set enrichment analyses and alternative methods for gene mapping. While the genes physically located in the IBD and CD loci did not show any significant enrichment, in UC, due to the markers located in the HLA region, those genes belonged mainly to immunity-related functions, such as innate immune response, interferon gamma mediated signalling or antigen processing and presentation (Supplementary Table S1). However, when we used alternative gene mapping strategies, namely the Depict and S-PrediXcan methods, we did not obtain any significant result after multiple test correction.

Moreover, we examined in our cohort the significant loci reported by the International IBD Genetic Consortium (IIBDGC). On the whole, we observed that few of the lead SNPs located in those loci involved in IBD or its subtypes were nominally significant in the Basque cohort (Supplementary Table S2). In total, we found 25 of those loci nominally significant in IBD, 27 in CD and 23 in UC; and the direction of the effect was consistent between IIBDGC results and our cohort in 21, 23, and 18 loci, respectively (Supplementary Table S2).

Considering the size and the allele frequencies in our cohort, we calculated the statistical power to replicate nominally (p < 0.05) the signals detected in IIBDGC. We concluded that, at p < 0.05, we had power to replicate up to 36, 35 and 24 of those signals for IBD, CD and UC, respectively. Of those signals, we detected a nominal p-value in 24, 25 and 21 loci, respectively. Therefore, the effective replicability rate of IIBDGC signals in the Basque cohort was 67% for IBD, 71.4% for CD, and 87.5% for UC; and we additionally detected a nominal p-value in one signal in IBD, 2 signals in CD and 2 signals in UC for which, theoretically, we did not have enough power.

Finally, we selected some of the most relevant genes well known to be associated with IBD, namely IL23R, ATG16L1, IRGM, TNFSF15, LRRK2 and NOD2, to study in detail the evidence of association in our cohort (Supplementary Table S3). In the case of the IL23R and NOD2 genes, the significance of some SNPs located in those genes was higher when only CD cases were analysed than in all IBD cases; namely rs11209023 in IL23R and rs5743292 in NOD2. The significance of those SNPs in each location of CD (ileal or ileocolonic) was similar for IL23R; while in NOD2 some SNPs were more significant in ileocolonic CD than colonic CD, such as rs5743292 (Supplementary Table S3). When we analysed the SNPs located in the LRRK2 gene, there were SNPs whose significance was higher when all IBD cases were analysed than when each subtype was analysed separately (rs4767970); and their significance was higher in ileal CD than in ileocolonic CD, and in left-sided UC than in pancolitis (rs4767970). In the rest of the genes analysed in detail, namely ATG16L1, IRGM and TNFSF15, we did not find any relevant signal (Supplementary Table S3).

Heritability and genetic correlations

The heritability was estimated using LDSC: the heritability of IBD in our cohort was h2 = 0.579 ± 0.338 and, in the case of the subtypes, the estimate was larger for CD (h2 = 0.773 ± 0.411) than for UC (h2 = 0.464 ± 0.362). Therefore, the Z score of the heritability was 1.71 for IBD, 1.88 for CD and 1.28 for UC, all values below the significance threshold (Z score > 1.96 for p = 0.05).
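As a check on the arithmetic, each Z score is simply the heritability estimate divided by its standard error. The minimal Python sketch below reproduces the three values from the estimates reported above; the two-sided normal p-value is added here for illustration only and is not a result reported by LDSC in this study.

```python
import math

# Heritability estimates and standard errors as reported in the text (LDSC).
estimates = {"IBD": (0.579, 0.338), "CD": (0.773, 0.411), "UC": (0.464, 0.362)}


def z_score(h2, se):
    """Z score of an estimate: the estimate divided by its standard error."""
    return h2 / se


def two_sided_p(z):
    """Two-sided p-value of a Z score under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Map each trait to (rounded Z score, illustrative two-sided p-value).
results = {
    trait: (round(z_score(h2, se), 2), two_sided_p(z_score(h2, se)))
    for trait, (h2, se) in estimates.items()
}

for trait, (z, p) in results.items():
    print(f"{trait}: Z = {z:.2f}, two-sided p = {p:.3f}")
```

All three Z scores fall below 1.96, in line with the statement that none of the heritability estimates reaches nominal significance.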

Regarding the genetic correlation analysis carried out using the LDSC program, we found that IBD and CD GWAS findings from the Basque cohort were significantly correlated with their counterparts from IIBDGC, with a significant regression score: 0.817 ± 0.235 (p = 0.0005) and 0.892 ± 0.235 (p = 0.0001), respectively; while the genetic overlap was not significant in UC (Fig. 2). Furthermore, in the Basque cohort there was a significant correlation between IBD and CD (p = 2.14 × 10−29) and between IBD and UC (p = 0.0001), but not between CD and UC; while in the results from IIBDGC, IBD and its subtypes were all genetically correlated with each other (Fig. 2).

Figure 2

Genetic regression of the results of the present study and their counterparts from IIBDGC, for IBD and its subtypes. Circle size and colour depict regression coefficients. Inside the circle the significance of the regression coefficient, ***p < 0.001, **p < 0.01, *p < 0.05; otherwise, not significant.

In addition, we carried out a genetic correlation analysis with the traits available in CTG-VL and LDHub tools. The top hits were IBD and its subtypes, but after False Discovery Rate correction, we did not find any significant genetic correlation with those traits.

HLA association analysis

In the analysis of HLA imputation using HIBAG, we found 19 HLA alleles associated with IBD, CD or UC (Table 4). Eight of those alleles were significant when all IBD patients were analysed; 10 when only CD patients were analysed; and 9 when only patients with UC were analysed (Table 4). The most significant haplotype was HLA_A_0201 in UC (p = 1.21 × 10−5, OR = 1.99), a signal previously known in UC (Table 4). Among the haplotypes, we found that 7 haplotypes were not previously associated with IBD or its subtypes (Table 4).

Table 4 HLA imputation association results in the Basque IBD cohort, significant alleles.

Application of polygenic risk score

Firstly, we applied to our Basque cohort a set of publicly available polygenic risk scores (PRS) previously derived from GWAS analyses of UK Biobank as described in Khera et al.7 (Fig. 3A) and available through PGS catalog. In total, we could use in our cohort the weights of 5,913,246 SNPs from that PRS model. The Area Under the Curve (AUC) value was 0.69 (Confidence Interval of 95% 0.66–0.72) and the difference of the mean PRS score between IBD cases and controls was significant (t-test p of 6.49 × 10−24).

Figure 3

Polygenic risk score (PRS) analysis of IBD and its subtypes. T-test p, p-value of the T-test comparing the PRS scores of cases and controls. (A) PRS calculated for all Inflammatory Bowel Disease samples using the PRS derived in Khera et al.7. (B) Optimal PRS calculated for all Inflammatory Bowel Disease samples using IIBDGC results as model. (C) Optimal PRS calculated only for Crohn’s Disease samples using IIBDGC results as model. (D) Optimal PRS calculated only for Ulcerative Colitis samples using IIBDGC results as model. (E) Optimal PRS calculated only for Ulcerative Colitis samples, excluding HLA region, using IIBDGC results as model.

Then, in order to derive Basque-specific PRS, we computed polygenic risk scores in the Basque cohort by using summary statistics from the IIBDGC GWAS results, using PRSice-2 (Fig. 3B–E). The best PRS models included 809 SNP markers for IBD (at a p-value threshold of 0.0002), 733 SNPs for CD (p-value threshold of 0.0002) and 303 SNPs for UC (p-value threshold of 5 × 10−05). With the limitation that we used these PRS in the same population used to generate them (lack of an independent replication cohort), the accuracy of the prediction model was higher in IBD and CD, with AUC values of 0.72 (CI of 95% 0.69–0.74) and 0.73 (CI of 95% 0.69–0.76), respectively, than in UC (AUC of 0.68, CI of 95% 0.63–0.72). Accordingly, the difference of the mean PRS score between cases and controls (again from the same cohort) was more significant in IBD and CD (t-test p of 1.70 × 10−33 and 5.50 × 10−25, respectively, Fig. 3B and C) than in UC (p of 3.30 × 10−13, Fig. 3D). Since UC showed a bimodal distribution both in cases and controls, we removed the HLA region from the PRS calculation (Fig. 3E), using 295 SNPs (p-value threshold of 5 × 10−05) in the best model. This led to a distribution resembling normality, but the AUC was lower (0.66, CI of 95% 0.62–0.70) and the comparison of the average scores was less significant (t-test p of 3.93 × 10−11).

Discussion

In the present study we have analysed for the first time the genetic architecture of inflammatory bowel disease (IBD) and its main subtypes, Crohn’s Disease (CD) and Ulcerative colitis (UC), in a cohort from the Basque region. Although the small sample size of our study hampers the discovery of significant signals, our results provide clues about the transferability of genetic findings to European populations not studied to date, especially those with a particular genetic history, such as present-day Basques.

It has been established that the Basque population has been less affected by the admixture processes that shaped the modern European genetic pool, maintaining larger ancestry fractions from the Neolithic11 and the Iron Age12. Indeed, likely composed of “modern Basques”, our cohort reflected such an admixed nature, with the first two PCs possibly reflecting the effect of the mentioned historical processes. Thus, we corrected for those PCs to avoid spurious results in the GWAS analysis due to possible subtle stratification, an approach that has been used successfully in more complex admixed populations14.

The genetic architecture of IBD and its subtypes has been established in different cohorts and populations, mainly of European ancestry3,15. Compared with those studies, the number of patients of each subtype and the location and behaviour of the disease in our cohort were slightly different. For example, in our cohort the inflammatory behaviour of CD represented 67% of the CD cases, while in Cleynen et al. it was 50%. In addition, we have shown genetic differences between the different localizations or extensions of the disease, both in suggestive loci and in SNPs located in different genes. Those differences could be an effect of the sampling, the result of environmental effects16 or a reflection of local genetic differences and, therefore, they could affect our results and our comparison with what is established in IBD and its subtypes.

We have found one genome-wide significant result: rs41291790 in the HLA region in UC, which was previously associated with IBD and its subtypes. The rest of the signals are suggestive, some of them previously associated with IBD or its subtypes; and the overlap between known associated loci3 and significant loci in our cohort was scarce. However, considering the expected replicability in our cohort, we captured 67–87% of the expected signals, suggesting slight differences that could be due to a different genetic architecture or to environmental effects, and that it is important to study different populations to capture all the heterogeneity. In addition, when the whole genetic background is considered, we showed that IBD and CD correlated better with what is known from IIBDGC results3 (rg > 0.8) whereas, in the Basque population, the overlap of UC with European populations was lower. In fact, in IIBDGC results, CD and UC seem to partially share their genetic architecture3, while in our cohort the genetic overlap was not significant. The same can be concluded from heritability analyses: although they were not significant, the heritability of CD was higher than that of UC in our cohort. In addition, on the whole, the direction of the effects of genetic variants in the Basque cohort was concordant between subtypes, and with those from IIBDGC. In the case of loci that were not previously associated with IBD, further replication analyses are needed to establish their relevance. Moreover, and considering all the limitations of our cohort, we were able to detect differences in the effects of suggestive loci depending on the location or extension of the disease, as has been previously described15. Genetic heterogeneity between populations has been previously described in IBD5,6 and, since the genetic background of our population is slightly different from that of other European populations, it is to be expected that there are slight genetic differences, as we have found. 
Therefore, although the sample size of our cohort and its statistical power could be a limitation to discover new strong signals, even more so considering the possible influence from differences in the linkage disequilibrium in the Basques, we were able to detect the main features of the genetic architecture of IBD.

As mentioned, the strongest signals in UC in the Basque population are located in the HLA region, namely the previously mentioned rs41291790, and rs3910312, which are associated with IBD according to the PheWAS analysis. In addition, the strongest HLA allelic association in the Basque cohort (HLA_A_0201) had a higher OR than in IIBDGC results (1.99 in the Basque cohort, 1.14 in IIBDGC results2); and we have detected new HLA alleles that have not been associated with IBD or its subtypes. It is well established that HLA is a genomic region associated with UC and its behaviour2,4 and, therefore, our results are consistent with the involvement of the HLA region in UC. In addition, the frequency of the haplotypes of the HLA region in the Basques13 or Northern Spain17 is slightly different from that in other European populations; and it has been established that the risk haplotypes of HLA in rheumatoid arthritis in Basques were different from those in other populations18, as was also the case for multiple sclerosis19. Thus, the results we obtained in the HLA region in UC are consistent with the observation in other complex diseases that the involvement of HLA alleles is slightly different in the Basque population.

A complementary way to infer the strengths and limits of our results is to inspect individual genes. NOD2 is a gene that is associated with CD, especially with ileal involvement15; it is known to vary in association patterns across populations, even between closely related groups6; and it has been pointed out as the source of the risk of CD in European and non-European admixed populations20,21. Our results, although not genome-wide significant, are consistent with those observations: we found near-suggestive significance of NOD2 in CD and, for some SNPs, more significant results in ileal CD. The LRRK2 gene has been associated with IBD3,22, especially with CD3,22, and with other chronic inflammatory diseases23. In our results it is significant in IBD, and there are no relevant differences between subtypes. The LRRK2 gene is also well known to be a risk gene in Parkinson Disease, and one of the known mutations that confers more risk in that disease has its origin in the Basque population, while that mutation is scarcer in other populations24. Thus, although more refined work is needed to understand the haplotype effects in this genomic region, this might suggest that LRRK2 presents differences in effects in the Basque population, since that gene is an example of a gene that reflects the distinctive genetic background of the Basque population24.

Moreover, as mentioned before, we detected some suggestive loci that require further validation in a Basque cohort. Among the genes located in those loci, we found the AGT gene, which is involved in the genetic risk of thromboembolic events in IBD25 and in the prognosis of colorectal cancer26, a cancer whose risk is increased in CD27; it has also been proposed that AGT is an important regulator of apoptosis in intestinal epithelial cells28. In addition, other genes located in those suggestive loci are the BZW2 gene, a possible oncogene that could be a driver gene in colorectal cancer29; the DAPK2 gene, involved in the progression of colorectal cancer30; and the FSTL1 gene, involved in the proinflammatory response in inflammatory diseases31. Given the biological mechanisms in which those genes are involved, and although the associations are only suggestive, they seem good candidates for follow-up analyses to understand the development and prognosis of IBD, and their role should be established in future studies, at least in the Basque population.

Considering the genetic correlations and that some genes showed consistent involvement in IBD and CD compared with other European populations, it seems that the genetic architecture of IBD and CD in the Basque population is similar to that of other European populations, while the genetic architecture of UC is slightly different.

The use of the PRS derived from UK Biobank7 in IBD showed a slightly better performance than in that work (AUC of 0.69 in our cohort, 0.63 in UK Biobank7). When a Basque-specific PRS model was derived using IIBDGC GWAS results, the performance was slightly better in IBD (AUC value of 0.72), although with the important limitation that the same population was used both to derive the PRS and to test them for their discriminative potential (possibly generating inflated results). In the case of CD, the optimal model had an AUC of 0.73, which is lower than in other studies32,33. In one study32, IIBDGC data from 4906 CD cases and 11,494 controls were first used to derive the PRS using different methods, such as mixed linear models, elastic net regularization or Bayesian methods, to get the best predictive model. Then the best model was applied to 2204 CD cases and 997 controls from Australia and New Zealand, and the highest AUC was 0.78 (ref. 32). In another study33, 112 SNPs were tested to build the optimal PRS model in a Slovenian population, where 202 CD cases and 236 controls were analysed; the best AUC was 0.78, using 33 SNPs (ref. 33). In the case of UC, the performance of the optimal model (AUC = 0.68) in our cohort was not as good as in IBD and CD. The lower performance of PRS in UC than in CD was previously observed32: using 5788 UC cases and 16,194 controls from IIBDGC data to construct the best model and then applying it to 1193 UC cases and 997 controls from Australia and New Zealand, the best AUC was 0.70 (ref. 32). Therefore, the optimal model used in the present work should be analysed in an independent Basque cohort to validate its applicability. In addition, considering the good performance of the IIBDGC panel in the Basque and other cohorts, it seems that the application of PRS in IBD and CD should be based on data generated from multiple populations in order to be useful in clinical practice in different populations. 
As mentioned, the case of UC seems to be slightly different. Although we removed the HLA region from the PRS calculation to avoid the slightly different allelic frequencies in the Basque population13, the performance of the PRS did not improve. Therefore, the translation of genetic results in UC to clinical practice seems more complicated, as has been previously described for the use of PRS in closely related populations in other complex diseases8,9. In conclusion, it seems that the performance of PRS reflected the differences in the genetic architectures of IBD and its subtypes.

On the whole, we explored genetic features of IBD and its subtypes in a small Basque cohort for the first time. We detected signals mostly compatible and overlapping with those previously described in large multicentre cohorts of European descent, further suggesting the potential transferability of GWAS findings across European populations. Some of the association signals detected here in the Basques may correspond to bona fide risk loci and variants specific to this population, which warrants further investigation in much larger samples from the same area.

Methods

Samples

IBD cases were diagnosed using standard criteria; and the samples used in this study were obtained in standard clinical practice, after informed consent, in Hospital Universitario Donostia (San Sebastian, Spain) and Hospital Universitario de Cruces (Barakaldo, Spain). The samples of non-IBD controls were obtained through the Basque Biobank. In total 549 cases were recruited and 987 controls were used. All participants provided written informed consent.

The present study was approved by the Local Ethics Committee (Comité de Ética de la Investigación con medicamentos de Euskadi, code: PI + CES-BIOEF 2017-10).

Genotyping and imputation

DNA samples from the individuals included in this study were genotyped using Illumina Global Screening Array on Illumina iScan high-throughput screening system in the Institute of Clinical Molecular Biology (Kiel, Germany). To call the alleles from raw intensities the GenCall algorithm available in Illumina GenomeStudio 2.0 (https://www.illumina.com/techniques/microarrays/array-data-analysis-experimental-design/genomestudio.html) software was used.

Genotyped data was filtered removing samples and markers using the following procedure: exclusion of samples with ≥ 15% missing rates; exclusion of markers with non-called alleles; exclusion of markers with missing call rates > 0.05; exclusion of samples with ≥ 5% missing rates; exclusion of related samples (PI-HAT > 0.1875); exclusion of samples whose genotyped sex could not be determined; exclusion of samples with high heterozygosity rate (more than three times SD from the mean); only autosomal SNPs were kept; removal of markers with Hardy–Weinberg equilibrium p < 1 × 10−5; removal of markers whose p of difference in missingness between cases and control was < 1 × 10−5; and removal of samples which were outliers, identified using principal component analysis (deviation of more than six times interquartile range).
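One of the filters above, the exclusion of samples whose heterozygosity rate lies more than three SD from the mean, can be sketched as follows. This is an illustrative Python sketch rather than the pipeline used in the study: it assumes per-sample heterozygosity rates have already been computed by the genotype QC software, and the sample identifiers and rates are hypothetical.

```python
import statistics


def heterozygosity_outliers(het_rates, n_sd=3.0):
    """Return IDs of samples whose rate is more than n_sd SDs from the mean."""
    values = list(het_rates.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return sorted(
        sample for sample, rate in het_rates.items()
        if abs(rate - mean) > n_sd * sd
    )


# Hypothetical data: 30 samples with similar rates and one clear outlier.
het_rates = {f"S{i:02d}": 0.300 + 0.001 * (i % 5) for i in range(30)}
het_rates["S30"] = 0.400

outliers = heterozygosity_outliers(het_rates)
print(outliers)
```

Because the mean and SD are computed on all samples, including the outliers themselves, an extreme sample inflates the SD; with very small sample sets this can mask outliers, which is less of a concern at the cohort sizes used here.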

Imputation of missing genotypes was done using the Sanger Imputation service. The reference panel used was release 1.1 of the Haplotype Reference Consortium and the pipeline used was EAGLE2 + PBWT34,35,36. Once imputed, markers with INFO score < 0.80, markers with MAF < 0.01 and non-biallelic markers were removed.

After genotyping, quality control and imputation, 5,411,568 SNPs from 1433 individuals (498 cases and 935 controls) were kept.

Genetic analyses

Admixture analysis

Genotyped SNPs were pruned using Plink37 and SNPs from regions with high linkage disequilibrium were removed. Considering the particular genetic history of our cohort, a population admixture analysis was carried out using Admixture38, setting K between 1 and 10, and using the results with lowest cross-validation value. The analysis was carried out using the samples of our cohort.

Genome-wide association studies

GWAS analyses were performed using logistic regression implemented in Plink37, adjusting by sex and the first four principal components. The analyses were performed with all IBD cases, as well as with only CD cases and only UC cases separately.

In addition, ileal CD (N = 132), ileocolonic CD (N = 116), left-sided UC (N = 105) and extensive UC (N = 72) were separately analysed using logistic regression implemented in Plink37, adjusting by sex and the first four principal components.

Loci definition and gene-mapping

Risk loci from the analysed phenotypes were defined as non-overlapping genomic regions extending a linkage disequilibrium window (r2 = 0.4) from the association signals with p < 5.0 × 10−6. Annotation of GWAS results, including genes mapping to the identified risk loci, was performed with functional mapping and annotation (FUMA) of GWAS39.

Power analysis

A total of 195 independent genome-wide significant loci from IIBDGC results were selected3. To study the statistical power to replicate the IIBDGC signals in the Basque IBD GWAS, a power analysis was carried out using the R package “genpwr”40. The power calculation was performed separately for IBD and for each of its subtypes (CD and UC).

Replicating SNPs were defined as SNPs with nominally significant p-values (p < 0.05) in our study. The expected number of replicating SNPs can be estimated as the sum, over all IIBDGC SNPs, of the power to attain nominal replication. The ratio between the observed and expected numbers of replicating SNPs gives the effective replicability rate.
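The calculation described above can be sketched in a few lines of Python. The per-SNP power values in the first example are hypothetical and only illustrate the summation; the second part plugs in the expected and observed counts reported in the Results, where the expected counts are taken as the rounded sums of per-SNP power.

```python
def effective_replicability(n_observed, per_snp_power):
    """Return (expected replications, observed / expected).

    per_snp_power holds, for each SNP, the power to reach p < 0.05.
    """
    expected = sum(per_snp_power)
    return expected, n_observed / expected


# Hypothetical per-SNP power values, for illustration only.
example_power = [0.90, 0.50, 0.35, 0.25]
expected, rate = effective_replicability(1, example_power)
print(f"expected = {expected:.2f}, rate = {rate:.2f}")

# Counts reported in the Results: expected vs. observed nominal replications.
reported = {"IBD": (36, 24), "CD": (35, 25), "UC": (24, 21)}
rates = {
    trait: round(100 * observed / exp, 1)
    for trait, (exp, observed) in reported.items()
}
print(rates)
```

Summing power rather than counting SNPs above a power cut-off lets poorly powered SNPs contribute fractionally to the expectation, so the rate is not sensitive to an arbitrary threshold.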

PheWAS analysis

Lead SNPs from each suggestive locus were inspected using Phenoscanner V2 (refs. 41, 42). Traits associated with the lead SNP or with SNPs in LD with the lead SNP (R2 ≥ 0.8) were retrieved; and traits with a genome-wide significant p-value (p < 5 × 10−8) were kept.

Gene-set enrichment analyses

To test for over-representation of biological functions based on gene annotations (gene set enrichment analysis), we screened the Molecular Signature Database (MsigDB) using the list of FUMA mapped genes against all genes in hypergeometric enrichment tests. Gene sets with an adjusted p < 0.05 (false discovery rate correction according to Benjamini–Hochberg) were considered significant evidence of enrichment.
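The two statistical steps named above, a hypergeometric over-representation test followed by Benjamini–Hochberg correction, can be sketched with the Python standard library. This is an illustrative sketch and not the FUMA implementation; the function names and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_set, n_mapped, n_overlap):
    """P(overlap >= n_overlap) when n_mapped genes are drawn at random from
    n_background genes, of which n_in_set belong to the gene set."""
    upper = min(n_in_set, n_mapped)
    total = comb(n_background, n_mapped)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_mapped - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: 5 of 50 mapped genes fall in a 200-gene set,
# against a background of 20,000 genes.
p_set = hypergeom_enrichment_p(20000, 200, 50, 5)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(p_set, adjusted)
```

A gene set would then be reported as enriched when its adjusted p-value is below 0.05, matching the threshold stated above.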

Depict43, as available in CTG-VL (https://vl.genoma.io), was used to find the causal genes at associated loci and to perform gene-set enrichment and tissue enrichment analyses. In that analysis, SNPs with p < 1 × 10−5 were used.

S-PrediXcan, an extension of PrediXcan for summary data, was used to map genes through expression data of relevant tissues44, as it is available in CTG-VL. The expression data used was based on GTEx45 and the tissues inspected were terminal ileum, colon transverse and colon sigmoid. Genes with p < 2.5E−7 were considered significant. In addition, gene set enrichment analyses with those genes were performed using FUMA.

Heritability and genetic correlation

To study the heritability and genetic correlation of the results of this study and the results from IIBDGC, the ldsc program46 was used, as available in CTG-VL. Results from the association analyses of all IBD cases, only CD cases and only UC cases of the present study were compared with their counterparts available from IIBDGC. In addition, we analysed the genetic correlations of the IBD, CD and UC association analyses with the traits available in CTG-VL and LDHub47.

HLA association analysis

HLA types were imputed from genotyped data using the HIBAG package48 available in R language49. The European panel was used as the imputation model.

The association analysis was carried out with HIBAG using logistic regression and testing a dominant model, adjusting by sex and the first four principal components.

The analyses were performed with all IBD cases, as well as with only CD cases and only UC cases separately.

Polygenic risk score

Firstly, the polygenic risk score (PRS) was calculated using the weights calculated by Khera et al.7 and retrieved from PGS catalog50. Those weights were applied in the Basque cohort using Plink37.
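Conceptually, applying a set of published weights amounts to a weighted sum of effect-allele dosages over the SNPs available in both the weight file and the cohort. The Python sketch below illustrates that idea only; it is not the Plink implementation, the SNP identifiers and weights are hypothetical, and scoring tools may normalise the result differently (for example, reporting an average rather than a sum).

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of effect-allele dosages over SNPs present in both inputs.

    dosages: SNP id -> effect-allele dosage of one individual (0 to 2).
    weights: SNP id -> per-allele weight from the published model.
    """
    shared = dosages.keys() & weights.keys()
    return sum(dosages[snp] * weights[snp] for snp in shared)


# Hypothetical weights and one hypothetical individual.
weights = {"snp1": 0.20, "snp2": -0.10, "snp3": 0.05}
dosages = {"snp1": 2, "snp2": 1, "snp4": 0}  # snp3 missing, snp4 unweighted

score = polygenic_risk_score(dosages, weights)
print(round(score, 3))
```

Restricting the sum to shared SNPs mirrors the situation described in the Results, where only part of the published model's SNPs could be used in the Basque cohort.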

Secondly, PRS were calculated using PRSice software51. The results from IIBDGC were used as base summary statistics; an additive model was tested; and the analysis was adjusted by sex and the first four principal components. The analyses were performed with all IBD cases, as well as with only CD cases and only UC cases separately. The performance of the PRS was measured by comparing the PRS score distributions of cases and controls with a t-test in R language49, and by calculating the area under the curve using the pROC package of R language. The 95% confidence interval of the area under the curve was calculated using that package and the DeLong method.
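For intuition about what the area under the curve measures, it equals the probability that a randomly chosen case has a higher PRS than a randomly chosen control, with ties counted as one half. The Python sketch below computes that empirical quantity directly; it is not the pROC implementation, it does not compute the DeLong confidence interval, and the scores are hypothetical.

```python
def empirical_auc(case_scores, control_scores):
    """Fraction of case-control pairs in which the case scores higher
    (ties count as 0.5); equivalent to the area under the ROC curve."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical PRS values.
cases = [0.9, 0.7, 0.4]
controls = [0.8, 0.3, 0.2, 0.1]

auc = empirical_auc(cases, controls)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to no discrimination between cases and controls and 1.0 to perfect separation, which is the scale on which the values reported in the Results should be read.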

Graphics were depicted using R language49, and the ggplot2 3.3.5 (ref. 52) and corrplot 0.87 (https://github.com/taiyun/corrplot) packages.

All methods were performed in accordance with relevant guidelines and regulations, including the Declaration of Helsinki.