Introduction

Crohn’s disease (CD) and ulcerative colitis (UC) are the most common inflammatory bowel diseases (IBD) and are characterized by chronic remitting and relapsing gastrointestinal inflammation. In the United States, the prevalence of IBD for children (<20 years old) was estimated to be 92 cases per 100,000 in 2009, accounting for approximately 5% of prevalent cases [1]. Increasing prevalence [1] and rates of hospitalization [2] for pediatric IBD have been observed in the US, mirroring the trend of increasing IBD incidence in both pediatric [3, 4] and adult [5] populations worldwide. Diagnosed early in life, pediatric patients face years of medication, surveillance colonoscopy, and a high probability of surgery. Better understanding of disease etiology and progression in this group is therefore vital.

IBD is thought to have a strong genetic component, since family history of IBD is the greatest risk factor for disease at all ages. IBD patients with a family history of disease often present at a younger age [6,7,8], are more likely to experience extra-intestinal manifestations [6], have perforating disease, and require longer follow-up compared to patients without family history [6, 7], likely reflecting an increased genetic liability to disease. Genetic analyses of pediatric cohorts are therefore useful in exploring genetic architecture of IBD.

Large genome-wide association studies (GWAS) of IBD have found more than 200 common loci associated with disease [9, 10]. Pathway analysis of associated loci has found an enrichment of immune system genes, especially those related to host response to microbes, and a great deal of overlap with other immune diseases [9]. Findings of studies of common variation in pediatric IBD cohorts generally echo findings in adult populations. One study of greater than 1000 pediatric-onset IBD cases and 1600 controls found slightly increased odds ratios for risk alleles also found in adult populations (including the well-known NOD2), and greater burden of these common variants was weakly correlated with earlier age of onset in CD [11].

Common variants explain only a small proportion of disease liability in IBD (13.1% in CD and 8.2% in UC [9]), but the contribution of rare variants has not been assessed. This class of genetic variation is important because explosive growth of the human population in recent history has led to a corresponding excess of rare alleles [12], and most variants in protein-coding sequence are at low frequency [13,14,15]. The availability of public data sets allows us to compare whole-exome sequencing (WES) of a pediatric IBD cohort to other WES data [16] and to large databases containing population allele frequency information [15, 17]. We can further examine pathways implicated by genes annotated to these rare variants to gain greater understanding of IBD.

Results

Study participant characteristics

Relevant demographic and clinical characteristics are shown in Table 1 for the 368 cases with pediatric-onset IBD (<18 years of age at diagnosis) and 625 publicly available controls from the database of Genotypes and Phenotypes (dbGaP) whose data passed our quality control filters and principal components criteria (see Methods and Supplementary Fig. 1). The characteristics of the initial cohort of 517 pediatric-onset IBD cases (see Methods) are also available in Supplementary Table 1.

Table 1 Clinical and demographic characteristics of samples with exome sequencing data used in analysis

Common variants (MAF>0.05)

Using logistic regression to compare sites with minor allele frequency (MAF) > 0.05 between the 368 pediatric-onset IBD cases and 625 publicly available controls, we found no sites that reached genome-wide significance after genomic control (p < 2E-06, Figure 1 and Table 2). However, 14 out of the top 20 sites were within known CD- or IBD-associated loci (full list of loci from Jostins 2012 [9] and Liu 2015 [10] available as Supplementary Table 2). Nine variants were around the locus containing CARD9, a gene associated with both CD and UC (Supplementary Fig. 2), and three variants were near the locus containing CD-associated NOD2. Two protective variants also appeared at other CD loci in ADAM30 and NOTCH2. Genes annotated to the top 20 sites that also appeared in our list of genes involved in neutrophil function (Supplementary Table 3) included NOD2 and CARD9, which have key roles in anti-bacterial and anti-fungal functions of monocytes and macrophages.

Fig. 1 Manhattan plot of p-values from logistic regression (with significant principal components and sex as covariates) comparing frequency of exome sequencing common variants in pediatric IBD cases to controls from dbGaP

Table 2 Top 20 most significant loci found in our common variant logistic regression

Pathway enrichment

Many of the pathways we found in our ClueGO pathway enrichment analysis that were implicated by the top 200 most significant annotated genes were immune-related (Table 3 and Fig. 2). The largest network of significant gene ontology (GO) terms included regulation of production of molecular mediators of immune response, as well as regulation of cytokine and tumor necrosis factor production. Terms related to regulation of leukocyte-mediated immunity, cytotoxicity, and apoptosis were also significant. Other associated pathways related to the theme of cell killing included positive regulation of apoptotic cell clearance and regulation of complement activation. Regulation of keratinocyte proliferation, Ras signal transduction, and muscle cell and neural crest cell development were also implicated.

Table 3 Significantly enriched pathways in the top 200 most significant genes in our common variant (dbGaP) analysis

Fig. 2 Pathway enrichment of the genes annotated to the top 200 most significant common variants tested in our logistic regression

Rare variants (MAF<0.05)

Optimal unified association test (SKAT-O) analysis of rare variants

Using the same IBD and dbGaP cohorts, we tested rare variants with combined annotation dependent depletion (CADD) scores [18] greater than 10 to see if any genes were significantly enriched with these possibly pathogenic variants. The only genome-wide significant gene (p < 2E-05) was the well-known NOD2 (Table 4A). When we tested enrichment of variants in loci associated with IBD, the only significant list was the Crohn’s-disease-associated loci (p = 0.009, Table 4B). We also found a suggestive relationship between case status and rare variants in 144 genes that have been implicated in neutrophil function (p = 0.05, Table 4C).

Table 4A Top 15 results from SKAT-O analysis of enrichment of rare, likely-pathogenic (CADD > 10) variants in genes with five or more variants
Table 4B SKAT-O analysis for enrichment of rare variants with CADD scores >10 in loci associated with Crohn’s disease (CD), inflammatory bowel disease (IBD), or ulcerative colitis (UC)
Table 4C SKAT-O analysis for enrichment of rare, conserved variants in neutrophil function genes (NEUT)

We re-ran the SKAT-O analysis, adding common variants with CADD scores >10 to our list of rare variants. Including these common variants did not greatly impact the significance of genes associated with case status, likely because there were relatively few variants above the CADD score cutoff at 5% frequency or greater. However, including common variants strengthened the enrichment of variants in CD genes (p = 0.004; Supplementary Table 4A) and neutrophil function genes (p = 0.03; Supplementary Table 4B).

Exome Aggregation Consortium (ExAC) rare variant analysis

As expected, p-values were highly inflated when we performed Fisher’s exact tests comparing rare variant counts between the 368 pediatric IBD patients and aggregate allele frequencies for Caucasian populations in the ExAC database (Supplementary Fig. 3). We therefore limited our analysis to sites that passed the stringent QC in our dbGaP analysis, and further filtered out sites in ExAC that were most significantly different from our dbGaP controls (see Methods). As seen in Fig. 3, genome-wide inflation was no longer apparent after applying these criteria. As shown in Table 5, six variants were genome-wide significant (p < 6E-07), with the most significant annotated to NOD2. Two others among the top 20 most significant variants were annotated to known IBD loci: a second variant in NOD2 and one in D2HGDH. Of our list of neutrophil function genes, only NOD2 was annotated to any of the top 20 most significant rare variants.

Fig. 3 Manhattan plot of p-values from comparing frequency of exome sequencing rare variants in pediatric IBD cases to ExAC after filtering out sites most significantly different between ExAC and our control data set

Table 5 Top 20 most significant sites in our rare variant Fisher’s exact tests

Pathway enrichment

According to analysis in ClueGO, the top 200 most significant genes in our list of rare variants were enriched in a few pathways (Table 6 and Fig. 4). Immune-response-related hits included negative regulation of the JAK-STAT cascade, modulation by host of viral transcription, and modification by host of symbiont morphology and physiology. Genes were also enriched in pathways involving ion transmembrane transport and negative regulation of axon extension. ToppFun analysis also highlighted genes involved in response to bacterium, regulation of antigen processing and presentation of peptide antigen, immune system development, and biological adhesion pathways (Supplementary Table 5).

Table 6 Significantly enriched pathways using the list of the top 200 most significant genes in our ExAC rare variant analysis

Fig. 4 Pathway enrichment of the genes annotated to the top 200 most significant rare variants tested in our rare variant analysis

Discussion

Our findings echo important aspects of previous genetic and pathway enrichment analyses. Crohn’s-disease-associated loci had a strong showing in our results: two variants in NOD2 were the most significant in our dbGaP common variant analysis, and one site was significant in our ExAC rare variant analysis. NOD2 also emerged as significant in our gene-level SKAT-O analysis, and CD-associated genes as a group were also significant. This was not unexpected since the majority of our cohort were Crohn’s patients. Of the top 20 most significant common variants, 9 were within a single 100 kb region around CARD9 (Supplementary Fig. 2), a gene that has long been associated with IBD. This entire region looks equally associated with disease (OR ~1.5) in our cohort, reflecting that deep sequencing still cannot solve problems regarding fine mapping of causative variants without sufficient recombination.

We also found intriguing variants in genes not yet associated with IBD. KRTAP9-2 and KRTDAP, two of our top five common variant findings, are involved in keratinocyte differentiation, a theme that also emerged in our common variant pathway analysis. Keratinocytes are the most abundant component of the epidermis, playing an important role in immunomodulation at the interface between the body and environment. Capable of producing cytokines, these cells have been linked to a different inflammatory disease, psoriasis [19, 20]. Additionally, one recent study found that the interplay of hair follicle development, colonization by commensal microbiota, and local chemokine production in skin was necessary to establish immune tolerance to commensal microbes [21]; dysfunction in the skin environment could potentially impact this process and have systemic immune repercussions. These suggestive findings require replication in future, larger studies of pediatric IBD.

LAMA5, another top hit in our common variant analysis, encodes a subunit of laminin. Laminins are extracellular matrix proteins which are a major component of the basement membrane, a matrix of tissue that separates the epithelium, mesothelium, and endothelium from underlying connective tissue. Because of the important role of laminins in the integrity of this layer, there could be a role for LAMA5 in IBD pathogenesis. One study of transgenic mice overexpressing the LAMA5 mouse homolog found an attenuated response to DSS-induced inflammation [22]. The two most significant genes in our SKAT-O rare variant analysis after NOD2, VWA2 and HAPLN3, are also extracellular matrix components. In addition, the location and functions of the products of these genes are linked to integrins, which have emerged as important in large IBD GWAS [23]. And one recent, prospective study of more than 900 CD patients found that stricturing complications were associated with increased expression of extracellular matrix genes in ileal tissue at diagnosis [24]. Further studies are warranted to investigate the roles of these extracellular matrix proteins in disease etiology.

We were additionally interested in testing enrichment of rare variants in neutrophil function genes because children with inherited disorders of these classes of immune cells exhibit chronic intestinal inflammation similar to CD during the first decade of life [25, 26]. Similarly, loss of function in monocyte and/or macrophage antimicrobial pathways could be one mechanism of pediatric CD pathogenesis. Though we did not find a significant association, we did find a suggestive relationship in SKAT-O between rare, likely-deleterious variants in genes involved in neutrophil function and case status (p = 0.05). And when likely-deleterious common variants were also included, this association was significant (p = 0.03). Positive regulation of leukocyte-mediated immunity was also one of the most significant pathways in our common variant analysis, supporting further study into the role of phagocyte function and dysfunction in IBD.

Another important component of the immune system from our pathway analysis was complement; mutations in C2, C3, and CFB were among the top 200 most significant common variants associated with disease in our cohort. Though research into the role of complement has been somewhat lacking, evidence is growing for its potential relevance in disease pathophysiology (reviewed in [27]). A closely related theme, apoptosis, also appeared in several other significant pathways.

Ras signaling was another pathway of interest from our common variant analysis, and SOS1, one of the top hits in our rare variant SKAT-O analysis, is also a guanine nucleotide exchange factor for RAS proteins. In fact, this pathway was previously implicated by a large study drawing from over 30,000 cases and 50,000 controls in contributing to IBD etiology as part of growth factor signaling [28]. Because growth factor deficiencies have been found in patients with IBD, there has been substantial interest in their use as a potential therapeutic agent (reviewed in [29]). Other current targets of therapy that emerged in our analysis include interferon-gamma, a pro-inflammatory cytokine involved in intestinal homeostasis and linked to regulation of IL-23 [30], another cytokine associated not only with IBD but other inflammatory diseases. In our rare variant analysis, we found negative regulation of the JAK-STAT cascade, another important inflammatory pathway targeted by recent therapies [31], which underscores the importance of immune cell response to cytokine signaling in disease.

The primary limitation of this study is the lack of in-house controls for comparison to our cases. However, we performed stringent QC of our data to filter differences between data sets. We used the same processing pipeline for dbGaP as we used for our case data, and filtered to an ancestrally similar population. However, systematic calling differences between our pipeline and ExAC, such as calling or filtering of indels, could still be leading to inflation of p-values and odds ratios in our rare variant analysis.

We combined CD and UC to leverage the maximum sample size possible to gain further insight into the shared genetic architecture of IBD. However, CD-related variants were enriched in our results, likely because of our CD-majority cohort and the large effect size of associated loci including NOD2. We still found variants in HLA genes, which are most strongly linked to UC, in our results, but these sites did not reach genome-wide significance in our cohort. For example, HLA-A and HLA-C were among the top 200 most significant genes in our logistic case/dbGaP control regression, and were therefore used in ClueGO analysis.

While large genome-wide association studies have been performed in IBD, our study is the first to specifically investigate the contribution of rare, likely-damaging variants in pediatric-onset disease. Our findings provide further targets for exploring disease etiology at both the gene and pathway level. A better understanding of the genetic architecture of IBD may improve disease prediction and treatment.

Subjects and methods

Ethical approval and recruitment of study participants

Subjects for WES were selected from patients enrolled in the Crohn’s and Colitis Foundation (CCFA) sponsored RISK cohort study and the NIH sponsored Emory African-American gene discovery study, for whom DNA had already been collected. RISK is the largest pediatric CD inception cohort in the world, with 1813 subjects younger than 18 years old with suspected IBD enrolled at 28 North American sites, including Emory University, from November 2008 to June 2012 (ClinicalTrials.gov Identifier: NCT00790543). All patients underwent baseline colonoscopy and histological confirmation of chronic active colitis/ileitis prior to diagnosis and treatment. Once standard and published guidelines were met, patients were diagnosed with CD, UC or IBD-undetermined (IBD-U). A consistent diagnosis of IBD was required during the one-year follow-up for inclusion into this study. At enrollment and during ongoing prospective follow-up, clinical and laboratory data were obtained for each enrolled patient and submitted to a centralized data management center. All patients were managed according to the dictates of their physicians, not by standardized protocols. The patient-based studies were approved by the Institutional Review Boards at each of the RISK sites. Consent was obtained from parents and adult subjects and assent from pediatric subjects age 11 and above.

Emory case sample collection, processing and exome sequencing

Genomic DNA was extracted from whole blood for a total of 567 pediatric IBD samples, of which 553 (97.5%) passed DNA QC. Library preparation and sequencing of the samples were performed at Broad Institute’s Genomics Platform, Cambridge, USA. The libraries were prepared according to the manufacturer's instructions using 1 μg of input DNA per sample. DNA was subjected to whole-exome capture with the SureSelect Human All Exon 50-Mb Kit (Agilent Technologies) following the standard protocols. Library validation was done with the KAPA Library Quantification Kit (KAPA Biosystems) and the whole-exome capture libraries were then sequenced on the Illumina HiSeq platform according to standard protocols.

Publicly available data sets

Database of genotypes and phenotypes (dbGaP) [16] data

We identified and downloaded control data from the Epi4K (accession phs000653.v2.p1) and ARRA (accession phs000298.v3.p2) studies. SRA files were converted to fastq format using NCBI’s SRA Toolkit [32].

ExAC (http://exac.broadinstitute.org/) [15, 17] data (version 0.3.1)

For this publicly available data set containing information on 60,706 individuals, we used liftOver to map all sites to hg38 for comparison with our data. We summed minor and total allele counts for the American, Finnish, and non-Finnish European groups and required a site to be typed in >90% of total chromosomes for these groups (at least 76,438 out of 84,930 chromosomes) for inclusion.
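The pooling and call-rate filter described above can be sketched as follows. This is an illustrative sketch only, not the authors' pipeline: the function names, the group labels, and the per-group counts are hypothetical, and the only figures taken from the text are the 84,930 total chromosomes and the >90% requirement.

```python
# Total chromosomes across the pooled ExAC groups, as stated in the text.
TOTAL_CHROMOSOMES = 84_930
MIN_CALL_PERCENT = 90


def pool_site(group_counts):
    """Sum (minor_allele_count, chromosomes_typed) pairs over population groups."""
    minor = sum(ac for ac, _ in group_counts.values())
    typed = sum(an for _, an in group_counts.values())
    return minor, typed


def passes_call_rate(typed, total=TOTAL_CHROMOSOMES, min_percent=MIN_CALL_PERCENT):
    """Keep a site only if it is typed in more than min_percent of chromosomes.

    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    return typed * 100 > min_percent * total


# Hypothetical counts for one site: (minor allele count, chromosomes typed).
site = {"AMR": (12, 11_000), "FIN": (3, 6_500), "NFE": (40, 60_000)}
minor, typed = pool_site(site)
if passes_call_rate(typed):
    print(f"keep site: pooled MAF = {minor / typed:.5f}")
```

With these thresholds the smallest passing count is 76,438 typed chromosomes, matching the figure given in the text.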

dbGaP (raw whole-exome sequencing) analysis

We mapped Emory and dbGaP exome sequencing fastq files to hg38 using PEMapper and called variants using PECaller [33]. We then used SeqAnt [34] version 2.0 [35] (Beta 3, https://seqant.genetics.emory.edu/) to obtain rsID numbers for use in PLINK and other annotation information for later analysis.

All subsequent variant quality control (QC) was performed in PLINK 1.9 [36,37,38]. Starting with 866,411 variants in 1035 controls and 541 cases diagnosed with IBD before age 18, we filtered samples and variants using increasingly stringent completeness criteria until information for all remaining variants and samples was 99% complete. For each study individually (IBD, ARRA, Epi4K), we removed sites that were Bonferroni significant in a Hardy–Weinberg equilibrium test. We then performed a sex check of samples. Cases were removed if their sex was discordant with record review (N = 9); other mislabeled sexes were corrected. We checked sample relatedness and removed 8 controls and 10 cases who were second degree or more closely related to another study participant. Supplementary Table 1 shows characteristics for the 517 remaining IBD patients who passed this first round of quality control. We combined CD and UC patients because of the shared genetic architecture of these diseases and the relatively small sample size of either group alone.
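To make the Hardy–Weinberg filter concrete, the sketch below flags sites whose p-value falls under a Bonferroni threshold. It is a simplification under stated assumptions: it uses a 1-df chi-square goodness-of-fit test rather than the exact test that PLINK provides, alpha = 0.05 is assumed (the text does not give it), and the function names and genotype counts are hypothetical.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg proportions from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic site: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def bonferroni_hwe_failures(genotype_counts, alpha=0.05):
    """Return site IDs whose HWE p-value is below alpha / (number of sites)."""
    if not genotype_counts:
        return []
    threshold = alpha / len(genotype_counts)
    return [site for site, counts in genotype_counts.items()
            if hwe_chi2_pvalue(*counts) < threshold]


# Hypothetical genotype counts (AA, AB, BB) for two sites in one study.
counts = {"site_a": (25, 50, 25), "site_b": (50, 0, 50)}
print(bonferroni_hwe_failures(counts))
```

Applying the function separately to each study's genotype counts mirrors the per-study filtering described in the text.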

To adjust for population stratification in our sample, we used 10,913 common (MAF > 0.05) SNPs to calculate principal components (PCs) using EIGENSTRAT [39], anchoring with HapMap controls as described by Anderson et al. [40] (Supplementary Fig. 1A). We removed outliers (samples with values more than 3 standard deviations from the mean) for any of the top seven principal components (those which appeared meaningful, with eigenvalues >2), recalculated principal components, and repeated outlier filtering with four meaningful PCs, leaving us with a final data set of 625 controls and 368 cases (Supplementary Fig. 1B; Table 1 shows basic characteristics for these participants). PCs were recalculated again without HapMap samples (Supplementary Fig. 1C), and the four principal components significant by Tracy-Widom tests were used as covariates in regressions.
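The outlier rule in this step can be illustrated with a short sketch. It assumes PC scores are already computed (the PCA itself, and its recalculation between rounds, is not shown), uses the sample standard deviation, and the function name and toy data are hypothetical.

```python
import statistics


def remove_pc_outliers(pc_scores, n_pcs, max_sd=3.0):
    """Drop samples more than max_sd standard deviations from the mean
    on any of the first n_pcs principal components (single pass)."""
    kept = set(pc_scores)
    for k in range(n_pcs):
        values = [scores[k] for scores in pc_scores.values()]
        mean = statistics.mean(values)
        sd = statistics.stdev(values)
        if sd == 0:
            continue  # no spread on this PC, so no outliers
        for sample, scores in pc_scores.items():
            if abs(scores[k] - mean) > max_sd * sd:
                kept.discard(sample)
    return {s: v for s, v in pc_scores.items() if s in kept}


# Toy data: 19 samples clustered near zero on PC1 plus one distant sample.
pcs = {f"s{i}": [(-1.0) ** i] for i in range(19)}
pcs["outlier"] = [50.0]
cleaned = remove_pc_outliers(pcs, n_pcs=1)
print(len(pcs), "->", len(cleaned))
```

In the procedure described above, the PCs would be recomputed on the retained samples before the filter is applied a second time.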

As an additional filter, we removed variants that were most significantly different (top 2.5%) in Fisher’s exact tests comparing our dbGaP controls to ExAC.

Common variant analysis

We performed logistic regression for sites with MAF > 0.05 in PLINK with case/control status as the outcome, genotype as the predictor of interest, and sex and PCs as covariates. P-values were corrected with genomic control.
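As a minimal sketch of genomic control, the inflation factor lambda is the median observed 1-df chi-square statistic divided by its expected median under the null (about 0.455), and each statistic is divided by lambda before being converted back to a p-value. The code below assumes two-sided p-values from 1-df tests and floors lambda at 1 so that results are never made more significant; both are assumptions about the setup, not details given in the text.

```python
from statistics import NormalDist, median

_NORMAL = NormalDist()
# Median of a 1-df chi-square distribution: Phi^-1(0.75)^2, about 0.4549.
CHI2_1DF_MEDIAN = _NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2(p):
    """Convert a two-sided p-value (0 < p <= 1) to a 1-df chi-square statistic."""
    return _NORMAL.inv_cdf(p / 2) ** 2


def genomic_control(p_values):
    """Return (lambda, corrected p-values) using the median-based estimator."""
    chi2 = [p_to_chi2(p) for p in p_values]
    lam = max(median(chi2) / CHI2_1DF_MEDIAN, 1.0)  # assumption: never deflate
    corrected = [2 * _NORMAL.cdf(-((x / lam) ** 0.5)) for x in chi2]
    return lam, corrected


# Hypothetical, mildly inflated p-values.
lam, adjusted = genomic_control([0.001, 0.02, 0.2, 0.3, 0.4])
print(f"lambda = {lam:.3f}")
```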

SKAT-O analysis

We used the SKAT-O method within the SKAT package [41] in R [42] to analyze genes annotated to sites with MAF < 0.05 and evidence of pathogenicity with CADD score >10. SKAT-O is an approach that optimizes association tests by unifying burden and sequence kernel association approaches [43]. We tested for association of genes with case/control status for any gene with five or more rare variants. We also lifted over loci associated with IBD from Jostins et al. 2012 [9] and Liu et al. 2015 [10] to hg38, yielding 201 loci, and tested for enrichment of rare variants 250 kb upstream or downstream of CD, UC, or IBD loci as groups (Supplementary Table 2).
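The variant and gene selection that precedes the SKAT-O test can be sketched as below. This covers only the filtering stated in the text (MAF < 0.05, CADD > 10, five or more qualifying variants per gene), not the SKAT-O statistic itself, which was computed with the SKAT package in R. The record layout, function name, and gene labels are hypothetical.

```python
from collections import defaultdict


def genes_for_testing(variants, max_maf=0.05, min_cadd=10.0, min_variants=5):
    """Group rare, likely-pathogenic variants by gene and keep genes
    with at least min_variants qualifying variants."""
    by_gene = defaultdict(list)
    for v in variants:
        if v["maf"] < max_maf and v["cadd"] > min_cadd:
            by_gene[v["gene"]].append(v["id"])
    return {gene: ids for gene, ids in by_gene.items() if len(ids) >= min_variants}


# Hypothetical annotated variants: GENE_A has five qualifying sites, GENE_B four.
variants = (
    [{"id": f"a{i}", "gene": "GENE_A", "maf": 0.01, "cadd": 15.0} for i in range(5)]
    + [{"id": f"b{i}", "gene": "GENE_B", "maf": 0.01, "cadd": 15.0} for i in range(4)]
    + [{"id": "b_common", "gene": "GENE_B", "maf": 0.20, "cadd": 25.0}]
    + [{"id": "b_benign", "gene": "GENE_B", "maf": 0.01, "cadd": 2.0}]
)
print(sorted(genes_for_testing(variants)))
```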

We also wanted to test whether variants were enriched in neutrophil function genes because strong ileal activation of the immune response, including a strong signature for blood CD11b+Ly6-G+ neutrophils (GSM854306, p < 6.5E-50), was found using clinical and RNA-Seq data from the CCFA RISK prospective cohort [44]. We used GSM854306 from the ImmGen atlas (GSE15907) to retrieve all 409 blood CD11b+Ly6-G+ neutrophil genes and combined this with a manually curated, literature-based list of 74 human neutrophil-related genes, including those known to cause CGD and GSD1b. We analyzed these two gene lists in ToppCluster [45], cross-validating their association with neutrophil-related genes and pathways based on other annotations of critical neutrophil functions including priming, chemotaxis, adhesion, phagocytosis, oxidative burst, degranulation, microbial killing, and survival (GO, Mouse phenotypes, Diseases). Using this filtering we were able to decrease the original total of 463 neutrophil genes to 144 genes that are associated with CD and known to regulate key neutrophil functions (Supplementary Table 3).

ExAC (aggregate allele count) analysis

Rare variant analysis

Using the same set of variants as in the dbGaP analysis (with sites most significantly different between dbGaP and ExAC filtered out), we used Fisher’s exact tests to compare rare variant sites (MAF < 0.05) between our IBD cases and ExAC. Genomic control was used to correct p-values.
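For a single site, the case-versus-ExAC comparison reduces to a 2x2 table of minor and major allele counts. The sketch below is one way to compute a two-sided Fisher's exact p-value with the standard library. It sums the probabilities of all tables no more likely than the observed one, which is a common two-sided convention but an assumption here, since the text does not say which was used. The function names and example counts are hypothetical.

```python
import math


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Rows: cases / reference panel. Columns: minor / major allele counts.
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_denominator = _log_comb(row1 + row2, col1)

    def log_prob(x):
        # Hypergeometric log-probability of x minor alleles in row 1.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - log_denominator

    low, high = max(0, col1 - row2), min(row1, col1)
    log_observed = log_prob(a)
    total = 0.0
    for x in range(low, high + 1):
        lp = log_prob(x)
        if lp <= log_observed + 1e-7:  # small tolerance for ties
            total += math.exp(lp)
    return min(total, 1.0)


# Hypothetical site: 8 minor alleles in 736 case chromosomes (368 cases)
# versus 150 minor alleles in 80,000 typed reference chromosomes.
print(fisher_exact_two_sided(8, 736 - 8, 150, 80_000 - 150))
```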

Pathway enrichment analysis

To test for pathway enrichment, we used the ClueGO plugin version 2.3.3 for Cytoscape version 3.4.0. We performed right-sided hypergeometric tests for enrichment of level 3 to 8 biological process GO terms (using the Human GO database from 25 January 2017) with Benjamini–Hochberg p-value correction for multiple tests. GO Term Fusion was used to reduce pathway redundancy. For common and rare variants, the top 200 most significant genes were used to interrogate pathway enrichment in our sample. This threshold was picked so that ClueGO input did not have duplicate genes and was consistent across common and rare variant comparisons. All genes in the common variant analysis had p-values ≤0.01, while those in the rare variant analysis had p ≤ 0.002.
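The two statistical ingredients named above, a right-sided hypergeometric test per GO term and Benjamini–Hochberg correction across terms, can be sketched as follows. This is not ClueGO's implementation. The function names are hypothetical, and the choice of gene universe (here an explicit argument) is an assumption that the text does not specify.

```python
import math


def hypergeom_right_tail(k, n_selected, n_annotated, n_universe):
    """P(X >= k): chance of at least k annotated genes among n_selected
    genes drawn from a universe in which n_annotated genes carry the term."""
    upper = min(n_selected, n_annotated)
    total = math.comb(n_universe, n_selected)
    hits = sum(
        math.comb(n_annotated, x) * math.comb(n_universe - n_annotated, n_selected - x)
        for x in range(k, upper + 1)
    )
    return hits / total


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: 6 of 200 input genes fall in a term annotated
# to 100 genes, out of an assumed universe of 18,000 genes.
p_term = hypergeom_right_tail(6, 200, 100, 18_000)
print(p_term, benjamini_hochberg([p_term, 0.04, 0.5]))
```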

We also used ToppFun, from the ToppGene Suite of bioinformatics tools, to perform functional enrichment analysis. While we only used biological process terms with ClueGO, ToppFun pulls annotation information from GO, human and mouse phenotype data, gene expression, protein interaction and pathway databases [46].

Data availability

Raw sequencing data for individuals with inflammatory bowel disease included in this study are publicly available on dbGaP. Study accession: phs001076.v1.p1, URL: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001076.v1.p1.