Mutation spectrum of NOD2 reveals recessive inheritance as a main driver of Early Onset Crohn’s Disease

Inflammatory bowel disease (IBD), clinically defined as Crohn’s disease (CD), ulcerative colitis (UC), or IBD-unclassified, results in chronic inflammation of the gastrointestinal tract in genetically susceptible hosts. Pediatric onset IBD represents ≥ 25% of all IBD diagnoses and often presents with intestinal stricturing, perianal disease, and failed response to conventional treatments. NOD2 was the first locus associated with adult IBD and remains the most replicated to date. However, its role in pediatric onset IBD is not well understood. We performed whole-exome sequencing on a cohort of 1,183 patients with pediatric onset IBD (ages 0–18.5 years). We identified 92 probands with biallelic rare and low frequency NOD2 variants, accounting for approximately 8% of our cohort and suggesting a Mendelian inheritance pattern of disease. Additionally, we investigated the contribution of recessive inheritance of NOD2 alleles in adult IBD patients from a large clinical population cohort. We found that recessive inheritance of NOD2 variants explains ~ 7% of cases in this adult IBD cohort, including ~ 10% of CD cases, confirming the observations from our pediatric IBD cohort. Exploration of EHR data showed that several of these adult IBD patients obtained their initial IBD diagnosis before 18 years of age, consistent with early onset disease. While it has been previously reported that carriers of more than one NOD2 risk allele have increased susceptibility to Crohn’s Disease (CD), our data formally demonstrate that recessive inheritance of NOD2 alleles is a mechanistic driver of early onset IBD, specifically CD, likely due to loss of NOD2 protein function. Collectively, our findings show that recessive inheritance of rare and low frequency deleterious NOD2 variants accounts for 7–10% of CD cases and implicate NOD2 as a Mendelian disease gene for early onset Crohn’s Disease.

With the assumption that genetic risk has a disproportionate effect over environmental risk in early onset disease, recent studies have focused on pediatric IBD cases (diagnosed < 18y) [38][39][40]. Pediatric IBD patients comprise 20-25% of all IBD cases and typically have more clinically severe disease than adult-onset patients, often exhibiting disease of the upper GI tract, small bowel inflammation, and perianal disease as well as failure to thrive and poor clinical response 4,40. Results from GWAS conducted in this group of severely affected patients indicate that associated loci in early onset IBD significantly overlap with adult IBD loci, including both the NOD2 locus and an additional 28 CD-specific loci previously implicated in adult-onset IBD [41][42][43]. As the mechanism for these "common" IBD susceptibility loci in the pathogenesis of early onset IBD remains unclear 44, we performed whole-exome sequencing and rare variant analysis on a cohort of 1,183 pediatric onset IBD patients to elucidate the role of rare protein coding variation in IBD-associated genes, specifically NOD2, in this disease.

Subjects and methods
Samples. We obtained informed consent from all individuals included in this study; for minors under 18 years of age, parental informed consent was obtained. For pediatric IBD, we studied a cohort of 1,183 probands with pediatric onset IBD (ages 0-18.5 years), including their affected and unaffected parents and siblings, where available (total samples = 2,704). Individuals were consented for genetic studies under an IRB-approved protocol by the Toronto Hospital for Sick Children, Canada as part of the NEOPICS initiative (https://www.neopics.org/).
DiscovEHR participants are a subset of the Geisinger MyCode Community Health Initiative. The MyCode Community Health Initiative is a repository of blood, serum, and DNA samples from Geisinger patients that have been consented to participate in research and donate samples for broad research use, including genomic analyses that can be linked to de-identified electronic health record (EHR) information. DiscovEHR participants were consented in accordance with the Geisinger Institutional Review Board approved protocol, Study number 2006-0258.
Helsinki guidelines. All human experiments followed relevant guidelines and regulations according to the Declaration of Helsinki.
Exome sequencing. Sample preparation, whole exome sequencing, and sequence data production for both the pediatric IBD cohort and the DiscovEHR cohort were performed at the Regeneron Genetics Center (RGC) as previously described 45. In brief, 1 µg of high-quality genomic DNA was used for exome capture utilizing the NimbleGen VCRome 2.1 design. Captured libraries were sequenced on the Illumina HiSeq 2500 platform with v4 chemistry using paired-end 75 bp reads. Exome sequencing was performed such that > 85% of the bases were covered at 20× or greater. Raw sequence reads were mapped and aligned to the GRCh37/hg19 human genome reference assembly, and called variants were annotated and analyzed using an RGC-implemented cloud-based pipeline. Briefly, variants were filtered based on their observed minor allele frequencies at a < 2% cutoff using the internal RGC database and other public population control databases to filter out common polymorphisms and high frequency, likely benign variants in consideration of disease prevalence.
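The frequency filter described above can be illustrated with a minimal Python sketch. It is not the RGC pipeline: the variant identifiers, database names, and frequencies are invented, and taking the highest frequency seen across databases (with unobserved variants counted as frequency 0) is an assumption about how several sources might be combined.

```python
from typing import Dict, Optional

MAF_CUTOFF = 0.02  # the "< 2%" cutoff quoted in the text


def max_observed_maf(freqs: Dict[str, Optional[float]]) -> float:
    """Highest frequency reported by any database.

    A missing entry (None) means the variant was not observed there and
    is treated as frequency 0. This rule is an assumption of the sketch.
    """
    return max((f for f in freqs.values() if f is not None), default=0.0)


def passes_rare_filter(freqs: Dict[str, Optional[float]],
                       cutoff: float = MAF_CUTOFF) -> bool:
    """Keep a variant only if it is below the cutoff in every database."""
    return max_observed_maf(freqs) < cutoff


# Hypothetical variants with made-up frequencies.
variants = [
    {"id": "var_A", "freqs": {"internal": 0.0004, "public": None}},
    {"id": "var_B", "freqs": {"internal": 0.031, "public": 0.028}},
    {"id": "var_C", "freqs": {"internal": 0.012, "public": 0.019}},
]

rare = [v["id"] for v in variants if passes_rare_filter(v["freqs"])]
print(rare)  # ['var_A', 'var_C']
```

Using the maximum across sources is the conservative choice: a variant that is common in any one reference population is dropped even if it is rare elsewhere.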

DiscovEHR statistical analyses.
For NOD2 locus-specific statistical analyses in the DiscovEHR cohort, individuals with ICD diagnoses in their EHR consistent with IBD and carriers of NOD2 variants were annotated and filtered using the same pipeline as for the pediatric IBD cases. Odds ratios for all genetic models (additive, recessive and genotypic) were calculated using Fisher's exact test with no covariates.
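As an illustration of an odds ratio and Fisher's exact test under a recessive model, the following Python sketch collapses genotype counts into a 2 × 2 table and computes both quantities with the standard library only. The genotype counts are invented, the function names are hypothetical, and the two-sided p-value uses the common "sum of tables no more probable than the observed one" convention, which may differ from the software the authors used.

```python
from math import comb
from typing import Dict, Tuple


def recessive_table(cases: Dict[int, int],
                    controls: Dict[int, int]) -> Tuple[int, int, int, int]:
    """Collapse counts keyed by number of risk alleles (0, 1, 2) into
    (cases with 2, cases with 0/1, controls with 2, controls with 0/1)."""
    a = cases.get(2, 0)
    b = cases.get(0, 0) + cases.get(1, 0)
    c = controls.get(2, 0)
    d = controls.get(0, 0) + controls.get(1, 0)
    return a, b, c, d


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio of the 2 x 2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value from the hypergeometric distribution."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as probable as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))


# Made-up genotype counts, for illustration only.
cases = {0: 0, 1: 1, 2: 3}
controls = {0: 2, 1: 1, 2: 1}

table = recessive_table(cases, controls)
print(table, odds_ratio(*table), round(fisher_exact_two_sided(*table), 4))
```

With these toy counts the table is (3, 1, 1, 3), the odds ratio is 9.0, and the two-sided p-value is 34/70 ≈ 0.4857. An additive or genotypic analysis would use a different collapse of the same genotype counts.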
For large-scale association analyses, variants were annotated with snpEff using Ensembl 85 gene definitions 46. Gene definitions were restricted to transcripts with annotated start and stop codons, totaling 19,467 protein-coding genes. Predicted loss-of-function (pLoF) variants were defined as any of the following: variants leading to a premature stop codon, loss of a start codon, or loss of a stop codon; single-nucleotide variants or indels disrupting canonical splice donor or acceptor sites; and frame-shifting indels predicted to result in premature stop codons. Phasing of putative compound heterozygotes was performed as previously described for this cohort 47 using a combination of familial relationship based phasing 48 and population allele frequency based phasing with EAGLE 49. Biallelic pLoF and predicted deleterious missense variants with a MAF < 5% in the discovery set of 58,138 European ancestry individuals were aggregated at a gene level. Variants were aggregated for gene burden tests in two ways as previously described 45,50: pLoFs only and pLoFs plus missense variants (M3) predicted to be deleterious (pdNS) by five different bioinformatic prediction algorithms for functional effects, namely SIFT 51, LRT 52, MutationTaster 53, PolyPhen2 HumDiv, and PolyPhen2 HumVar 54. Genotypes were coded as follows: homozygous reference as 0, heterozygotes as 1, and homozygous alternative or compound heterozygous as 2. PLINK 1.9 55 was used to run Firth logistic regression under both additive and recessive models using the ICD10 K50 phenotype, which is the ICD10 diagnosis code for Crohn's disease [
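The 0/1/2 genotype coding described in the methods can be sketched in Python as follows. The representation of phased calls as pairs of haplotype indicators is hypothetical, and coding two heterozygous variants in cis as 1 is an assumption of this sketch, not something the methods state.

```python
from typing import List, Tuple

# One phased call per qualifying variant in the gene:
# (alt allele on haplotype 1, alt allele on haplotype 2), each 0 or 1.
PhasedCall = Tuple[int, int]


def gene_dosage(calls: List[PhasedCall]) -> int:
    """Gene-level code: 0 = no qualifying allele, 1 = one haplotype hit,
    2 = both haplotypes hit (homozygous alt or compound het in trans)."""
    hap1_hit = any(h1 for h1, _ in calls)
    hap2_hit = any(h2 for _, h2 in calls)
    return int(hap1_hit) + int(hap2_hit)


def recessive_code(dosage: int) -> int:
    """Recessive model: only biallelic carriers (code 2) count as exposed."""
    return 1 if dosage == 2 else 0


examples = {
    "homozygous reference": [],
    "heterozygous": [(1, 0)],
    "homozygous alternative": [(1, 1)],
    "compound het (trans)": [(1, 0), (0, 1)],
    "two variants in cis": [(1, 0), (1, 0)],  # assumed to code as 1
}

for label, calls in examples.items():
    d = gene_dosage(calls)
    print(f"{label}: additive={d}, recessive={recessive_code(d)}")
```

Under this coding an additive model uses the 0/1/2 value directly, while a recessive model contrasts code 2 against codes 0 and 1 combined.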

Results
An initial analysis of the exome sequencing data for pathogenic and expected pathogenic variants in genes known to cause monogenic forms of IBD in all probands from our pediatric IBD cohort identified 40 rare variants in 31 probands 56. Additionally, we performed trio-based analysis of 492 complete trios using a proband-based analytical pipeline to identify all recessive (compound heterozygous and homozygous), X-linked, and de novo variants of interest in the affected probands. In our initial analyses, we identified 10 families with recessive (compound heterozygous or homozygous), rare variants (MAF < 2%) in NOD2, all with a diagnosis of CD. We observed that some of the rare variants in these probands were inherited in trans from previously-reported CD risk alleles, mainly the p.G908R missense variant. We identified two individuals who are compound heterozygous for the p.G908R risk allele in trans with a less common NOD2 CD risk variant (p.N852S) in one case and a novel truncating indel (p.S506Vfs*73) in the second case (Supplementary Table 1, Fam008 and Fam009). The observation of a CD-associated NOD2 risk allele in trans from other rare or novel alleles led us to survey the rest of the probands, including singletons and those part of incomplete trios, for recessive inheritance of NOD2 variants, either in a homozygous or compound heterozygous manner, this time expanding our allelic range to low-frequency variants (2% ≤ MAF ≤ 5%). Through this approach we identified 108 probands with putative recessive NOD2 variants. Visual inspection of sequence reads and orthogonal confirmation through Sanger sequencing excluded 13 probands with variants inherited in cis from an unaffected parent or heterozygous variants that were initially called as homozygous due to low coverage of the region and skewed allelic balance.
Of note, we identified 5 probands carrying p.L1007fs and p.M863V risk variants, in 4 of whom the variants were confirmed to occur in cis and were inherited from an unaffected parent. The remaining case with p.L1007fs and p.M863V was a singleton and thus phase could not be determined. These two variants segregate in cis within the same haplotype, as confirmed by segregation within the trios and as previously observed 57. Therefore, we excluded these 4 probands from our final count of recessively-inherited NOD2 variants. Similarly, we identified 3 probands from 3 complete trios segregating the p.S431L and p.V793M reported risk variants in cis inherited from an unaffected carrier parent; these probands were also excluded. Three additional probands were excluded on the basis of a re-evaluation of the phenotype that excluded a clinical diagnosis of IBD. Thus, we identified 92 probands with confirmed recessive NOD2 variants within our pediatric onset IBD cohort, none of whom had variants of interest in known monogenic IBD associated genes. These included: 25 probands carrying homozygous variants, 41 probands with confirmed compound heterozygous variants, and an additional 26 singleton probands with putative compound heterozygous variants where phasing could not be performed (Supplementary Table 1, Supplementary Fig. 1). The majority of the compound heterozygous individuals (65/67) carry a known NOD2 CD-risk allele in addition to either another known NOD2 CD-risk allele or a novel NOD2 variant, including some truncating loss-of-function variants, supporting loss or impaired function of NOD2 in the pathophysiology of CD 6. In total, 92 of the 1,183 probands (7.8%) in our pediatric onset IBD cohort conformed to a recessive, Mendelian inheritance mode for NOD2 rare and low frequency (MAF ≤ 5%) deleterious variants (Table 1, Fig. 1, Supplementary Table 1, and Supplementary Table 4).
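The proband counts reported above can be checked with a short worked calculation in Python. One assumption is needed to reconcile the figures: the 13 probands excluded after read inspection and Sanger confirmation are taken to include the in-cis cases described afterwards, with the 3 phenotype-based exclusions counted separately. Variable names are illustrative.

```python
# Counts quoted in the text.
putative_recessive = 108
excluded_cis_or_miscalled = 13   # assumed to include the in-cis probands
excluded_on_phenotype = 3
cohort_size = 1183

confirmed = putative_recessive - excluded_cis_or_miscalled - excluded_on_phenotype

breakdown = {
    "homozygous": 25,
    "confirmed compound heterozygous": 41,
    "putative compound heterozygous (singletons)": 26,
}

# The genotype classes should add up to the confirmed total.
assert sum(breakdown.values()) == confirmed

fraction = confirmed / cohort_size
print(f"{confirmed} of {cohort_size} probands = {fraction:.1%}")
```

Under that reading the arithmetic gives 92 probands, matching the sum of the three genotype classes and the reported 7.8% of the cohort.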
The 92 pediatric patients carrying recessive NOD2 variants were predominantly male (71%) with a median age at diagnosis of 12.5 years (Supplementary Table 1). At diagnosis, 83% displayed diagnostic features of Crohn's disease. 23% of the cohort displayed a constellation of extra-intestinal manifestations, mainly large joint arthritis, chronic recurrent multifocal osteomyelitis, recurrent fevers, erythema nodosum, and pyoderma gangrenosum. Only 6% of the cohort showed significant perianal disease (namely fistulae and abscesses; skin tags and fissures were not considered as perianal disease) (Supplementary Table 1). Per the Montreal classification of IBD 58, 44% of the overall cohort of patients presented with ileal disease at diagnosis (L1), 25% presented with ileocolonic disease (L3), and 10% displayed features of colonic inflammation only (L2). Isolated upper disease was only […] Table 1).
Table 1. Mutation spectrum of recessive NOD2 variants in an EO-IBD cohort. Common NOD2 variants refer to the three main low-frequency Crohn's Disease risk variants p.R702W, p.G908R, and p.L1007fs; Rare NOD2 variants refer to other low-frequency variants (MAF ≤ 5%). Q, quartet; T, trio; D, duo; S, singleton; Dx, diagnosis.
Given the substantial contribution of recessive NOD2 variants to CD in our pediatric onset IBD cohort and the known contribution of NOD2 to adult CD, we next investigated the contribution of NOD2 recessivity in a large clinical population. For this, we examined a cohort of adult IBD patients from the Geisinger-Regeneron DiscovEHR collaboration 45. A key feature of the DiscovEHR study is the ability to link genomic sequence data to de-identified electronic health records (EHRs).
Within this cohort, we identified 984 patients (of 51,289 total sequenced DiscovEHR patient-participants) with a diagnosis of IBD, defined as having a problem list entry or an encounter diagnosis entered for two separate clinical encounters on separate calendar days for the ICD-9 codes 555* (Regional enteritis) or 556* (Ulcerative enterocolitis) or ICD-10 K50* (Crohn's disease [regional enteritis]) or K51* (Ulcerative colitis). For our analysis, we surveyed all instances of homozygous NOD2 rare and low frequency variants (MAF ≤ 5%), the same parameters applied to our pediatric IBD probands. Among patients with an IBD diagnosis, we identified 18 individuals who are either homozygous for the p.R702W risk allele (N = 10) or homozygous for the p.L1007fs allele (N = 8) (Table 2, Supplementary Fig. 2). We did not identify any p.G908R homozygous individuals with an IBD diagnosis in this cohort. Next, we looked for instances of putative compound heterozygosity among these adult IBD DiscovEHR patients. First, we searched for occurrences of two or more of the three most prevalent NOD2 risk alleles (p.R702W, p.G908R, or p.L1007fs) in these individuals. We identified putative compound heterozygosity for the three main CD risk alleles, p.R702W/p.G908R (N = 6), p.G908R/p.L1007fs (N = 5), and p.R702W/p.L1007fs (N = 11) (Table 2). We also observed instances of putative compound heterozygosity for each of the three main CD risk alleles along with either a rarer CD risk allele or a novel allele or two rare alleles in trans (N = 24), parallel to the findings in our pediatric IBD cohort. […] Fig. 2). The other 32 were singleton cases where phase could not be confirmed. Overall, we identified 64 homozygous or putative compound heterozygous NOD2 variant carriers in the DiscovEHR IBD cohort, accounting for 6.5% of patients with an IBD diagnosis in this clinical population (Fig. 1, Supplementary Table 4).
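The EHR case definition above can be expressed as a small Python sketch. The record layout (a list of problem-list codes plus dated encounter diagnoses) is hypothetical and does not reflect the actual DiscovEHR schema. The sketch also reads the definition as "a problem list entry, or encounter diagnoses on at least two separate calendar days", which is one interpretation of the sentence.

```python
from datetime import date
from typing import List, Tuple

# ICD-9 555*/556* and ICD-10 K50*/K51*, as listed in the text.
IBD_PREFIXES = ("555", "556", "K50", "K51")


def is_ibd_code(code: str) -> bool:
    """True if the diagnosis code starts with one of the IBD prefixes."""
    return code.strip().upper().startswith(IBD_PREFIXES)


def meets_ibd_definition(problem_list: List[str],
                         encounters: List[Tuple[date, str]]) -> bool:
    """Apply the assumed reading of the case definition to one patient."""
    if any(is_ibd_code(code) for code in problem_list):
        return True
    # Distinct calendar days on which an IBD code was entered.
    ibd_days = {day for day, code in encounters if is_ibd_code(code)}
    return len(ibd_days) >= 2


# Made-up patients for illustration.
two_visits = [(date(2010, 3, 1), "555.1"), (date(2011, 6, 9), "K50.00")]
same_day = [(date(2010, 3, 1), "556.9"), (date(2010, 3, 1), "556.9")]

print(meets_ibd_definition([], two_visits))   # True
print(meets_ibd_definition([], same_day))     # False
print(meets_ibd_definition(["K51.90"], []))   # True
```

Requiring two separate calendar days guards against a single rule-out or miscoded encounter being counted as a diagnosis.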
We were also able to evaluate longitudinal de-identified medical records for all patients within the DiscovEHR IBD cohort. According to their EHR data, 21 patients received diagnoses of both UC and CD. To clarify these diagnoses, we performed manual evaluation of EHR information (which includes demographics, encounter and problem list diagnosis codes, procedure codes, and medications) for all 64 homozygous or compound heterozygous NOD2 patients with an IBD diagnosis. Through this review, 6 homozygotes exhibited a conflicting diagnosis of CD, of which 5 were resolved as CD and 1 could not be defined; 16 compound heterozygotes exhibited a conflicting diagnosis of CD, of which 6 were resolved as CD and 10 were resolved as UC (Supplementary Table 3). In total, we found that 17/18 (94.4%) of homozygous NOD2 individuals and 33/46 (71.7%) of compound heterozygous individuals had a diagnosis of CD and that 9.9% of all CD cases in this cohort could be attributed to homozygous or compound heterozygous variants in NOD2. We next investigated age of disease onset using the first recorded date of an IBD diagnosis in the EHR. We identified 6 carriers of recessive NOD2 variants (9.4% of our recessive NOD2 patients with IBD) who were diagnosed with IBD prior to 18 years of age. We also identified an additional 11 carriers of recessive NOD2 variants diagnosed with IBD prior to age 30 years, which is at or below the average age of IBD diagnosis 59 and is consistent with earlier disease onset (Supplementary Table 3). Of note, our DiscovEHR cohort data extend to a median of 14 years (and maximum of 25 years) of electronically recorded medical information, concurrent with the adoption of the EHR by the Geisinger Health System.
Since 72.4% of our cohort is currently over the age of 50 years, we cannot determine whether the age of onset for IBD occurred prior to the first electronically recorded date of an IBD diagnosis for many recessive NOD2 patients; thus it is possible that other individuals with homozygous or compound heterozygous variants in NOD2 might have had pediatric-onset disease that was not captured in the EHR.

[Table 1 column headings: NOD2 variant; # EO-IBD probands; Mean age (range); % CD Dx]
Incidentally, our manual evaluation of the EHR data for these individuals also revealed that 75% of the IBD patients had a diagnosis record of anemia in their history. In about 58% of these cases the anemia diagnosis was given concurrent with or before the first recorded diagnosis of IBD, by an average of 2.26 years. This observation is consistent with previous reports of anemia as an important yet underappreciated and undertreated comorbidity in IBD 60,61, but also suggests that anemia may be an early indicator of IBD onset. Interestingly, of the 48 individuals homozygous for the p.L1007fs variant who do not have a diagnosis of IBD and whose EHR information we were able to review, 16 had a diagnosis of anemia in their chart and 11 of them had diagnosis codes related to gastrointestinal complaints. To further assess whether NOD2 genotype status is associated with other phenotypes, we performed a PheWAS analysis using all ICD codes recorded in the EHR of NOD2 homozygous and compound heterozygous individuals. This analysis showed that NOD2 recessivity significantly and specifically associates with Crohn's disease (Fig. 2).
Next, given the recessive inheritance of NOD2 variants observed in both our pediatric onset and adult IBD cohorts, we estimated the disease risk for the three main known CD risk alleles (p.R702W, p.G908R, and […] Fig. 2). Additionally, we calculated the relative risk for the identified putative compound heterozygous (pCHET) individuals under a recessive model. We observed that the effect size for the compound heterozygotes was also significant (OR = 4.35 [95% CI 2.80-6.75], P-value = 8.14 × 10⁻¹³), consistent with our previous observations (Table 3, Fig. 2). The calculated combined contribution of the 3 CD risk alleles under the different genetic models was as follows: additive […] Fig. 3). Subsequently, we combined all heterozygous, homozygous, and phased compound heterozygous predicted loss-of-function (pLoF) and predicted deleterious missense variants in NOD2 with a MAF ≤ 5%, including the 3 risk alleles, to calculate the CD risk using a burden test under additive and recessive models. The pLoF only burden analysis was significant under both the additive (P-value = 5.5 × 10⁻²⁰) and recessive (P-value = 2.67 × 10⁻¹⁹) models; however, the risk was much higher under the recessive model (OR = 20.74 [10.70-40.20 […] (Table 4).
Figure 3. Graphical representation of Odds Ratio (OR) point estimates and 95% confidence intervals (CI) for the three main CD risk alleles (p.R702W, p.G908R, p.L1007fs) under additive, genotypic, and recessive genetic models (corresponding to values in Table 3). The dotted line in the Composite panel depicts the calculated CI with corresponding calculated OR for 2 alleles under an additive genetic model; of note the point estimate (2xOR) is outside of the 95% CI for the Composite genotypic homozygous and recessive models. Diamonds correspond to estimated OR values for these same variants in the IBD Exomes Browser 49; no confidence intervals are provided.
Collectively, these analyses show substantially larger effects for NOD2 homozygotes and compound heterozygotes than heterozygotes only and indicate that the genetic contribution of NOD2 alleles, in a subset of Crohn's disease patients, is consistent with a recessive disease model.

Discussion
We use the term inflammatory bowel disease (IBD) throughout to encompass diagnoses of both Ulcerative Colitis and Crohn's disease in the DiscovEHR cohort, which is similar to the referral diagnosis of the pediatric patients, some of whom had diagnoses of ulcerative colitis, Crohn's disease, or IBD unspecified (Table S1). Furthermore, prior to the release of ICD-10 codes, there was no specific diagnosis code for Crohn's disease, as it was coded as 'regional enteritis' (ICD-9 555), lending itself to confusion and misdiagnoses. The DiscovEHR IBD cohort is not intended to be a 'pure' Crohn's disease cohort but rather a representative sample of the adult population that is diagnosed with IBD. Both the pediatric and adult cohorts reflect the clinical heterogeneity of patients diagnosed with IBD and the challenges of the clinical and molecular diagnosis of this disease.
Our observations are in line with previous analyses and meta-analyses of CD cohorts where individuals carrying any one of the main three CD associated risk alleles (p.R702W, p.G908R, or p.L1007fs) have a 2- to 4-fold increased risk for developing CD 63, whereas carriers of two or more of the same NOD2 variants have a 15- to 40-fold increased risk for developing CD 33,64,65, exhibiting disease of the terminal ileum 34, and earlier diagnosis (by an average of 3 years) 33. Our observations support these studies but highlight a subset of IBD cases molecularly defined by recessive inheritance of NOD2 alleles that exhibit markedly increased risk for CD with significantly earlier age of onset (mean age of onset among recessive NOD2 carriers in the DiscovEHR IBD cohort: 43.4y; mean age of onset in the DiscovEHR IBD cohort: 51.5y; P-value: 4.0 × 10⁻⁴ by unpaired t test).
Further, while we observe a low effect size for single allele carriers, based on our allelic effect size calculations for each of the 3 main CD risk alleles in our DiscovEHR cohort (Table 3, Fig. 3), we hypothesize that homozygous and compound heterozygous NOD2 individuals included in large IBD GWAS cohorts have likely contributed to a large proportion of the relative risk calculations for IBD, specifically for CD, under additive models, and that homozygous effect sizes have been largely underappreciated or underreported. It is possible that stratification or conditional statistical analysis of these large and heterogeneous cohorts based on NOD2 genotypes may increase power to detect other loci that contribute to IBD.
While our observations strongly support recessive inheritance of NOD2 variants as a driver of early onset Crohn's disease, we observed incomplete penetrance, as evidenced by homozygous or compound heterozygous NOD2 variant carriers that do not have a clinical presentation of IBD [65][66][67]. Penetrance and expressivity are two major genetic concepts that play into the onset of the phenotype and the clinical presentation of monogenic diseases 68. In the case of IBD, penetrance is known to be incomplete and clinical presentation is extremely variable. Further, the contribution of additional environmental triggers that may enhance disease onset and/or severity in an already genetically-compromised individual should not be underestimated, especially considering that the loss of epithelial barrier function occurring during IBD allows for host exposure to up to 10¹⁴ gut microbiota 69,70. Even in cases of monogenic IBD, such as IL-10 receptor deficiency [71][72][73], intestinal flora are required for disease presentation in murine disease models [74][75][76]. Furthermore, variation in genes involved in NOD2-dependent signaling pathways, including XIAP [77][78][79] and TRIM22 80, results in Mendelian forms of IBD. For XIAP, and most likely TRIM22, viral triggers are required for disease onset and progression, and XIAP mutations have variable penetrance, with only a small percentage of XIAP-deficiency patients developing CD (age of onset between 3 months and 40 years 64). As NOD2-deficient hosts are more susceptible to the pathogenic effects of a changing intestinal microenvironment 81, the contribution of either discrete or continuous gene-environment exposures may further explain heterogeneity in onset and presentation of disease for genetically-sensitized recessive NOD2 carriers.
www.nature.com/scientificreports/
Given the wide variability in clinical presentation of IBD, we cannot exclude the possibility that recessive NOD2 carriers exhibit subclinical phenotypes not formally diagnosed as IBD or that they may eventually develop IBD. It is additionally possible that recessive NOD2 carriers in the DiscovEHR cohort have a diagnosis of IBD that has not been captured in the EHR. Detailed investigation into the medical histories of recessive NOD2 carriers may shed light on this variable expressivity or incomplete capture of medical information. We also cannot exclude the possibility that recessive NOD2 carriers possess additional genes or alleles that either contribute to disease onset and severity or, alternatively, provide protection or reduced expressivity of the phenotype. Identification of these genetic modifiers warrants future investigation both to unveil additional IBD-risk associated loci for early onset UC and CD cases and to identify protective genes and alleles that can be used to derive therapeutic avenues for IBD treatment and management.
In summary, in a cohort of 1,183 pediatric and early onset IBD patients, we report recessive inheritance of rare and low frequency variants in NOD2 accounting for about 8% of probands. We assessed the contribution of NOD2 recessive inheritance in a broader, heterogeneous cohort of adult IBD patients, similar to those recruited for GWAS, and found that recessive inheritance of variants in NOD2 accounts for 6.5% of these IBD patients, including 9.9% of CD cases. Thus, recessive inheritance of rare and low frequency NOD2 variants explains a substantial proportion of CD cases in a pediatric cohort and a large clinical population, with significantly earlier age of disease onset. Consistently, both pediatric and adult CD exhibit a broad spectrum of clinical presentation, suggesting a shared etiology across age groups, at least in the subgroup defined by recessive NOD2-driven CD. Our findings indicate that deleterious NOD2 variants should be considered as strong predictors of IBD-CD onset and implicate NOD2 as a Mendelian disease gene for early onset IBD, specifically for a molecularly defined subset of Crohn's disease patients.