Exome Analysis of Rare and Common Variants within the NOD Signaling Pathway

Pediatric inflammatory bowel disease (pIBD) is a chronic heterogeneous disorder. This study looks at the burden of common and rare coding mutations within 41 genes comprising the NOD signaling pathway in pIBD patients. 136 pIBD and 106 control samples underwent whole-exome sequencing. We compared the burden of common, rare and private mutation between these two groups using the SKAT-O test. An independent replication cohort of 33 cases and 111 controls was used to validate significant findings. We observed variation in 40 of 41 genes comprising the NOD signaling pathway. Four genes were significantly associated with disease in the discovery cohort (BIRC2 p = 0.004, NFKB1 p =  0.005, NOD2 p = 0.029 and SUGT1 p = 0.047). Statistical significance was replicated for BIRC2 (p = 0.041) and NOD2 (p = 0.045) in an independent validation cohort. A gene based test on the combined discovery and replication cohort confirmed association for BIRC2 (p = 0.030). We successfully applied burden of mutation testing that jointly assesses common and rare variants, identifying two previously implicated genes (NFKB1 and NOD2) and confirmed a possible role in disease risk in a previously unreported gene (BIRC2). The identification of this novel gene provides a wider role for the inhibitor of apoptosis gene family in IBD pathogenesis.

. Proteins acting within the NOD signaling pathway. The recognistion of NOD1 and NOD2 of the bacterial peptidoglycan (PGN) promotes the formation of the multi-protein complex named inflammasome. The complex recruits the kinase receptor interacting protein 2 (RIP2), which is ubiquitinated by the ubiquitin ligases XIAP, BIRC2 and BIRC3 proteins. The polyubiquitinated RIP2 recruits the kinase TAK1 and TAK binding protein 1 (TAB1), TAB2 and TAB3 which leads to the activation of the MAPK kinases, p38 and c-Jun N-terminal kinase (JNK) through the activation of mitogen-activated protein kinase kinase (MKK). RIP2 polyubiquitinated also interacts with the IKK complex (IKKα , IKKβ and NEMO). The IKK complex mediates the phosphorylation of the IKKβ subunit of IKK by TAK1 and results in the phosphorylation and degradation of the NF-κ B inhibitor (Iκ Bα ) which results in the cytoplasmic release and translocation of NF-κ B dimers p65 and p50 in the nucleus to activate of the expression of the NF-κ B proinflammatory genes.
Scientific RepoRts | 7:46454 | DOI: 10.1038/srep46454  BIRC2 is a gene composed of 9 exons (black rectangles). BIRC2 encodes an inhibitor of apoptosis protein, which contributes to innate immune responses by acting as inhibitor of cell death. All the members of the inhibitor of apoptosis (IAP) gene family share three tandem specific motifs: BIR belonging to the zinc-finger domain that mediates protein-protein interaction, a CARD domain is involved in CARD-CARD mediated interaction; and a C-terminal RING domain conferring an E3-ubiquitin ligase activity. The RING domain of BIRC2, BIRC3, and XIAP is required for the ubiquitin activity of the IAPs. Studies have reported that the CARD domain in BIRC2 and BIRC3 act as an inhibitor of the ubiquitin ligase activity. Mutations within the BIR1 domain in BIRC2 alters molecular interaction with TNF receptor associated factor 1 (TRAF1) and TRAF2. The six mutations found by interrogating exome data from 136 pIBD and 106 controls are depicted using arrows and the corresponding protein change shown. have been prioritized within the 200 loci determined through adult studies and only less than half have been replicated in a small number of pediatric studies 15,16 .
To date, 51 genes have been associated with monogenic disease manifesting in an early onset IBD-like phenotype 17,18 . Homozygous mutations in the interleukin 10 receptor (IL10) gene and its associated receptor alpha and beta subunits (IL10RA and IL10RB) have been associated with children presenting very-early-onset IBD (VEO, age of onset < 6 years) 19,20 . The discovery of the disease causal mutations helped to personalize treatments inducing a sustained remission in the patients 19,20 . The investigation of a child with intractable IBD using whole-exome analysis by Worthey et al. 21 found a hemizygous mutation in the gene X-linked inhibitor of apoptosis (XIAP). The same mutation was confirmed in the asymptomatic mother. Based on these findings, this patient underwent hematopoietic stem cell progenitor transplantation with a resolution of symptoms and sustained remission following this targeted treatment approach 21 .
XIAP belongs to the in inhibitor of apoptosis protein (IAP) family (comprising XIAP, BIRC2 and BIRC3) that plays a role in regulation of the innate immune response through ubiquitin ligase activity, TNF survival, inflammatory and death signaling pathways 22,23 . IAP proteins mediate the downstream signaling of pattern-recognition receptors such as NOD1 and NOD2 after response to bacterial pathogens 24 .
The NOD signaling pathway, Fig. 1, is involved in gram-negative and gram-positive peptidoglycan recognition. NOD1 and NOD2 proteins are highly conserved cytoplasmatic receptors that sense microbial effectors. Activation of NOD receptors leads to downstream activation of multiple molecules including mitogen-activated protein kinases (MAPK) and nuclear factor kappa-light-chain-enhancer of activated B cells (NF-κ B) 25 .
In this study, we hypothesize that rare and private genetic variation across genes involved in the NOD signaling pathway may contribute to childhood onset IBD. We interrogate WES data to extract all genetic variation across the frequency spectrum in a pIBD cohort and evaluate the joint effect of rare and common variants with a gene-based statistical test (SKAT-O 26 ). We further validate our findings in an independent cohort.

Results
PCA procedure removed 10 cases and 20 controls reducing the final number of cases to 136 and controls to 106 within the discovery cohort (Supplementary Figure 1). The analysis revealed no outliers in the replication cohort. (Supplementary Figures 2 and 3).
Mutations were identified in either cases and/or controls in all but one gene (CCL5) from the NOD signaling pathway in the discovery cohort (n cases = 136 and n controls = 106). A total of 250 variants (Supplementary Table 2) that occurred in at least one individual (either case or control) across 41 genes were called in order to extract and create the VCF file for all 242 individuals. We observed 67 novel variants, 94 rare variants with a MAF 1000 genome project (1 KG) < 1%, 41 low frequency mutations (1% ≤ MAF 1 KG ≥ 5%) and 48 common mutations (MAF 1 KG > 5%). The majority of these variants would not have been detected or interrogated using array technology or traditional association studies.
Variants within the NOD2 gene. Across the 126 pIBD cases and 85 controls within the discovery cohort, we observed 31 mutations over 12 exons of the NOD2 gene. Of these, 26 had a MAF < 0.05 across the cohort (Table 1). Eight mutations were identified in or proximal to the caspase recruitment (CARD) domain, 16 in the nucleotide-binding oligomerization (NBD) domain and seven in the leucine-rich (LRR) domain (Fig. 2). In addition to the known IBD biomarkers, Arg702Trp, Gly908Arg and Leu1007fsinsC 3,27 , we observed two novel variants, 20 rare (MAF 1 KG < 0.01), two low frequency (0.01 ≤ MAF 1 KG ≤ 0.05) and four common mutations (MAF 1 KG > 0.05) ( Table 1). Ten of the 26 mutations were annotated as deleterious by SIFT and 13 are described in HGMD as pathogenic 28 . Twenty six (out of 31) mutations observed would not have been assessed in any GWAS due to their rarity.
Gene based burden of mutation testing in the discovery cohort. The gene-based test for assessing the combined association of novel, rare and common mutation with disease status showed significant evidence for association with four genes across the discovery cohort (BIRC2, NFKB1, NOD2, and SUGT1 see Table 2). NFKB1 (p = 0.005) and NOD2 (p = 0.029) are known IBD associated genes. SUGT1 is a previously unreported gene but has borderline significance only (p = 0.047). Combined variation in BIRC2 is more significantly associated (p = 0.004) with IBD in our discovery cohort than any other genes. This gene has not been previously implicated by association studies.

Replication of the gene based burden of mutation test in the validation cohort.
We aim to conduct a replication analysis of the four gene identified as significant in the discovery phase using a replication   cohort (n cases = 33; n controls = 111). A total of 13 variants were identified across the regions sequenced in the NFKB1, BIRC2, NOD2 gene. No variant was observed in SUGT1 in the replication cohort and therefore SKAT-O test was not conducted on this gene. SKAT-O test showed independent statistical association for BIRC2 (p = 0.041) and NOD2 (p = 0.045) but was not powered to detect significant association for NFKB1 (p = 0.223), Table 3. The gene based test on the combined discovery and replication cohort (n cases = 169 and n controls = 217) showed statistical association for NOD2 (p = 0.011), NFKB1 (p = 0.017) and BIRC2 (p = 0.030), Table 3.

Discussion
Since 2005 next generation sequencing (NGS) has proven to be an effective technology for the study of rare and low frequency mutations within disease-associated genes 29 . More than 100 types of Mendelian disorders have been studied using WES with a diagnostic rate of success of 25-30% 30 . This success represents a substantially higher rate than that afforded by classical clinical genetic testing such as karyotyping (< 5%) or array comparative genomic hybridization (~15-20%) 30 The combination of traditional genetic testing and WES/WGS technology has rapidly accelerated the discovery of new disease-associated genes underlying Mendelian traits: from an average of 166 per year between 2005 30 and 2009 to 236 per year between 2010 and 2014 30 , with the numbers increasing every year. WES/WGS has made gene discovery for all phenotypes feasible and cost effective 30 . The rapid growth and success of the next generation sequencing technologies in Mendelian traits has brought a great interest in their application to complex traits. WES and WGS have enable diagnosis and alternative treatment in patients with monogenic IBD 18 . In our study we applied WES and the SKAT-O statistical test on a discovery cohort of 242 individuals. We conducted the analysis with no assumption with regard to IBD diagnosis (CD, UC or IBDU) because in half of the families recruited in the study we observed mixed diagnoses reflecting the substantial genetic overlap between IBD subtypes. Although our data were derived from whole exome sequencing, we did not conduct SKAT-O on all gene across the exome due to our modest sample size. Instead, we targeted our analysis to all 41 genes across the NOD signaling pathway removing the requirement of an exome-wide significance threshold 31 . We chose to select the most significantly associated genes and to replicate their significance in an independent replication cohort. A limitation of the replication analysis was the use of data gleaned from different sources. Although an established method to take into account such differences is not yet available 32,33 , we minimized bias by analyzing only variants that occurred in the regions common to all capture kits.
Despite a modest cohort size, we detected significant association in four genes and replicated significant association for two genes (NOD2 and BIRC2).
NOD2 is the earliest gene implicated in IBD pathogenesis and the most strongly associated in association studies with IBD 34 . Polymorphisms within NOD2 are known to increase the risk of developing CD. 35 NOD2 patient carriers of one of the three allelic biomarker variants have an increased risk of developing CD: heterozygous carriers have a 2-4-fold increased risk of CD, while homozygous or compound heterozygous carriers have a 20-40-fold increased risk 34 .The association for NOD2 was solely driven by the three known biomarkers (Table S3). BIRC2 (Fig. 3) belongs to a gene family (XIAP, BIRC2 (also known as cIAP1) and BIRC3 (also known as cIAP2)) encoding three conserved proteins characterized by the presence of 1-3 baculovirus IAP repeat (BIR) motifs 36 . XIAP is located on the X chromosome while BIRC2 and BIRC3 are both located on chromosome 11. Several studies have demonstrated the importance of these genes in regulating the expression of proinflammatory cytokines, such as TNFα , through NF-kB and MAPK pathways primarily through their ubiquitin-ligase activity. XIAP, BIRC2 and BIRC3 are key players in regulating the NOD1 and NOD2 signaling pathway by directly promoting RIPK2 ubiquitylation and they facilitate activation of NF-kB pathway to promote cell survival 37 . Cellular studies on BIRC2, BIRC3 and XIAP deficient macrophages were defective for MAPKs and NF-kB activation 23,38 . This defect in the NOD signaling was also further observed in vivo in BIRC2, BIRC3 and XIAP knockout murine IBD models 38 . BIRC2 and BIRC3 are inhibitors of the Fas signaling cascade in human intestinal cell line 23 . The expression profile of BIRC3 was further investigated in 14 UC patients indicating an overexpression in colonic specimens during disease flares 39 . Additional studies on the interleukin (IL)-11 expression suggested a possible protective role of IAP, indicating that an over-expression of the IAP proteins could promote healing of the gut 40 . It is therefore feasible that mutations within these genes might impact gut healing and contribute to flares in IBD. Six variants within BIRC2 were observed in the discovery cohort across 15 cases and 4 controls. Three of these were novel (p.112_113del, p.S154A and p.G517E), two were rare (p.K516E and p.S318S) and one was low frequency (p.A506V,). Across the 15 cases (four with CD, four with IBDU and seven with UC), four were diagnosed aged < 6 years, seven had a positive family history for IBD and nine were diagnosed with a second autoimmune condition other than IBD. While our observed enrichment of variation within BIRC2 directly implicates this gene in pediatric IBD, further functional analyses are necessary for a comprehensive understanding of the role of individual variants in this protein and their wider impact on the signaling pathway. While mutations in XIAP are known to cause up to 4% of male early onset IBD, it is has been postulated that BIRC2 and BIRC3 might contribute to IBD pathogenesis by regulating the inflammatory cascade through their ubiquitin-ligase activity, our findings are the first to directly implicate this genes in pIBD 41 .
Novel drugs that mimic the natural endogenous inhibitor of the IAP (the mitochondria-derived activator of caspases, SMAC) have been proposed to suppress the pro-inflammatory immune response in the gastro-intestinal tract for patient with moderate to severe disease activity 42 . It is possible that patients harboring BIRC2 mutations may benefit from new treatments targeting IAP expression and function. Further studies are required to assess the role of targeted therapy in the clinical management of these patients. Conclusions A gene based burden of mutation test for association using sequencing data on a small cohort have supported the involvement of NFKB1 and NOD2 in the pathogenesis of IBD and have confirmed a role for BIRC2 in the pathogenesis of disease. This is the first study highlighting the role of BIRC2 in IBD through targeted exome sequencing.

Ethics statement. This study was approved by the Southampton and South West Hampshire Research
Ethics Committee (REC) (09/H0504/125) and University Hospital Southampton Foundation Trust Research & Development (RHM CHI0497).
This study was approved by the Institutional Review Board (IRB) at The Children's Mercy Hospital (IRB #15050179).
All methods were conducted in accordance with the relevant guidelines and regulations. Written informed consent was obtained for every participant.
Cases and samples. For the discovery cohort, patients were recruited through pediatric gastroenterology clinics at University Hospital Southampton (UHS), a regional center providing tertiary pediatric gastroenterology and endoscopy service for the Wessex region in Southern England. Written informed consent was provided by an attending parent or legal guardian for all pediatric recruits. All children aged < 18 at the point of diagnosis were eligible for recruitment to the study. The mean age of the cohort was 10.97 years (min 1-max 17 years). Diagnosis was established using the Porto criteria. Clinical data were recorded for each patient including family history of IBD and any history of autoimmune disease. We accessed control samples through our local database of germline exome sequence data for 126 unrelated patients with no inflammatory-related disease.
We used an independent replication cohort derived from the Children's Mercy Kansas City IBD cohort and the Critical Assessment of Genome Interpretation (CAGI, 2013) 43 dataset to validate significant results from the discovery cohort. The Children's Mercy Kansas City cohort consists of 43 whole-exome individuals of which 13 are independent IBD patients and 1 control; the CAGI dataset is composed of 66 whole-exomes datasets (in VCF format) of which 20 are unrelated adult CD patients and 8 are unrelated healthy controls. We merged 102 additional whole-genome control samples of British ethnicity from the 1 KG phase 3 dataset 44 resulting in the retention of 33 unrelated cases and 111 independent controls for subsequent analysis in the validation cohort.
Discovery cohort DNA extraction. Genomic DNA for each of the Southampton patients undergoing exome sequencing was extracted from saliva or peripheral venous blood samples collected in EDTA using the salting out method. DNA concentration was estimated using the Qubit 2.0 Fluorometer and α 260:280 ratio calculated using a nanodrop spectrophomter. The average DNA yield obtained was 150 μ g/ml and approximately 20 ug of each patient DNA was extracted for next generation sequencing.

Whole-exome sequencing data generation and analysis. For the Southampton discovery and
Children's Mercy Kansas City cohort, whole-exome capture was performed using Agilent SureSelect Human all Exon 51 Mb (versions 4 and 5) capture kits and TruSeq Expanded Exome and Nextera Expanded Exome capture kits. Capture technology is characterized by rapid progress, including new content and improved probe design, and we applied the optimal capture chemistry available at the time of sample sequencing. All samples were sequenced on the Illumina HiSeq 2000 and HiSeq 2500 platforms. As previously described 45 , fastQ raw data generated from Illumina paired-end sequencing protocol were aligned against the human genome reference 19 using Novoalign (2.08.02). SAMtools mpileup tool (samtools/0. 1.19) to call SNPs and short indels. Variants called with a read depth < 4 were excluded. The Phred software reads DNA sequencing trace files, calls bases, and assigns a quality value to each called base and is powered to discriminate between correct and incorrect base-calls. To minimize the false positive rate for the called bases, only variants called with high confidence (Phred score > 20) were retained for further analysis (99% base call accuracy). ANNOVAR (annovar/2013Feb21) was then applied for variant annotation. Genetic variants were annotated as "novel" if they were not previously reported in the dbSNP137 databases, 1000 Genomes Project (1 KG) and the Exome Variant Server (EVS) of European Americans of the NHLI-ESP project with 6500 exomes, or in the Southampton database of reference exomes. Resultant variant call files for each individual were subjected to further in-house quality control tests to detect DNA sample contamination and ensure sex concordance by assessing autosomal and X chromosome heterozygosity. Variant sharing between all pairs of individuals was assessed to confirm that subjects were not related. Sample provenance was confirmed by application of a validated SNP tracking panel developed specifically for exome data 46 .
For the CAGI subgroup of the replication cohort, whole-exome sequencing was performed using the TruSeq capture kit and sequenced on Illumina platforms. Alignment against the human genome (hg19) was conducted with BWA. PICARD was used to remove duplicate reads and GATK for genotype calling. The VQSR method was used to identify true polymorphisms in the samples rather than those due to sequencing, alignment, or data processing artefacts 43 . Gene selection. Genes involved in the NOD receptor pathway were extracted by interrogating the KEGG Pathway database 47 . The pathway (KEGG ID: hsa04621) is composed of 56 genes, of which 41 are intrinsic to NOD signaling. Gene names were cross-referenced with the HUGO webserver to confirm the approved gene symbol (Supplementary Table 1 Principle component analysis. Whole-exome sequencing data were available for 146 independent children diagnosed with IBD within the discovery cohort. Demographic data for the IBD cohort are shown in Table 4. In order to minimize bias for association analysis, we conducted a principle component analysis (PCA) using the SNPRelate 48 package on the discovery and validation cohort to exclude non-Caucasian samples. PCA was conducted on the whole discovery dataset merged with the 1,092 subjects from the 1,000 genome phase 1 dataset (20101123) in order to discriminate ethnic clusters. PCA was applied to 1363 samples with 305,950 biallelic SNPs. The same PCA procedure was conducted on the validation cohort using a combined set of CAGI and 1,000 genome phase 1 data (209,029 biallelic SNPs across 1158 samples) and on the combined Kansas and 1,000 genome phase 1 data (224, 786 biallelic SNP across 1134 samples) to discriminate ethnic clusters.
Variant calling and quality control. Next generation sequencing pipelines typically identify genomic locations at which any given sample differs from the human genome reference sequence on a case-by-case basis. After compiling the list of all variants identified in all cases and controls it was necessary to positively re-call the genotypic state (for the full set of all variants from all samples) in order to distinguish allelic genotypic status from missing data for each individual. The resultant genotypes were used for further analysis. Variants were excluded using vcftools if they deviated significantly from Hardy-Weinberg equilibrium status (p < 0.001) in the control group. Samples with a genotype missing call rate > 95% were also excluded. VCF files containing genotypic information for all cases and controls were merged and annotated.
To detect association between genetic variant and disease status, a gene-based test (the sequence kernel association optimal unified test 26 , SKAT-O) was performed using the EPACTS software package 49 in the discovery cohort. SKAT-O test was further conducted on the replication cohort to validate significant results from the discovery cohort.
Burden of mutation testing in the discovery cohort. SKAT-O statistical test was applied to further investigate the joint effect of rare and low frequency variants. Specifically, SKAT-O encompasses both a burden test and a SKAT test to offer a powerful means of conducting association analyses on combined rare and common variation as single variant tests are often underpowered due to the large sample size needed to detect a significant association.
To conduct the test, a group file with mutations of interests (synonymous, non-synonymous, splicing, frameshifts and non-frameshifts, stop gain and stop loss) was created for each of the 41 genes. SKAT-O was executed with the small sample adjustment, by using a MAF threshold of 0.05 to define rare variations within the sample size and using default weights 26 . Burden of mutation testing in the validation cohort. As the validation cohort comprises of whole-exome and whole-genome subjects, only variants falling within the consensus target region were considered. By limiting variants assessed to only those found in the genomic regions captured by both technologies, we limited the potential for bias when using data from two different capture technologies. Variant sites across the four genes requiring replication were used to generate a subset of the VCF file for each dataset. Ultimately, VCF files for all individuals were merged and annotated. SKAT-O testing was conducted using the same settings applied in the discovery cohort.
SKAT-O testing was further conducted using the same approach on the combined discovery and replication cohorts (n cases = 169 and n controls = 217).