Whole-exome sequencing in obsessive-compulsive disorder identifies rare mutations in immunological and neurodevelopmental pathways

Studies of rare genetic variation have identified molecular pathways conferring risk for developmental neuropsychiatric disorders. To date, no published whole-exome sequencing studies have been reported in obsessive-compulsive disorder (OCD). We sequenced all the genome coding regions in 20 sporadic OCD cases and their unaffected parents to identify rare de novo (DN) single-nucleotide variants (SNVs). The primary aim of this pilot study was to determine whether DN variation contributes to OCD risk. To this aim, we evaluated whether there is an elevated rate of DN mutations in OCD, which would justify this approach toward gene discovery in larger studies of the disorder. Furthermore, to explore functional molecular correlations among genes with nonsynonymous DN SNVs in OCD probands, a protein–protein interaction (PPI) network was generated based on databases of direct molecular interactions. We applied Degree-Aware Disease Gene Prioritization (DADA) to rank the PPI network genes based on their relatedness to a set of OCD candidate genes from two OCD genome-wide association studies (Stewart et al., 2013; Mattheisen et al., 2014). In addition, we performed a pathway analysis with genes from the PPI network. The rate of DN SNVs in OCD was 2.51 × 10−8 per base per generation, significantly higher than a previous estimated rate in unaffected subjects using the same sequencing platform and analytic pipeline. Several genes harboring DN SNVs in OCD were highly interconnected in the PPI network and ranked high in the DADA analysis. Nearly all the DN SNVs in this study are in genes expressed in the human brain, and a pathway analysis revealed enrichment in immunological and central nervous system functioning and development. The results of this pilot study indicate that further investigation of DN variation in larger OCD cohorts is warranted to identify specific risk genes and to confirm our preliminary finding with regard to PPI network enrichment for particular biological pathways and functions.


INTRODUCTION
The etiology of common neuropsychiatric disorders is believed to be multifactorial, with contributions from both environmental and genetic factors. 1 Studies have predominantly focused on common polymorphic variants, finding small effect sizes, and replication of positive findings has been difficult. 2 This raises the question of where the 'missing heritability' of complex diseases might be found. A substantial portion may indeed reflect gene-gene interactions and gene-environmental interactions that have not been taken into consideration in estimates of narrow sense heritability. 3 In addition, understanding the heritability of genetic diseases requires a more comprehensive assessment of human genetic variation, including rare variation, throughout the genome. 3,4 The emergence of next-generation sequencing platforms is facilitating comprehensive searches for both rare and common single-nucleotide variants (SNVs) across all genes in the genome via whole-exome sequencing (WES). 5 Although protein-coding genes constitute only about 1% of the human genome, they are estimated to harbor 85% of mutations with large effects in disease-related traits. 6 Several groups have identified genes conferring risk for intellectual disability, 7,8 schizophrenia 9,10 and autism [11][12][13][14] by applying WES in families with no previous history of these disorders or their related phenotypes-so-called sporadic or simplex families-and identifying recurrent de novo (DN) SNVs. The DN SNVs tend to be more common in patients than in controls or unaffected siblings, mainly when such variations are nonsynonymous and located in brain-expressed genes. 10,11,15 Obsessive-compulsive disorder (OCD) is a severe neuropsychiatric disorder, commonly having an early age of onset, and is characterized by the presence of obsessions (unwanted, intrusive thoughts) and compulsions (repetitive behaviors) that can become incapacitating. 16,17 Family, twin, segregation and linkage studies suggest a complex genetic etiology, complicating the confirmation of specific risk variants; consequently, the cellular and molecular mechanisms underlying OCD pathophysiology remain uncertain. 18 A meta-analysis of common variant genetic association studies of OCD found multiple polymorphisms with significant association. 19 Significant variants were located in serotonergic genes, and, in males only, catecholamine modulation genes. 19 The first published genome-wide association study (GWAS), querying common polymorphisms in a large cohort of OCD patients, did not find any variants reaching the genome-wide significance threshold, but top signals in several common variants were related to transcriptional regulation, cytoskeleton dynamics, ion channel assembly and gating, protein ubiquitination and degradation, and glutamate signaling. 20 The latest GWAS in OCD 21 also found no variants reaching genome-wide significance but showed significant overlap with top signals from the first GWAS. Furthermore, network analysis of top signals in these GWAS studies support the idea that genetic risk for OCD may cluster in certain biological networks or systems. 22 There have been few studies examining rare variants in OCD. [23][24][25][26][27][28][29][30][31] The first genome-wide investigation of rare copy number variation (CNV) in OCD and Tourette syndrome reported a 3.3-fold increase in large (4500 kb) deletions that overlap with CNVs reported in other neurodevelopmental disorders. An overall DN CNV rate of 1.4% was reported in OCD, slightly higher than the estimated frequency for controls, suggesting that rare DN variation may have a role in OCD pathogenesis. 32 To date, there have been no published studies examining rare coding SNVs across the genome in OCD.
In the current pilot study, we examined 20 simplex OCD parentchild trios using WES to detect SNVs. Given the recent success in other neuropsychiatric disorders, we focused on detection of DN SNVs in OCD, aiming to advance our understanding of the contribution of rare nonsynonymous DN SNVs to the disorder. Next, we mapped these SNVs onto a protein-protein interaction (PPI) network, hypothesizing that perturbations of integrated molecular networks through genomic and environmental influences can increase the risk for complex diseases such as OCD 33 and that topological properties of PPI networks represent molecular functional correlations among genes that are important for understanding disease biology. We then performed Degree-Aware Disease Gene Prioritization (DADA), ranking our SNVs against 'seed' genes identified from two OCD GWAS, predicting that top-ranked genes in this analysis would achieve greater connectivity in our PPI network. Finally, we asked whether all the genes in our PPI network were enriched in certain canonical biological pathways, in an attempt to enhance our understanding of OCD pathophysiology.

MATERIALS AND METHODS Subjects
This study was approved by the Research Ethics Committees of the University of São Paulo School of Medicine, as well as by the Brazilian National Commission of Research Ethics (CONEP, process number: 16756). All the participating subjects gave written informed consent.
The OCD patients, meeting DSM-IV criteria for the diagnosis, and their unaffected parents, were recruited at the Outpatient Clinic of the Obsessive-Compulsive Spectrum Disorders Program of the Institute of Psychiatry, at the University of São Paulo School of Medicine Hospital das Clínicas. The probands were evaluated by semi-structured and structured interviews included in the Brazilian Research Consortium on Obsessive-Compulsive Spectrum Disorders instruments, administered by trained clinicians (Supplementary Methods). 34 The parents were directly screened with the Structured Clinical Interview for DSM-IV Axis I Disorders; those with any Axis I psychiatric diagnosis were excluded.

Capture and sequencing
Exome capture, sequencing and variant detection were performed at the Yale Center for Genomic Analysis, as described previously 11 and summarized below.
The DNA samples from whole blood were enriched for exonic sequence with the NimbleGen SeqCap EZ Exome v2 capture library (Roche NimbleGen, Madison, WI, USA). The samples were sequenced using the Illumina HiSeq 2000 platform (74 bp paired-end reads; Illumina, San Diego, CA, USA). We multiplexed four samples during each capture reaction and sequencing lane, pooling parents and probands when possible.

Sequence alignment and variant calling
Short-read sequences were aligned to the human reference genome (hg19/NCBI 37) using the Burrows-Wheeler Aligner (http://bio-bwa. sourceforge.net/). 35 The aligned reads were trimmed to the exome target using an in-house script. The trimmed and aligned data were converted to a sorted binary format, and duplicates were removed using SAMtools (http://samtools.sourceforge.net/), 36 which was also used to identify the SNVs. SAMtools was selected for initial variant calling in order to replicate the bioinformatic pipeline used to analyze control subjects in the literature, which served as our comparison cohort for the overall rate of DN SNVs. 11 For subsequent analyses of PPI networks, pathways and disease gene prioritization, we also considered confirmed variants from a second alignment and variant calling pipeline which followed the GATK v3 best practices guidelines (Supplementary Methods). The purpose of adding variants from this second pipeline was to ensure discovery of the maximum number of DN variants for our downstream analyses. The variants from both pipelines were annotated against the UCSC gene definitions (http://genome.ucsc.edu/) for impact on the encoded protein (silent, missense, nonsense and splice site) and for population allele frequency using ExAC v0. 3. 37 Family relatedness check A panel of informative genotypes was used to perform identity-by-descent in each study subject using the PLINK software package (http://pngu.mgh. harvard.edu/~purcell/plink/). 38 Families were discarded if expected relationships did not confirm.

DN SNVs
SNVs were predicted to be DN if all the following criteria were met: (1) the variant was not predicted in either parent, (2) at least eight unique reads supported the variant in the proband and (3)

PPI network analysis
Next, we mapped interactions between genes harboring validated nonsynonymous (missense or nonsense) DN SNVs by constructing a PPI network. A network is a set of elements interacting with each other through pairwise interactions. In the case of biological PPI networks, the components are the gene proteins (nodes) that are connected to each other by links (edges) representing known physical interactions between two components. We used Cytoscape 39 and the iRefScape plugin 40 to create a PPI network. This network is based on databases of direct physical interactions between genes, with proteins as nodes and interactions as undirected edges. 41 We included PPIs that were present in any of the 10 databases consolidated within iRefIndex 40 (Supplementary Methods). We applied two methods of permutation, 'seed randomization' and 'network permutation,' using GeneNet toolbox for MATLAB 42 to show that the connectivity metrics of our network are different from networks generated by chance.

DADA analysis
To investigate the potential relevance of our DN SNVs in OCD, we applied Degree-Aware Disease Gene Prioritization (DADA; http://compbio.case. edu/dada/), a tool for performing network-based prioritization of candidate disease genes. 43 DADA analysis of our candidate genes harboring validated DN SNVs detected by WES were performed with the 'seed' genes (curated from other studies as likely to be associated with the disorder) from two OCD GWAS reported to date 20,21 (Supplementary Methods). Prioritization is made based on the proximity of interaction (protein-protein interaction) of these seed genes with our WES candidate genes, using the NCBI Entrez Gene Database, which integrates human protein-protein interaction network data from several other databases such as HPRD, BioGRID and BIND. 44 De novo mutations in obsessive-compulsive disorder C Cappi et al

Pathway analysis
To determine whether the list of all genes identified in our PPI network showed enrichment for specific biological pathways, we used Ingenuity Pathway Analysis (build version 355958M, content version 24718999; Ingenuity Systems, http://www.ingenuity.com/) to map genes to canonical pathways (Supplementary Methods). We also examined whether our confirmed DN missense variants in OCD were enriched for published lists of genes believed to contribute risk for autism, schizophrenia and intellectual disability 45 using DNENRICH. 46 Finally, we asked whether our PPI network gene lists showed enrichment for these disorders, calculating P-values in R (function phyper) based on a hypergeometric distribution. A summary of all the methods and analyses is shown in Figure 1.

Subjects
Our final analysis included WES data from 17 parent-child trios, each trio consisting of a child with OCD and their unaffected parents. Although we started with 20 trios, three were omitted from the analysis due to failing the family relatedness check.
Among the 17 OCD trios included in our analyses, the mean age at symptom onset was 8.6 (±2.7) years and the mean Y-BOCS score was 26       Using the GATK v3 best practices variant calling pipeline, we confirmed eight additional nonsynonymous (all missense) DN SNVs, which we included in downstream analyses, for a total of 20 nonsynonymous DN SNVs (Table 1, Figure 1). There were no genes that harbored more than one confirmed DN SNV among the OCD subjects.

PPI network and DADA analyses
To construct the PPI network, we selected the genes with nonsynonymous (missense, nonsense) DN SNVs detected by our GATK and SAMtools pipelines ( Figure 1, Table 1). Of the 20 genes harboring confirmed nonsynonymous DN SNVs, six genes (FAM5B, CCDC108, VCX2, MUC5B, ARHGAP6, SLC35G5) were not present in the PPI databases. Therefore, 14 genes served as the input for construction of our PPI network (Table 1). In the resulting PPI network of 320 nodes, two genes from our original inputs (WWP1, SMAD4) were found to be highly and independently interconnected with other non-neighboring genes (that is, they were found to be 'brokers'), displaying 40 and 187 interactions (edges), respectively; six genes were found to be 'bottleneck' genes (WWP1, SMAD4, CR1, AP1G1, MYO10, SNUPN) that connect different complexes or pathways in the network; and three were found to be 'bridge' genes (BAMBI, ABCE1, NDE1) with high information flow, located between highly connected modules. (Figure 2 Table 5). Network permutations showed that connectivity metrics of 'seed indirect degrees' mean and 'common connectors means' for our network are different from these metrics in the permuted networks (1000 permutations).
Next, we performed DADA analysis to rank our 14 nonsynonymous DN SNVs in OCD subjects, using top genes from two OCD GWAS 20,21 as seed genes (Supplementary Table 6) to rank our DN SNVs. Genes with the highest prioritization rank using seeds from the first GWAS 20,21 were BAMBI, followed by SMAD4 and WWP1. Using seed genes from the second GWAS, 20,21 highest prioritization rank was seen for SMAD4, followed by WWP1 and MYO10. Of these four genes ranked highly by DADA, two are considered broker genes (WWP1, SMAD4) in our PPI network, three are considered bottleneck genes (WWP1, SMAD4, BAMBI, MYO10), and one is considered a bridge gene (BAMBI; Supplementary Table 5).
Pathway analyses Finally, a pathway analysis was performed to determine whether the nodes in our PPI network are enriched in canonical pathways from the Ingenuity Pathway Analysis database. We performed two pathway analyses using different gene lists as input: (1) using all 320 nodes in the PPI network (Supplementary Table 5) and (2) using 37 nodes from the PPI network determined to be brokers, bridges or bottlenecks by measurements of topological centrality (Supplementary Table 5), as these nodes may carry greater functional significance. 49,50 The analysis of all PPI network nodes found significant enrichment for pathways related to transforming growth factor beta (TGF-β) signaling, bone morphogenic protein signaling and glucocorticoid receptor signaling (Table 3). Narrowing the input list to bridges, brokers and bottlenecks also yields enrichment in TGF-β and glucocorticoid receptor signaling. Functional network enrichment for these central PPI nodes includes embryonic development, cell-to-cell signaling, cell death and survival, and cellular function and maintenance (Supplementary Table 9). There was no significant overlap between these same gene lists and lists of risk genes for autism, schizophrenia and intellectual disability 45 (Supplementary  Table 10).

DISCUSSION
To our knowledge, this is the first reported WES study in OCD designed to search for rare DN SNVs across all the coding regions of the genome in parent-child trios. As with similar studies in other neuropsychiatric disorders, the DN SNVs we identified may include true OCD risk variants and point toward relevant gene networks and canonical pathways. The small number of subjects in the present study precludes confirmation of the involvement of specific genes or variants in OCD. Nevertheless, the finding that DN SNVs occur more frequently in OCD is an essential prelude to identifying specific risk genes through the identification of recurrent mutations (independent DN variants mapping within the identical gene or at the same chromosomal locus) in larger patient cohorts. Furthermore, the network analyses in this pilot study integrate prior findings from GWAS 20,21 to generate hypotheses of potentially relevant genes, biological pathways, networks and processes in OCD that can be tested in larger studies.
Our first analysis used a SAMtools variant detection pipeline to compare the rate of confirmed DN SNVs in our OCD samples versus published rates in controls and other disorders. Our observed per base pair per generation DN SNV rate (2.51 × 10 − 8 ) differs significantly from a rate previously reported in unaffected subjects (1.31 × 10 − 8 ; ref. 11) using an identical variant calling pipeline. Parental age does not seem to account for this difference, and we did not find any difference between our rate and those reported in schizophrenia, 10,46 autism 11,13,47,48 or intellectual disability 7,8 exome-sequencing studies ( Table 2).
In addition to the variants conformed using SAMtools, we confirmed eight additional nonsynonymous DN SNVs, predicted using a GATK best practices pipeline. We used all 20 nonsynonymous DN SNVs detected by both pipelines for downstream analyses. Given that protein gene products associated with disease have a higher likelihood of physically interacting, 41,51,52 TGFBR1, MAPK1, SMAD3, SKI, SMAD5, MAPK13, TGIF1, EP300, TGFBR2, JUN, RUNX2, SMAD4, NF4A,  TFE3, SMAD1, SMAD2, NRAS, SMAD9, FOXH1, CREBBP, SMAD6, SSMAD7, ZNF423, IRF7, RRAS2, PIAS4 we next constructed a PPI network starting with candidate genes harboring confirmed nonsynonymous DN SNVs in OCD. Two of our candidate genes were classified as 'brokers' (that is, highly and independently connected with non-neighboring genes) in this analysis, suggesting that they may be more relevant for disease (Supplementary Table 5). 52 WWP1 (WW domain containing E3 ubiquitin protein ligase 1) inhibits transcriptional activity induced by TGF-β, a member of a highly pleiotropic cytokine family that maintains immune homeostasis, directs lymphocyte differentiation and orchestrates aspects of embryonic development including neuronal migration and synapse formation. 53,54 SMAD4 (SMAD family member 4) codes for a signal transduction protein that is activated by TGF-β signaling during proliferation and differentiation of the central nervous system. 55 Furthermore, some of our candidate genes were classified as 'bottlenecks' (key connectors) in the PPI network (Supplementary Table 5), believed to be one of the most significant indicators of essentiality. 56 Aside from WWP1 and SMAD4, other bottleneck genes were CR1 (complement component (3b/4b) receptor 1 [Knops blood group]), an important member of the family of regulators of complement activation and a crucial multifunctional mediator of innate immunity; MYO10 (myosin X), which may have a role in neurite outgrowth and axon guidance; 57 and AP1G1 (adaptor-related protein complex 1, gamma 1 subunit), important for the formation of clathrin-coated vesicles to transport ligand-receptor complexes from the plasma membrane or from the trans-Golgi network to lysosomes. 58 Although AP1G1 has not been associated with any psychiatric disorder, it is noteworthy that the first genome-wide investigation of large, rare CNVs in OCD found a de novo CNV encompassing AP1GBP1, a related gene that also acts on clathrin-coated vesicles and may have a role in endocytosis. 32 To investigate the potential relevance of our PPI network to OCD, we conducted a DADA analysis to see whether prioritized genes would have more connectivity in our PPI network. In this analysis, using a seed gene list curated from the two GWAS in OCD, 20,21 two of the three highest ranked genes (WWP1 and SMAD4) were found to be brokers in the PPI network (Supplementary Table 7). The highest ranked gene in the DADA analysis using GWAS I (Supplementary Table 7) was BAMBI (bone morphogenetic protein and activin membrane-bound inhibitor), coding for a pseudoreceptor that negatively modulates TGF-β. 59 BAMBI is classified as a bridge gene in our PPI network and is among the genes regulated by top-ranking SNPs in the OCD GWAS I secondary analysis; 20 its protein product has been found to be selectively and significantly enriched in white matter progenitor cells, and can modulate their differentiation in oligodendrocytes or astrocytes. 60 Furthermore, it is notable that the top-ranked genes in the DADA analyses (BAMBI, SMAD4, WWP1, MYO10, AP1G1, ATP2B2) have negative Residual Variation Intolerance Scores, 61 indicating that in large exome-sequencing databases, they are found to contain less common functional genetic variation relative to the genome-wide expectation. Genes with negative Residual Variation Intolerance Scores are referred to as relatively 'intolerant' to genetic variation and more likely to be involved with neurodevelopmental disease 61 (Table 1, Supplementary Tables 7 and 8).
The canonical pathway analysis results from all the 320 nodes in our PPI network show enrichment for TGF-β signaling, bone morphogenic protein signaling and glucocorticoid signaling ( Table 3); narrowing this list to 37 nodes classified as bridges, brokers and bottlenecks (Supplementary Table 5) shows enrichment for TGF-β and glucocorticoid receptor signaling. TGF-β signaling is mediated by a family of structurally related cytokines and by SMAD proteins that act to control proliferation, differentiation, migration and apoptosis of many different cell types. Bone morphogenic proteins are members of the TGF-β protein family of extracellular ligands, 62 important in the neuronal protection against both apoptosis and excitotoxicity. 63 The fidelity of these pathways are crucial for normal nervous system development and their disruption has been suggested to underlie schizophrenia pathology. 63 Finally, glucocorticoids, a major subclass of steroid hormones, regulate a large number of immune, metabolic, cardiovascular and behavioral functions. Their major effects are anti-inflammatory via transcription induction of antiinflammatory genes and by repression of inflammatory genes. Furthermore, their anti-inflammatory actions are complemented by their ability to induce apopotosis of cells, including monocytes and T lymphocytes. Imbalances in this pathway have been suggested in OCD, anxiety and other neuropsychiatric disorders. [64][65][66] The major limitation of the current study is the small sample size. When studying rare variants, larger samples are needed to adequately show that certain genes or variants are associated with the disorder. Therefore, our study should be considered preliminary, requiring replication in larger OCD cohorts. Recruitment for a larger investigation of DN exome sequence variation in OCD trios is currently underway.
Although we are unable to pinpoint definitive risk genes or variants in a study of this size, we were able to show, for the first time, that (1) sporadic OCD families may have higher rates of DN coding SNVs, suggesting that further study of this type of variation in larger cohorts holds potential to identify risk genes; (2) a PPI network can be constructed on the basis of DN SNVs that appears to have relevance to OCD, judged against common variant findings from other genetic studies of OCD; and (3) PPI network genes show enrichment for biological pathways and functions involved with neurodevelopmental and immunological processes. These findings hold great promise for elucidating our understanding of OCD neurobiology and potential treatments, and deserve further scrutiny in larger cohorts.

CONFLICT OF INTEREST
The authors declare no conflict of interest.