Obsessive-compulsive disorder is a severe psychiatric disorder linked to abnormalities in glutamate signaling and the cortico-striatal circuit. We sequenced coding and regulatory elements for 608 genes potentially involved in obsessive-compulsive disorder in human, dog, and mouse. Using a new method that prioritizes likely functional variants, we compared 592 cases to 560 controls and found four strongly associated genes, validated in a larger cohort. NRXN1 and HTR2A are enriched for coding variants altering postsynaptic protein-binding domains. CTTNBP2 (synapse maintenance) and REEP3 (vesicle trafficking) are enriched for regulatory variants, of which at least six (35%) alter transcription factor-DNA binding in neuroblastoma cells. NRXN1 achieves genome-wide significance (p = 6.37 × 10−11) when we include 33,370 population-matched controls. Our findings suggest synaptic adhesion as a key component in compulsive behaviors, and show that targeted sequencing plus functional annotation can identify potentially causative variants, even when genomic data are limited.
Obsessive-compulsive disorder (OCD) is a highly heritable (h² = 0.27–0.65)1, debilitating neuropsychiatric disorder characterized by intrusive thoughts and time-consuming repetitive behaviors. Over 80 million people worldwide are estimated to suffer from OCD, and most do not find relief with available therapeutics1, underscoring the urgency to better understand the underlying biology. Genome-wide association studies (GWAS) implicate glutamate signaling and synaptic proteins2, 3, but specific genes and variants have not been validated. Isolating and characterizing such genes are important for understanding the biology and developing treatments for this devastating disease.
In mouse, genetically engineered lines have causally implicated the cortico-striatal neural pathway in compulsive behavior. Mice with a deletion of Sapap3 exhibit self-mutilating compulsive grooming and dysfunctional cortico-striatal synaptic transmission, with abnormally high activity of medium spiny neurons (MSNs) in the striatum. The resulting compulsive grooming is ameliorated by a selective serotonin reuptake inhibitor (SSRI), a first-line medication for OCD4. Similarly, chronic optogenetic stimulation of the cortico-striatal pathway in normal mice leads to compulsive grooming accompanied by sustained increases in MSN activity5. Thus, excessive striatal activity, likely due to diminished inhibitory drive in MSN microcircuitry, is a key component of compulsive grooming. The brain region disrupted in this mouse model is also implicated by imaging studies in human OCD6.
Pet dogs are a natural model for OCD amenable to genome-wide mapping due to their unique population structure7. Canine compulsive disorder (canine CD) closely parallels OCD, with equivalent clinical metrics, including compulsive extensions of normal behaviors, typical onset at early social maturity, roughly a 50% rate of response to SSRIs, high heritability, and polygenic architecture8. Through GWAS and targeted sequencing in dog breeds with exceptionally high rates of canine CD, we associated genes involved in synaptic functioning and adhesion with CD, including neural cadherin (CDH2), catenin alpha2 (CTNNA2), ataxin-1 (ATXN1), and plasma glutamate carboxypeptidase (PGCP)8, 9.
Human genetic studies of related disorders, such as autism spectrum disorders (ASD), suggest additional genes. Both ASD and OCD are characterized by repetitive behaviors, and high comorbidity suggests a shared genetic basis6. Genome-wide studies searching for de novo and inherited risk variants have confidently associated hundreds of genes with ASD; this set may be enriched for genes involved in OCD10.
Focusing on genes implicated by model organisms and related disorders could find variants underlying OCD risk, even with smaller sample sizes. Researchers, particularly in psychiatric genetics, are wary of “candidate gene” approaches, which often failed to replicate11. Closer examination of past studies suggests this approach is powerful and reliable when the set of genes tested is large, and the association is driven by rare variation11. A study testing 2000 candidate genes for association with diabetic retinopathy identified 25 genes, at least 11 of which achieved genome-wide significance in a GWAS of type 2 diabetes, a related disorder12, 13. A targeted-sequencing study of ASD, with 78 genes, identified four genes with recurrent, rare deleterious mutations; these four genes are also implicated by whole-exome sequencing studies14. Candidate gene studies also replicated associations to rare variants in APP, PSEN1, and PSEN2 for Alzheimer’s disease15, PCSK9 for low-density lipoprotein–cholesterol level16, and copy-number variants for autism and schizophrenia10.
Detecting associations driven by rare variants requires sequencing data, which captures nearly all variants. Although whole-genome sequencing studies of complex diseases are still prohibitively expensive, it is feasible to target a subset of the genome. Sequencing also facilitates identification of causal variants, accelerating discovery of new therapeutic avenues17, 18. For example, finding functional, rare variants in PCSK9 led to new therapies for hypercholesterolemia19. One approach is to target predominantly coding regions (whole-exome sequencing). Although successful in finding causal variants for rare diseases20, this approach misses the majority of disease-associated variants predicted to be regulatory21. A targeted-sequencing approach that captures both the regulatory and coding variation of a large set of candidate genes offers many advantages of whole-genome sequencing, and is feasible when cohort size and resources are limited.
Here we report a new strategy that overcomes limitations of less comprehensive candidate gene studies and exome-only approaches, and identifies functional variants associated with increased risk of OCD. We start by compiling a large set of 608 genes (~3% of human genes) using studies of compulsive behavior in dogs and mice, and studies of ASD and OCD in humans. By focusing on this subset of genes, targeting both coding and regulatory regions, and applying a new statistical method that incorporates regulatory and evolutionary information, we identify four associated genes, including NRXN1, the first genome-wide-significant association reported for OCD.
We compiled a list of 608 genes using three strategies (65 were implicated more than once) (Supplementary Table 1 and Supplementary Methods):
263 “model-organism genes”, including 56 genes associated in canine CD GWAS and 222 genes implicated in murine-compulsive grooming.
196 “ASD genes” from SFARI database (https://gene.sfari.org/) as of 2009.
216 “human candidate genes” from small-scale OCD candidate gene studies (56 genes), family-based linkage studies of OCD (91 genes), and studies of other neuropsychiatric disorders (69 genes).
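The overlap among the three lists can be checked with a little arithmetic. The sketch below assumes (the text does not state this) that each category count refers to distinct genes within that category, and that "implicated more than once" means a gene appears in two or three categories. The variable names are illustrative only.

```python
# Category sizes and totals as reported in the text.
category_sizes = {"model_organism": 263, "asd": 196, "human_candidate": 216}
unique_genes = 608          # size of the final gene list
multi_category_genes = 65   # genes implicated by more than one strategy

# A gene in exactly two categories is counted once too often;
# a gene in all three categories is counted twice too often.
excess = sum(category_sizes.values()) - unique_genes

# Solve: in_two + in_three = 65 and in_two + 2 * in_three = excess.
in_three = excess - multi_category_genes
in_two = multi_category_genes - in_three

print(excess, in_two, in_three)
```

Under these assumptions the counts are mutually consistent if 63 genes fall in two categories and 2 genes fall in all three.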
We targeted coding regions and 82,723 evolutionarily constrained elements in and around these genes, totaling 13.2 Mb (58 bp–16 kb size range, median size 237 bp), 34% noncoding22.
We sequenced 592 European ancestry DSM-IV OCD cases and 560 ancestry-matched controls using pooled sequencing, with 16 samples per bar-coded pool (37 “case” pools; 35 “control” pools). Overall, 95% of target regions were sequenced at >30× read depth per pool (median 112×; ~7× per individual; Supplementary Fig. 1), sufficient to identify variants occurring in just one individual, assuming 0.5–1% per base machine error rate.
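The depth argument above can be made concrete with a short calculation. This is a simplified sketch, not the variant callers' model: it assumes diploid samples, a single heterozygous carrier per pool, and independent reads, and the `min_alt_reads` cutoff is a hypothetical value chosen only for illustration. Syzygy and SNVer use additional information (base qualities, strand, evidence across pools) that is ignored here.

```python
from math import comb

samples_per_pool = 16
chromosomes_per_pool = 2 * samples_per_pool   # assumes diploid individuals
median_pool_depth = 112                       # median read depth per pool

per_individual_depth = median_pool_depth / samples_per_pool
singleton_fraction = 1 / chromosomes_per_pool  # one heterozygous carrier in the pool
expected_alt_reads = median_pool_depth * singleton_fraction


def prob_at_least(k, n, p):
    """Binomial upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical evidence cutoff: at least this many reads support the allele.
min_alt_reads = 3
scenarios = [
    ("true singleton", singleton_fraction),
    ("machine error, 1% per base", 0.01),
    ("machine error, 0.5% per base", 0.005),
]
for label, rate in scenarios:
    tail = prob_at_least(min_alt_reads, median_pool_depth, rate)
    print(f"{label}: P(>= {min_alt_reads} supporting reads) = {tail:.3f}")
```

A singleton allele is expected in 1/32 (about 3.1%) of reads, several-fold above the assumed 0.5–1% error rate, which is the contrast the depth threshold relies on.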
We called 124,541 single nucleotide polymorphisms (SNPs) using Syzygy (84,216)17 and SNVer (81,829)23. For primary analyses, we focused on 41,504 “high-confidence” SNPs detected by both, with highly correlated allele frequencies (AF) (Pearson’s ρ = 0.999, p < 2.2 × 10−16; Supplementary Fig. 2). We see no significant difference between case and control pools, indicating no bias in variant detection.
We used three annotations shown to be enriched for disease-associated variation to identify likely functional variants in our targeted regions: coding, evolutionarily conserved, and/or DNase1 hypersensitivity site (DHS)21, 24, 25, 26, 27. We annotated 67% (27,626) of high-confidence variants, with 16% coding (49% of those were non-synonymous), 36% DHS, and 80% evolutionarily conserved or divergent (Fig. 1a). We measured evolutionary constraint using mammalian GERP++ scores27; scores >2 were “conserved” and scores <–2 were “divergent”.
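The annotation rule described above can be sketched as a pair of small functions. The GERP++ cutoffs come from the text; the function names, the "neither" label, and the boolean inputs are illustrative assumptions, not part of the published pipeline.

```python
def classify_constraint(gerp_score, threshold=2.0):
    """Label a site from its GERP++ score using the cutoffs given in the text:
    scores > 2 are conserved, scores < -2 are divergent."""
    if gerp_score > threshold:
        return "conserved"
    if gerp_score < -threshold:
        return "divergent"
    return "neither"  # illustrative label for sites between the cutoffs


def is_annotated(is_coding, in_dhs, gerp_score):
    """A variant counts as annotated if it is coding, lies in a DHS,
    and/or is evolutionarily conserved or divergent."""
    return is_coding or in_dhs or classify_constraint(gerp_score) != "neither"


print(classify_constraint(3.1), classify_constraint(-2.5), classify_constraint(0.4))
```

Both cutoffs are strict inequalities, so a score of exactly 2 or –2 is left unlabeled in this sketch.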
Gene-based burden analysis
To identify genes with a significant load of non-reference alleles in OCD cases, relative to controls, we developed PolyStrat, a one-sided gene-based burden test that controls for gene length (Supplementary Fig. 3a) and incorporates variant annotation. We used four variant categories: (i) all (Overall), (ii) coding (Exon), (iii) regulatory (variants in DHS), and (iv) rare (1000 Genomes Project28 AF < 0.01). Each category is further stratified by evolutionary status: (i) all detected variants; (ii) slow-evolving conserved (Cons); (iii) fast-evolving divergent (Div); and (iv) evolutionary (Evo). “Evo” is the subset of “all” variants annotated as either “conserved” or “divergent”. In total, PolyStrat considers 16 groups stratified by predicted function and evolutionary conservation.
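The 16 test groups are the cross product of the four variant categories and the four evolutionary strata. The sketch below enumerates them and assigns a variant to its groups. The dictionary keys, the group labels, and the `groups_for` helper are hypothetical; only the category definitions and the AF < 0.01 cutoff are taken from the text.

```python
from itertools import product

FUNCTIONAL = ["Overall", "Exon", "DHS", "Rare"]
EVOLUTIONARY = ["All", "Cons", "Div", "Evo"]
GROUPS = [f"{func}-{evo}" for func, evo in product(FUNCTIONAL, EVOLUTIONARY)]


def in_group(variant, functional_cat, evo_cat):
    """Return True if the variant belongs to the given stratified group."""
    functional_ok = {
        "Overall": True,
        "Exon": variant["is_coding"],
        "DHS": variant["in_dhs"],
        "Rare": variant["af_1000g"] < 0.01,  # 1000 Genomes allele frequency
    }[functional_cat]
    evo_ok = {
        "All": True,
        "Cons": variant["constraint"] == "conserved",
        "Div": variant["constraint"] == "divergent",
        "Evo": variant["constraint"] in ("conserved", "divergent"),
    }[evo_cat]
    return functional_ok and evo_ok


def groups_for(variant):
    """List every group label that the variant contributes to."""
    return [
        f"{func}-{evo}"
        for func, evo in product(FUNCTIONAL, EVOLUTIONARY)
        if in_group(variant, func, evo)
    ]


# Made-up example: a rare, conserved coding variant outside any DHS.
example = {"is_coding": True, "in_dhs": False, "af_1000g": 0.002, "constraint": "conserved"}
print(len(GROUPS), groups_for(example))
```

Because the groups overlap (every variant is in "Overall-All", and "Evo" is the union of "Cons" and "Div"), the 16 tests per gene are correlated rather than independent.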
PolyStrat p-values are corrected for multiple testing empirically using a permutation-based method that accurately measures experiment-wide statistical significance across correlated gene-based tests, while controlling for type 1 errors (Supplementary Methods). For most variant categories, quantile–quantile plots revealed good correspondence between observed values and the empirical null, with a small number of genes exceeding the expected distribution in a subset of the burden tests (Supplementary Figs. 3b and 4).
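One common way to obtain experiment-wide p-values across correlated tests is a max-statistic permutation scheme, sketched below. This is a generic illustration and an assumption on our part: the procedure actually used by PolyStrat is described in the Supplementary Methods and may differ. The function name and input layout are hypothetical, and the permuted statistics are assumed to have been computed already (for example, by shuffling case/control labels).

```python
def maxstat_corrected_pvalues(observed, permuted):
    """Experiment-wide p-values from a max-statistic permutation null.

    observed: dict mapping test name -> observed statistic (larger = more extreme)
    permuted: list of dicts, one per permutation, with the same test names
    """
    # For each permutation, keep only the most extreme statistic over all tests.
    null_max = [max(perm.values()) for perm in permuted]
    n_perm = len(null_max)
    return {
        name: (1 + sum(m >= stat for m in null_max)) / (1 + n_perm)
        for name, stat in observed.items()
    }


# Toy numbers, made up purely to show the mechanics.
observed = {"geneA": 5.0, "geneB": 1.0}
permuted = [
    {"geneA": 1.0, "geneB": 2.0},
    {"geneA": 3.0, "geneB": 0.5},
    {"geneA": 0.2, "geneB": 0.9},
]
print(maxstat_corrected_pvalues(observed, permuted))
```

Comparing each observed statistic with the permutation-wise maximum accounts for correlation among tests without assuming independence. The added 1 in the numerator and denominator keeps the estimate above zero.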
Five of the 608 sequenced genes (0.82%) show significant burdens of variants in OCD patients (Table 1; Fig. 1b), including two with excess coding variants (NRXN1 and HTR2A) and two with excess regulatory variants (CTTNBP2 and REEP3) (Fig. 2). REEP3 is the only gene with excess divergent (potentially fast evolving) variants. No genes had a significant burden of rare variants (Supplementary Fig. 4).
We validated the 46 SNPs contributing to significant gene-burden tests (7 in LIPH, 13 in NRXN1, 4 in HTR2A, 15 in CTTNBP2, and 7 in REEP3) by individual genotyping of 571 OCD and 555 control samples (98% of the cohort). Nine variants failed Sequenom assay design or had low genotyping rates. For the remaining 37, the genotyping and pooled-sequencing frequencies are nearly perfectly correlated (Pearson’s ρ = 0.999, p < 2.2 × 10−16; Supplementary Fig. 5; Supplementary Data 1).
We confirmed that our significant gene-burden test findings are not driven by population structure (Supplementary Methods) or linkage disequilibrium (LD), with one notable exception. We measured pairwise r² between SNPs contributing to the burden test in our top five genes, and found strong LD (r² > 0.8) between one pair, in LIPH. There was no strong LD in NRXN1, HTR2A, CTTNBP2, and REEP3.
Genes included from model-organism studies (263 genes) and larger ASD studies (196 genes) were significantly more associated than genes from human candidate gene studies (216 genes) (Kruskal–Wallis p = 5.6 × 10−15). This is consistent with previous work showing that genes found through smaller candidate gene studies replicate poorly11. It also suggests that, when a genome-wide study of the disease of interest is not available, targeting genes implicated in a model organism may be as effective as targeting genes implicated in a comorbid, phenotypically similar human disorder.
The five genes most strongly implicated in canine CD and murine-compulsive grooming (CDH2, CTNNA2, ATXN1, PGCP, and Sapap3) have significantly lower p-values than the other 603 sequenced genes (Wilcoxon unpaired, one-sided p = 2.6 × 10−4). The difference becomes more significant when only rare variants are tested (Wilcoxon unpaired, one-sided p = 3.2 × 10−5) (Fig. 1c). This is consistent with the hypothesis that severe disease-causing variants, rare in humans due to negative selection, may persist at higher frequencies in model organisms where selection is relaxed.
Applying the burden test across multiple genes with shared biological functions, we identified gene sets with high-variant load in OCD patients. We tested all 989 Gene Ontology (GO) sets that are at least weakly enriched (enrichment p < 0.1) in our 608 sequenced genes (Supplementary Data 2) and found two with high-variant burdens: “GO:0010942 positive regulation of cell death” (uncorrected p = 3 × 10−4, corrected p < 0.03) and “GO:0031334 positive regulation of protein complex assembly” (uncorrected p = 7 × 10−4, corrected p < 0.06). Overlaying the burden test results onto the GO network topology highlights functional themes linking the enriched gene sets: regulation of protein complex assembly and cytoskeleton organization; neuronal migration; action potential; and cytoplasmic vesicle (Supplementary Fig. 6).
Validation of candidate variants by genotyping
We genotyped the top 67 candidate functional variants from the five significant genes, including 42 rare SNPs (AF < 0.01), in the pooled-sequencing cohort (Fig. 3a). This yielded, after QC, individual genotypes for 63 SNPs in 571 cases and 555 controls (98% of the cohort; genotyping rate >0.94 for all SNPs). We see near perfect correlation with the pooled sequencing for both allele frequencies (Fig. 3b, c; OCD AF, Pearson’s ρ = 0.999, p = 2.7 × 10−89; Control AF, Pearson’s ρ = 0.999, p = 2.5 × 10−89) and the AF differences (Fig. 3d; OCD AF–control AF, Pearson’s ρ = 0.93, p = 4.8 × 10−28).
We genotyped these 63 SNPs in an independent cohort of 727 cases and 1105 controls of European ancestry, and found strong correlation with the first genotyping cohort for both AF (Fig. 3e, f; OCD AF, Pearson’s ρ = 0.999, p = 1.0 × 10−82; control AF, Pearson’s ρ = 0.999, p = 1.8 × 10−94) and AF differences (Fig. 3g; OCD AF–control AF, Pearson’s ρ = 0.4, p = 0.001). The risk allele from the first cohort is significantly more common in cases in the second cohort (Wilcoxon paired one-sided test for 63 SNPs, p = 0.005). More specifically, of 54 SNPs that had a higher frequency of the non-reference allele in cases in the first cohort, 61% also had a higher frequency of the non-reference allele in cases in the second cohort. The 33 SNPs that failed to validate in either of the two cohorts had smaller allele-frequency differences in the first cohort (one-sided unpaired t-test p = 0.02).
In summary, the allele-frequency analysis described above identified four genes: NRXN1, HTR2A, CTTNBP2, and REEP3. LIPH is excluded because its association is likely slightly inflated by LD and the genotyping in the second cohort did not reproduce as clearly. To validate the associations, we employed distinct strategies depending on whether the association was driven by coding (NRXN1 and HTR2A) or regulatory variation (CTTNBP2 and REEP3).
Functional validation of regulatory variants using electrophoretic mobility shift assay
For CTTNBP2 and REEP3, regulatory variants give a far stronger burden signal than does testing for either coding variants or all variants (Fig. 1b). Furthermore, the three largest effect variants in the combined cohort (1298 OCD cases and 1660 controls) alter enhancer elements in these two genes: chr7:117358107 in CTTNBP2 (OR = 5.2) and chr10:65332906 (OR = 3.7) and chr10:65287863 (OR = 3.2) in REEP3 (Supplementary Data 3). Using ENCODE and Roadmap Epigenomics data, we identified 17 candidate SNPs in CTTNBP2 and REEP3, likely to alter transcription factor-binding sites (TFBS) and/or disrupt chromatin structure in brain-related cell types26, 29. All 17 alter enhancers or transcription-associated loci active in the substantia nigra (SN), which relays signals from the striatum to the thalamus, and/or the dorsolateral prefrontal cortex (DL-PFC), which sends signals from the cortex to the striatum/thalamus (Fig. 3h, i). Both regions act in the cortico-striatal-thalamo-cortical (CSTC) pathway implicated by neurophysiological and genetic studies in OCD (Fig. 3j)30.
We functionally tested 17 candidate regulatory SNPs in REEP3 and CTTNBP2 (Table 2; Supplementary Fig. 7b). We introduced each into a human neuroblastoma cell line (SK-N-BE(2)) and assessed transcription factor binding using electrophoretic mobility shift assays (EMSA). Both DHS SNPs in REEP3, three of seven DHS SNPs in CTTNBP2, and one non-DHS variant in CTTNBP2 clearly alter specific DNA-protein binding (Fig. 4a, b). We see weak evidence of differential binding for one upstream DHS SNP in REEP3, two DHS SNPs in CTTNBP2, and one non-DHS SNP in CTTNBP2 (Supplementary Fig. 8).
The high rate of functional validation by EMSA demonstrates that screening using both regulatory and evolutionary information is effective in identifying strong candidate OCD-risk variants. In total, eight of 12 tested DHS SNPs (67%) show evidence of altered protein binding, despite testing a single cell line at a single time point under standard-binding conditions (Table 2). This includes two SNPs with high ORs in the full genotyping data sets that strongly disrupt specific DNA-protein binding (chr10:65332906 with OR = 3.7; chr7:117417559 with OR = 2.2). Two of five non-DHS SNPs (40%) also show altered binding, illustrating that the DHS mark alone is a powerful but imperfect predictor of regulatory function. Both of these SNPs alter highly constrained elements (SiPhy score 8.7), whereas only one of the three remaining non-DHS SNPs is constrained. Although this is a small data set, our results suggest that incorporating both DHS and conservation may identify functional regulatory variants with greater specificity, an observation consistent with previously published research31.
Validation of coding variants using ExAC
In contrast to the regulatory-variant burden found in CTTNBP2 and REEP3, NRXN1 and HTR2A showed significant PolyStrat signals when only coding variants are considered. Of 12 candidate coding SNPs in NRXN1, seven are missense (Table 3). Four of these are SNPs private to OCD cases, and the other three are rare (AF in controls 0.0009–0.0036). All seven change amino acids in laminin G or EGF-like domains important for postsynaptic binding, potentially affecting the involvement of NRXN1 in synapse formation and maintenance (Fig. 4c). Of the three candidate coding SNPs in HTR2A, two (one missense and one synonymous) are in the last coding exon, and one (missense) is in the cytoplasmic domain with a PDZ-binding motif, potentially affecting binding affinity or specificity32.
We sought to improve our statistical power by combining our pooled-sequencing data with publicly available ExAC data33. Using only our data, the associations of CTTNBP2, REEP3, NRXN1, and HTR2A with OCD are experiment-wide significant, but do not reach the genome-wide significance threshold p < 2.5 × 10−6 (~20,000 human genes), with the strongest association, to NRXN1, at p = 5.1 × 10−5 (cohort 1 and 2; Fisher’s combined p). For the two genes with a burden of coding variants (NRXN1 and HTR2A), we used ExAC to assess variant burden in OCD cases compared with 33,370 non-Finnish Europeans. Such a comparison was not possible for CTTNBP2 and REEP3, for which associated variants are predominantly noncoding and thus not assayed in ExAC.
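The genome-wide threshold and the combination of p-values across cohorts can be sketched as follows. The threshold is the Bonferroni correction implied by the text (0.05 divided by ~20,000 genes). The combination uses the standard form of Fisher's method; the input p-values in the example are hypothetical, since the per-cohort values are not given here, and the helper names are ours.

```python
from math import exp, factorial, log


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df
    (closed form, so no external statistics library is needed)."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    return exp(-half) * sum(half**i / factorial(i) for i in range(df // 2))


def fisher_combined(pvalues):
    """Fisher's method: -2 * sum(ln p) follows chi-square with 2k df under the null."""
    statistic = -2 * sum(log(p) for p in pvalues)
    return chi2_sf_even_df(statistic, 2 * len(pvalues))


genome_wide_threshold = 0.05 / 20_000  # Bonferroni over ~20,000 human genes

# Hypothetical per-cohort p-values, for illustration only.
combined = fisher_combined([0.002, 0.01])
print(genome_wide_threshold, combined, combined < genome_wide_threshold)
```

Fisher's method assumes the combined p-values come from independent tests, which holds for two non-overlapping cohorts.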
To assess the significance of the variant enrichment in each gene, we used an isoform-based test that incorporates a within-gene comparison to assess significance, effectively controlling for inflation due to the extremely large size of the ExAC cohort34 (Supplementary Methods). Of 542 genes with more than one isoform, we saw no significant difference between our control data and ExAC for over 90% (493 genes had corrected p > 0.05). Focusing on the subset of 66 genes with nominally significant PolyStrat scores, NRXN1 had the largest difference between cases and ExAC (χ² = 82.3, df = 16, uncorrected p = 6.37 × 10−11; corrected p = 1.27 × 10−6) and no difference between controls and ExAC (χ² = 10.5, df = 16, uncorrected p = 0.84) (Fig. 4d). No previous findings in OCD genetics have reached this level of significance despite >100 candidate gene studies35, a dozen linkage studies30, and two GWAS2, 3. HTR2A, while enriched for coding variants, had only two SNPs in cases, providing insufficient information for the isoform test.
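The uncorrected p-values quoted above can be recovered from the reported test statistics, assuming they are plain upper-tail chi-square probabilities with 16 degrees of freedom. The sketch uses a closed form that is valid for even degrees of freedom; the function name is ours, and the isoform-based test itself is not reproduced.

```python
from math import exp, factorial


def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-square variable with an even number of degrees of freedom."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    return exp(-half) * sum(half**i / factorial(i) for i in range(df // 2))


p_cases_vs_exac = chi2_upper_tail(82.3, 16)     # OCD cases compared with ExAC
p_controls_vs_exac = chi2_upper_tail(10.5, 16)  # controls compared with ExAC

print(f"{p_cases_vs_exac:.2e}", f"{p_controls_vs_exac:.2f}")
```

The results are about 6.4 × 10−11 and 0.84, in line with the reported values; the small difference for the first is expected because the χ² statistic is rounded to one decimal place.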
The significant association of NRXN1 reflects an exceptional burden of variants in one of its 17 Ensembl isoforms. NRXN1a-2, which contains all 12 candidate coding SNPs, had the largest deviation between observed and expected variant counts, with a residual at least 1.4× higher than any other isoform (NRXN1a-2 = 22.3, NRXN1-001 = 16.3; median = 5.15). After adjusting for the residuals from the “null” control data and ExAC comparison, the NRXN1a-2 residual is still 1.3× higher (OCD residual/control residual NRXN1a-2 = 5.34, NRXN1-014 = 4.04).
By analyzing sequencing data for 608 OCD candidate genes, then prioritizing variants according to functional and conservation annotations, we identified four genes with a reproducible variant burden in OCD cases. Two genes, NRXN1 and HTR2A (Table 3), have a burden of coding variants, and the other two, CTTNBP2 and REEP3 (Table 2), have a burden of conserved regulatory variants. Notably, all four act in neural pathways linked to OCD, including serotonin and glutamate signaling, synaptic connectivity, and the CSTC circuit6, offering new insight into the biological basis of compulsive behavior (Fig. 5).
We used three independent approaches to validate our findings: (1) For the top candidate SNPs, allele-frequency differences from sequencing data were confirmed by genotyping of both the original cohort (Fig. 3d) and a larger, independent cohort (Fig. 3g). (2) For the two genes with a burden of coding variants (NRXN1 and HTR2A), comparison of our data to 33,370 population-matched controls from ExAC33 revealed genome-wide-significant association of NRXN1 with OCD. (3) For the two genes with the burden of regulatory variants (REEP3 and CTTNBP2), more than one-third of candidate SNPs altered protein/DNA binding in a neuroblastoma cell line (Fig. 4a).
Comparing our approach with existing methods shows how its two key features (targeted sequencing, and incorporation of functional and conservation metrics) permit identification of significantly associated genes in a cohort smaller than those that previously failed to yield significant results.
Targeted sequencing captures both coding and regulatory variants, and both common and rare variants, at a fraction of the cost of whole-genome sequencing (WGS). For the modest-size cohort in this study, WGS would cost ~$2.5M, 40-fold more than our pooled-sequencing approach. Even without pooling, our targeted-sequencing costs fourfold less than WGS. Whole-exome sequencing would cost approximately the same as targeted sequencing, but misses the regulatory variants explaining most polygenic trait heritability21. By using existing information on OCD and related diseases to prioritize a large set of genes, then performing targeted sequencing of functional elements, our approach enhances causal-variant detection and thus statistical power, although it misses OCD-associated genes not included as candidates, and potential distant regulatory elements.
The capacity to detect associations to rare variants is especially critical for study of diseases that, like OCD, may reduce fitness, as negative selection limits inheritance of deleterious variants36. Genotype array data sets, and even imputed data sets, miss many rare variants. In our data set, 80% of variants driving significant associations have allele frequencies <0.05; one of the densest genotyping arrays available, the Illumina Infinium Omni5 (4.3M markers) contains only half of these variants (Supplementary Data 1)2, 3. In addition, 60% of our variants have allele frequencies <0.01, and would be missed even through imputation with 1000 Genomes and UK10K37.
Our new analytical method, PolyStrat, analyzes targeted-sequencing data capturing all variants, and leverages public evolutionary and regulatory data to increase power. PolyStrat first filters out variants that are less likely to be functional, then performs gene-burden tests. In contrast to gene-based approaches focusing on ultra-rare, protein-damaging variants, PolyStrat considers variants of diverse frequencies, gaining power to identify genes with excess variants in cases.
PolyStrat is particularly advantageous when applied to studies with smaller cohorts. By testing for association at the gene level, it requires statistical correction only for the ~20,000 genes in the genome. It increases power further by using targeted-sequencing data to capture nearly all variation, including variants with higher allele frequencies and/or larger effect sizes, in regions that are coding or evolutionarily constrained, and enriching for causal variants by removing ~33% of variants unlikely to be functional38. PolyStrat tests ~82 times more functional variants than PolyPhen2 (http://genetics.bwh.harvard.edu/pph2/), which focuses on protein-damaging variants (27,626 vs. 335 in our data).
Our PolyStrat results are consistent with expectations from simulations, which suggest that 200–700 cases should yield 90% power to detect associated genes with allele frequencies and effect sizes similar to our four genes39. Specifically, we would achieve 90% power to detect associations to NRXN1 (combined AF = 0.022, OR = 2.4) with ~600 cases, to HTR2A (combined AF = 0.03, OR = 1.56) with ~700 cases; to REEP3 (combined AF = 0.04, OR = 2.11) with ~200 cases, and to rare (AF < 0.01) variants in CTTNBP2 (combined AF = 0.003, OR = 4.7) with ~500 cases.
Previous research on the four genes identified by PolyStrat revealed that all are expressed in the striatum, a brain region linked to OCD (http://human.brain-map.org/). All four genes are involved in pathways relevant to brain function, and harbor variants that could alter OCD risk (Table 4).
NRXN1 encodes the synapse cell-adhesion protein neurexin 1α, a component of the cortico-striatal neural pathway40, 41 implicated in ASD and other psychiatric diseases42, and functionally related to genes associated with OCD (CDH9/CDH10)3, 8, 9 and canine CD8, 9 (CDH2) (Fig. 5). NRXN1 isoforms are implicated in distinct neuropsychiatric disorders. The non-synonymous variants in the NRXN1a-2 isoform (Fig. 4c) may alter synaptic function by disrupting cellular localization or interactions with binding partners, including neurexophilins43. The five synonymous candidate variants in likely regulatory elements may affect protein folding by disrupting post-transcriptional regulation, as seen in other neuropsychiatric disorders44.
The synaptic plasticity gene REEP3, also implicated in ASD45, encodes a protein that shapes tubular endoplasmic reticulum membranes found in highly polarized cells, including neurons46. The two EMSA-validated REEP3 variants change regulatory elements active in the cortico-striatal neural pathway (Fig. 3h) and bound by multiple TFs (Table 2) including GATA2, which may be required to actuate inhibitory GABAergic neurons47. Thus, variants disrupting GATA2 binding could change the balance between excitatory and inhibitory neurons in the CSTC circuit (Fig. 3j)30.
CTTNBP2 regulates postsynaptic excitatory synapse formation. All four EMSA-confirmed variants in CTTNBP2 alter epigenetic marks active in the key structures of the cortico-striatal neural pathway48 (Table 2; Fig. 3h), potentially affecting the expression of this critical gene. CTTNBP2 proteins interact with proteins encoded by both STRN (striatin), which approached experiment-wide significance in this study (uncorrected p = 0.0016, corrected p < 0.1; Fig. 1b), and the canine CD gene CDH2 (Fig. 5).
HTR2A encodes a G-protein-coupled serotonin receptor expressed throughout the central nervous system, including in the prefrontal cortex, and has been implicated in ASD and OCD35. A related serotonin-receptor cluster (HTR3C/HTR3D/HTR3E) is associated with severe canine CD49 (Fig. 5). The three coding variants found in HTR2A may alter its binding affinity (Table 3)32, and one of the three, a rare missense variant (rs6308; AF = 0.004 in 1000G CEU population), is perfectly linked (D′ = 1; http://raggr.usc.edu) to a common variant (rs6314) associated with response to SSRIs50.
Taken together, our top four associated genes and our pathway analysis implicate three classes of neuronal functions in OCD, as described below.
First, synaptic cell-adhesion molecules help establish and maintain contact between the presynaptic and postsynaptic membrane, and are critical for synapse development and neural plasticity. NRXN1 encodes a cell-adhesion molecule predominantly expressed in the brain, and CTTNBP2 regulates cortactin, another such molecule, echoing earlier findings linking cell-adhesion genes to compulsive disorders in dogs (CDH2 and CTNNA2), mice (Slitrk5), and humans (DLGAP1, PTPRD and CDH9/CDH10)2, 3, 8, 51 (Fig. 5). In our pathway analysis, “regulation of protein complex assembly” and “cytoskeleton organization” were enriched for variants in OCD patients.
Second, OCD may result from an imbalance of excitatory glutamate and inhibitory GABAergic neuron differentiation30 (Fig. 3j), a process that involves both NRXN1 52 and REEP3 53 (Table 4), as well as PTPRD, a top OCD GWAS candidate3. We also find an overall burden of variants in genes regulating cell death and apoptosis (Supplementary Data 2) and in telencephalic tangential migration, a neuronal migration event that forms connections between the key structures of the CSTC circuit54.
Third, SSRIs are the most effective available OCD treatment, suggesting a role for serotonergic pathways in disease. HTR2A encodes a serotonin receptor, and allelic variation in HTR2A alters response to SSRIs (Table 4)50. In addition, both REEP3 and CACNA1C, which score high in this study (Fig. 1), also significantly associate with schizophrenia and act in calcium signaling, a downstream pathway of HTR2A 55,56,57. Meta-analysis of >100 OCD genetic association studies found strong association to both HTR2A and the serotonin transporter gene SLC6A4 35. In dogs, a serotonin-receptor locus is associated with severe CD49.
Our findings suggest broad principles that could guide studies of other polygenic diseases. Our results suggest that genes associated in selectively bred model organisms are more likely to contain rare, highly penetrant variants. The five genes we found to be most strongly associated with compulsive behaviors in dog and mouse (CDH2, CTNNA2, ATXN1, PGCP, and Sapap3) were significantly more enriched for rare variants in human patients than the other 603 genes targeted, although they did not individually achieve significance (Fig. 1c). We propose that the enrichment of rare variants in humans reflects natural selective forces limiting the prevalence of severe disease-causing variants. Such forces are less powerful in selectively bred animal populations. Because risk variants identified through animal models are anticipated to be rare in humans, replication will require either family-based studies, or cohorts of a magnitude not currently available.
We also find that the balance of coding and regulatory variants in a gene may reflect its developmental importance, with regulatory burdens concentrated in genes that are intolerant of coding changes. Although single-gene p-values from PolyStrat tests are positively correlated across variant categories, as is expected given overlaps between different variant categories (Fig. 1a; Supplementary Fig. 9), this pattern breaks down for our four significantly associated genes. NRXN1 and HTR2A, which have burdens of coding variants, score poorly on regulatory-variant tests; CTTNBP2 and REEP3, which have burdens of regulatory variants, score poorly in coding-variant tests (Fig. 1b). This is consistent with the ExAC study showing that genes critical to viability or development do not tolerate major coding changes33. In that study, the authors infer that CTTNBP2 and REEP3 would be intolerant of homozygous loss-of-function variants (pRec = 0.99999015 and pRec = 0.953842585, respectively), whereas HTR2A (pRec = 0.225555783) and, most notably, NRXN1 (pRec = 5.13 × 10−5) would be far more tolerant. Our finding of enrichment for regulatory variants in CTTNBP2 and REEP3 suggests that these genes may tolerate variants with more subtle functional impacts, such as expression differences in specific cell types or developmental stages.
Technological advances in high-throughput sequencing bring increased focus on identifying causal genetic variants as a first step toward targeted disease therapies58. However, existing approaches have notable limitations. WGS is prohibitively expensive in large cohorts, whereas cost-saving whole-exome sequencing does not capture the regulatory variants underlying complex diseases21. Leveraging existing genomic resources can increase power to find causal variants through meta-analysis and imputation, but these resources are heavily biased towards a few populations. Without new approaches, advances in precision medicine will predominantly benefit those of European descent.
Here, we describe an approach that combines prior findings, targeted sequencing, and a new analytic method to efficiently identify genes and individual variants associated with complex disease risk. In a modest-size cohort of OCD cases and controls we find associations driven by both coding and regulatory variants, highlighting new potential therapeutic targets. Our method holds promise for elucidating the biological basis of complex disease, and for extending the power of precision medicine to previously excluded populations.
We designed and carried out the study in two phases. In the first, discovery phase, we performed targeted sequencing of 592 individuals with DSM-IV OCD and 560 controls of European ancestry, and tested association with OCD at the single-variant, gene, and pathway levels. In the second, validation phase, we employed three distinct analyses. (1) We genotyped both the original cohort and a second, independent cohort of 1834 DNA samples (729 DSM-IV OCD cases and 1105 controls) of European ancestry, for a total of 2986 individuals (1321 OCD cases and 1665 controls), to confirm the allele frequencies observed in the discovery phase. (2) We compared our sequencing data with 33,370 population-matched controls from ExAC to confirm the gene-based burden of coding variants as well as allele frequencies. (3) We performed electrophoretic mobility shift assays (EMSA) to test whether our candidate variants have regulatory function. Use of biospecimens in this study was reviewed and approved by either the Broad's Office of Research Subject Protection or the Partners HealthCare Human Research Committee. Informed consent was obtained from all subjects included in our study.
We targeted 82,723 evolutionarily constrained regions in and around 608 genes, which included all regions within 1 kb of the start and end of each of the 608 targeted genes with a SiPhy evolutionary constraint score >7, as well as all exons22. For the intergenic regions upstream and downstream of each gene, we used constraint score thresholds that became more stringent with distance from the gene.
Pooled sequencing and variant annotation
Groups of 16 individuals were pooled into 37 case pools and 35 control pools and bar-coded. Targeted genomic regions were captured using a custom NimbleGen hybrid capture array and sequenced on an Illumina GAII or Illumina HiSeq2000. Sequencing reads were aligned and processed with the Picard analysis pipeline (http://broadinstitute.github.io/picard/). Variants and allele frequencies (AFs) were called using Syzygy17 and SNVer23. We used ANNOVAR25 to annotate variants with RefSeq genes (hg19), GERP scores, ENCODE DHS clusters, and 1000 Genomes data.
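To illustrate what an allele frequency means in a pooled design, the sketch below computes a naive read-count estimate at a single site. It is not how Syzygy or SNVer work internally (both model sequencing error and pool composition); the function name, the constant, and the example read counts are hypothetical and chosen only for illustration.

```python
import numpy as np

# 16 diploid individuals per pool, as described above.
CHROMOSOMES_PER_POOL = 2 * 16


def pooled_allele_frequency(alt_reads, total_reads):
    """Naive non-reference allele frequency at one site from pooled reads.

    Returns the per-pool frequencies (alt reads / total reads) and their
    unweighted mean across pools. Pools with zero coverage are ignored.
    This assumes equal-sized pools and no sequencing error, which real
    pooled variant callers do not assume.
    """
    alt = np.asarray(alt_reads, dtype=float)
    total = np.asarray(total_reads, dtype=float)
    per_pool = np.full_like(alt, np.nan)
    covered = total > 0
    per_pool[covered] = alt[covered] / total[covered]
    return per_pool, float(np.nanmean(per_pool))
```

Multiplying a per-pool frequency by `CHROMOSOMES_PER_POOL` gives the implied number of non-reference chromosomes in that pool, which is a quick sanity check that an estimate is compatible with a whole number of carriers.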
SNP genotyping was performed using the Sequenom MassARRAY iPLEX platform. The resulting data were analyzed using PLINK1.9 (www.cog-genomics.org/plink2).
For each allele of the tested variants, pairs of 5′-biotinylated oligonucleotides were obtained from IDT Inc. (Supplementary Data 4). Equal volumes of forward and reverse oligonucleotides (1 pmol/µl) were mixed and heated at 95°C for 5 min and then cooled to room temperature. Annealed probes were incubated at room temperature for 30 min with SK-N-BE(2) nuclear extract (Active Motif). The remaining steps followed the LightShift Chemiluminescent EMSA Kit protocol (Thermo Scientific).
For gene-level and pathway-level association, we used the sum of the differences in non-reference allele rates between cases and controls per gene as the test statistic, and calculated the probability of observing such a test statistic by chance from 10,000 permutations. Multiple testing was corrected empirically using the “minP” procedure. See Supplementary Methods for details.
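The gene-level test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation (which used Rplinkseq and PLINK1.9 and is detailed in the Supplementary Methods): it assumes individual-level genotypes coded as 0/1/2 non-reference alleles rather than pooled data, a one-sided test for excess in cases, and a single-step minP adjustment. All function and variable names are hypothetical.

```python
import numpy as np


def gene_stats(genotypes, is_case, gene_index, n_genes):
    """Per gene: sum over variants of (case rate - control rate).

    genotypes  : (individuals x variants) array of 0/1/2 allele counts
    is_case    : boolean array, one entry per individual
    gene_index : integer gene id (0..n_genes-1) for each variant
    """
    case_rate = genotypes[is_case].mean(axis=0) / 2.0
    ctrl_rate = genotypes[~is_case].mean(axis=0) / 2.0
    return np.bincount(gene_index, weights=case_rate - ctrl_rate,
                       minlength=n_genes)


def permutation_minp(genotypes, is_case, gene_index, n_perm=10_000, seed=0):
    """Empirical per-gene p-values and single-step minP adjusted p-values."""
    rng = np.random.default_rng(seed)
    n_genes = int(gene_index.max()) + 1
    observed = gene_stats(genotypes, is_case, gene_index, n_genes)

    # Null distribution: recompute the statistic with shuffled labels.
    perm = np.empty((n_perm, n_genes))
    for b in range(n_perm):
        shuffled = rng.permutation(is_case)
        perm[b] = gene_stats(genotypes, shuffled, gene_index, n_genes)

    # One-sided empirical p-value for each gene (add-one correction).
    raw_p = ((perm >= observed).sum(axis=0) + 1) / (n_perm + 1)

    # p-value of every permuted statistic within its own gene's null.
    perm_p = np.empty_like(perm)
    for g in range(n_genes):
        col = perm[:, g]
        n_ge = n_perm - np.searchsorted(np.sort(col), col, side="left")
        perm_p[:, g] = n_ge / n_perm

    # minP: compare each observed p-value with the smallest p-value
    # obtained across all genes in each permutation.
    min_p = perm_p.min(axis=1)
    adj_p = ((min_p[:, None] <= raw_p[None, :]).sum(axis=0) + 1) / (n_perm + 1)
    return observed, raw_p, adj_p
```

Because the adjustment uses the minimum p-value across all genes within each permutation, it accounts for correlation between tests without assuming independence, which is the usual motivation for choosing minP over a Bonferroni correction.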
The code used in this study was obtained from the R package Rplinkseq and PLINK1.9.
All data presented in this study are accessible at: https://data.broadinstitute.org/OCD_NatureCommunications2017/.
Pauls, D. L. The genetics of obsessive-compulsive disorder: a review. Dialogues Clin. Neurosci. 12, 149–163 (2010).
Stewart, S. E. et al. Genome-wide association study of obsessive-compulsive disorder. Mol. Psychiatry 18, 788–798 (2013).
Mattheisen, M. et al. Genome-wide association study in obsessive-compulsive disorder: results from the OCGAS. Mol. Psychiatry 20, 337–344 (2014).
Welch, J. M. et al. Cortico-striatal synaptic defects and OCD-like behaviours in Sapap3-mutant mice. Nature 448, 894–900 (2007).
Ahmari, S. E. et al. Repeated cortico-striatal stimulation generates persistent OCD-like behavior. Science 340, 1234–1239 (2013).
Ting, J. T. & Feng, G. Neurobiology of obsessive-compulsive disorder: insights into neural circuitry dysfunction through mouse genetics. Curr. Opin. Neurobiol. 21, 842–848 (2011).
Karlsson, E. K. & Lindblad-Toh, K. Leader of the pack: gene mapping in dogs and other model organisms. Nat. Rev. Genet. 9, 713–725 (2008).
Tang, R. et al. Candidate genes and functional noncoding variants identified in a canine model of obsessive-compulsive disorder. Genome Biol. 15, R25 (2014).
Dodman, N. H. et al. A canine chromosome 7 locus confers compulsive disorder susceptibility. Mol. Psychiatry 15, 8–10 (2010).
Sullivan, P. F., Daly, M. J. & O’Donovan, M. Genetic architectures of psychiatric disorders: the emerging picture and its implications. Nat. Rev. Genet. 13, 537–551 (2012).
Farrell, M. S. et al. Evaluating historical candidate genes for schizophrenia. Mol. Psychiatry 20, 555–562 (2015).
Sobrin, L. et al. Candidate gene association study for diabetic retinopathy in persons with type 2 diabetes: the Candidate gene Association Resource (CARe). Invest. Ophthalmol. Vis. Sci. 52, 7593–7602 (2011).
Fuchsberger, C. et al. The genetic architecture of type 2 diabetes. Nature 536, 41–47 (2016).
D’Gama, A. M. et al. Targeted DNA sequencing from autism spectrum disorder brains implicates multiple genetic mechanisms. Neuron 88, 910–917 (2015).
Bertram, L. & Tanzi, R. E. Thirty years of Alzheimer’s disease genetics: the implications of systematic meta-analyses. Nat. Rev. Neurosci. 9, 768–778 (2008).
Cohen, J. C., Boerwinkle, E., Mosley, T. H. Jr. & Hobbs, H. H. Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. N. Engl. J. Med. 354, 1264–1272 (2006).
Rivas, M. A. et al. Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease. Nat. Genet. 43, 1066–1073 (2011).
Gutierrez-Achury, J. et al. Fine mapping in the MHC region accounts for 18% additional genetic risk for celiac disease. Nat. Genet. 47, 577–578 (2015).
Roth, E. M., McKenney, J. M., Hanotin, C., Asset, G. & Stein, E. A. Atorvastatin with or without an antibody to PCSK9 in primary hypercholesterolemia. N. Engl. J. Med. 367, 1891–1900 (2012).
Warr, A. et al. Exome sequencing: current and future perspectives. G3 5, 1543–1550 (2015).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
Lindblad-Toh, K. et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011).
Wei, Z., Wang, W., Hu, P., Lyon, G. J. & Hakonarson, H. SNVer: a statistical tool for variant calling in analysis of pooled or individual next-generation sequencing data. Nucleic Acids Res. 39, e132 (2011).
Pruitt, K. D. et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 42, D756–63 (2014).
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
Dunham, I. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010).
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Roadmap Epigenomics Consortium. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Pauls, D. L., Abramovitch, A., Rauch, S. L. & Geller, D. A. Obsessive-compulsive disorder: an integrative genetic and neurobiological perspective. Nat. Rev. Neurosci. 15, 410–424 (2014).
Kheradpour, P. et al. Systematic dissection of regulatory motifs in 2000 predicted human enhancers using a massively parallel reporter assay. Genome Res. 23, 800–811 (2013).
Becamel, C. et al. The serotonin 5-HT2A and 5-HT2C receptors interact with specific sets of PDZ proteins. J. Biol. Chem. 279, 20257–20266 (2004).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Schneider, J. W. Caveats for using statistical significance tests in research assessments. J. Informetr. 7, 50–62 (2013).
Taylor, S. Molecular genetics of obsessive-compulsive disorder: a comprehensive meta-analysis of genetic association studies. Mol. Psychiatry 18, 799–805 (2013).
Park, J.-H. et al. Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants. Proc. Natl Acad. Sci. USA 108, 18026–18031 (2011).
Huang, J. et al. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nat. Commun. 6, 8111 (2015).
Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93 (2011).
Zuk, O. et al. Searching for missing heritability: designing rare variant association studies. Proc. Natl Acad. Sci. USA 111, E455–64 (2014).
de Wit, J. et al. LRRTM2 interacts with Neurexin1 and regulates excitatory synapse formation. Neuron 64, 799–806 (2009).
Surmeier, D. J., Ding, J., Day, M., Wang, Z. & Shen, W. D1 and D2 dopamine-receptor modulation of striatal glutamatergic signaling in striatal medium spiny neurons. Trends Neurosci. 30, 228–235 (2007).
Sudhof, T. C. Neuroligins and neurexins link synaptic function to cognitive disease. Nature 455, 903–911 (2008).
Rujescu, D. et al. Disruption of the neurexin 1 gene is associated with schizophrenia. Hum. Mol. Genet. 18, 988–996 (2009).
Takata, A., Ionita-Laza, I., Gogos, J. A., Xu, B. & Karayiorgou, M. De novo synonymous mutations in regulatory elements contribute to the genetic etiology of autism and schizophrenia. Neuron 89, 940–947 (2016).
Castermans, D. et al. Identification and characterization of the TRIP8 and REEP3 genes on chromosome 10q21.3 as novel candidate genes for autism. Eur. J. Hum. Genet. 15, 422–431 (2007).
Blackstone, C., O’Kane, C. J. & Reid, E. Hereditary spastic paraplegias: membrane traffic and the motor pathway. Nat. Rev. Neurosci. 12, 31–42 (2011).
Kala, K. et al. Gata2 is a tissue-specific post-mitotic selector gene for midbrain GABAergic neurons. Development 136, 253–262 (2009).
Chen, Y. K. & Hsueh, Y. P. Cortactin-binding protein 2 modulates the mobility of cortactin and regulates dendritic spine formation and maintenance. J. Neurosci. 32, 1043–1055 (2012).
Dodman, N. H. et al. Genomic risk for severe canine compulsive disorder, a dog model of human OCD. Int. J. Appl. Res. Vet. Med. 14, 1–18 (2016).
Porcelli, S. et al. Pharmacogenetics of antidepressant response. J. Psychiatry Neurosci. 36, 87–113 (2011).
Shmelkov, S. V. et al. Slitrk5 deficiency impairs corticostriatal circuitry and leads to obsessive-compulsive-like behaviors in mice. Nat. Med. 16, 598–602 (2010).
Graf, E. R., Zhang, X., Jin, S. X., Linhoff, M. W. & Craig, A. M. Neurexins induce differentiation of GABA and glutamate postsynaptic specializations via neuroligins. Cell 119, 1013–1026 (2004).
Doly, S. & Marullo, S. Gatekeepers controlling GPCR export and function. Trends Pharmacol. Sci. 36, 636–644 (2015).
Marin, O. & Rubenstein, J. L. A long, remarkable journey: tangential migration in the telencephalon. Nat. Rev. Neurosci. 2, 780–790 (2001).
Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014).
Schwarz, D. S. & Blower, M. D. The endoplasmic reticulum: structure, function and response to cellular signaling. Cell Mol. Life Sci. 73, 79–94 (2016).
The UniProt Consortium. UniProtKB—P28223 (5HT2A_HUMAN). http://www.uniprot.org/uniprot/P28223. Accessed 8th August (2016)
Ashley, E. A. Towards precision medicine. Nat. Rev. Genet. 17, 507–522 (2016).
Niknafs, N. et al. MuPIT interactive: webserver for mapping variant positions to annotated, interactive 3D structures. Hum. Genet. 132, 1235–1243 (2013).
Chen, F., Venugopal, V., Murray, B. & Rudenko, G. The structure of neurexin 1α reveals features promoting a role as synaptic organizer. Structure 19, 779–789 (2011).
NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 44, D7–D19 (2016).
Chen, Y. K. & Hsueh, Y. P. Cortactin-binding protein 2 modulates the mobility of cortactin and regulates dendritic spine formation and maintenance. J. Neurosci. 32, 1043–1055 (2012).
Lambe, E. K., Fillman, S. G., Webster, M. J. & Shannon Weickert, C. Serotonin receptor expression in human prefrontal cortex: balancing excitation and inhibition across postnatal development. PLoS ONE 6, e22799 (2011).
Jenkins, A. K. et al. Neurexin 1 (NRXN1) splice isoform expression during human neocortical development and aging. Mol. Psychiatry 21, 701–706 (2016).
El Sayegh, T. Y. et al. Cortactin associates with N-cadherin adhesions and mediates intercellular adhesion strengthening in fibroblasts. J. Cell Sci. 117, 5117–5131 (2004).
Chen, Y. K., Chen, C. Y. & Hu, H. T. CTTNBP2, but not CTTNBP2NL, regulates dendritic spinogenesis and synaptic distribution of the striatin–PP2A complex. Mol. Biol. Cell 23, 4383–4392 (2012).
We thank the participating individuals for their support, Eric S. Lander, Steven E. Hyman, Jessica Alföldi, and Kaitlin Samocha for valuable input; Leslie Gaffney for help with illustrations; Jeremiah M. Scharf for sample contribution and discussions; and Broad Genomics Platform for sample processing, sequencing, and genotyping. H.J.N. is supported by the AKC Health Foundation and Swedish Research Council, C.R. by the Swedish Research Council (K2013-61P-22168), K.L.-T. by the Swedish Medical Research Council and European Research Council, and E.K.K. by NIH NIMH (1R21MH109938-01). A Broad Institute SPARC grant supported part of this work.
The authors declare no competing financial interests.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Noh, H.J., Tang, R., Flannick, J. et al. Integrating evolutionary and regulatory information with a multispecies approach implicates genes and pathways in obsessive-compulsive disorder. Nat Commun 8, 774 (2017). https://doi.org/10.1038/s41467-017-00831-x