Introduction

Tourette syndrome (TS) is a chronic neurodevelopmental disorder characterized by several motor tics and at least one vocal tic that persist more than a year1. Its prevalence is between 0.6 and 1% in school-aged children2,3. Although TS is highly polygenic in nature, it is also highly heritable4. The population-based heritability is estimated at 0.75,6, with SNP-based heritability ranging from 21 to 58%4 of the total. The genetic risk for TS that is derived from common variants is spread throughout the genome4. The two genome-wide association studies (GWAS) conducted to date7,8 suggest that TS genetic variants may be associated, in aggregate, with tissues within the cortico-striatal and cortico-cerebellar circuits, and in particular, the dorsolateral prefrontal cortex. The GWAS results also demonstrated significant ability to predict tic severity using TS polygenic risk scores7,9. A genome-wide CNV study identified rare structural variation contributing to TS on the NRXN1 and CNTN6 genes10. De novo mutation analysis studies in trios have highlighted two high confidence genes, CELSR and WWC1, and four probable genes, OPA1, NIPBL, FN1, and FBN2 to be associated with TS11,12.

Investigating clusters of genes, rather than relying on single-marker tests is an approach that can significantly boost power in a genome-wide setting13. Common variant studies can account for a substantial proportion of additive genetic variance14 and have indeed produced a wealth of variants associated with neuropsychiatric disorders, which, however, lack strong predictive qualities, an issue commonly referred to as “missing heritability”15. Theoretical, as well as empirical, observations have long hinted toward the involvement of non-additive genetic variance into the heritability of common phenotypes. As such, pathway analyses could pave the way toward the elucidation of missing heritability in complex disease.

This approach has already proven useful in early genome-wide studies of TS. The first published TS GWAS, which included 1285 cases and 4964 ancestry-matched controls did not identify any genome-wide significant loci. However, by partitioning functional- and cell-type-specific genes into gene sets, an involvement of genes implicated in astrocyte carbohydrate metabolism was observed, with a particular enrichment in astrocyte-neuron metabolic coupling16. Here, we investigated further the pathways that underlie the neurobiology of TS, performing gene set analysis on a much larger sample of cases with TS and controls from the second wave TS GWAS. We employed both a competitive gene set analysis as implemented through MAGMA, as well as a self-contained analysis through a set-based association method (SBA). Besides highlighting a potential role for neuroimmunity, our work also provides further support for previously implicated pathways including signaling cascades and cell adhesion molecules.

Materials and methods

Samples and quality control

The sample collection and single variant analyses for the data we analyzed have been extensively described previously7,8. IRB approvals and consent forms were in place for all data collected and analyzed as part of this project. For the purposes of our analysis, we combined 1285 cases with TS and 4964 ancestry-matched controls from the first wave TS GWAS, with 2918 TS cases and 3856 ancestry-matched controls from the second wave TS GWAS. Standard GWAS quality control procedures were employed17,18. The data were partitioned first by genotyping platform and then by ancestry. The sample call rate threshold was set to 0.98, and the inbreeding coefficient threshold to 0.2. A marker call rate threshold was defined at 0.98, case-control differential missingness threshold at 0.02, and Hardy–Weinberg equilibrium (HWE) threshold to 10−6 for controls and 10−10 for cases. Before merging the partitioned datasets, we performed pairwise tests of association and missingness between the case-only and control-only subgroups to address potential batch effect issues. All SNPs with p-values ≤10−06 in any of these pairwise quality control analyses were removed. After merging all datasets, principal component analysis was utilized to remove samples that deviated more than 6 standard deviations and to ensure the homogeneity of our samples in the ancestry space of the first 10 principal components, through the use of the EIGENSOFT suite19. Identity-by-descent analysis with a threshold of 0.1875 was used to remove related samples, and thus to avoid confounding by cryptic relatedness. After quality control, the final merged dataset consisted of 3581 cases with TS and 7682 ancestry-matched controls on a total of 236,248 SNPs, annotated using dbSNP version 137 and the hg19 genomic coordinates.

We assessed the genomic variation in our data through PCA analysis to identify potential population structure (Supplementary Fig. 1 and Supplementary Table 1). The variation in our data was reduced to a triangular shape in the two-dimensional space of the first two principal components. One tip was occupied by Ashkenazi Jewish samples, the second by the Southern European samples, and the other by the North Europeans. Depicting geography, the Southern to Nothern axis was populated by European-ancestry samples. The first five principal components were deemed statistically significant (Tracy Widom test as implemented by EIGENSOFT, Supplementary Table 1) and were added to the association model as covariates, in order to avoid population structure influencing our results.

Gene sets

We collected neural-related gene sets from multiple studies on pathway analyses in neuropsychiatric disorders16,20,21,22,23,24. These studies relied on an evolving list of functionally-partitioned gene sets, focusing mainly on neural gene sets, including synaptic, glial sets, and neural cell-associated processes. We added a lymphocytic gene set also described in these studies23, in order to also investigate potential neuroimmune interactions.

In total, we obtained 51 gene sets, which we transcribed into NCBI Entrez IDs and subsequently filtered by removing gene sets that contained fewer than 10 genes. Forty-five gene sets fit our criteria and were used to conduct the analyses.

We examined two primary categories of pathway analysis methods, the competitive 25 and the self-contained test16,25. The competitive test compares the association signal yielded by the tested gene set to the association signals that do not reside in it26,27. In this type of test, the null hypothesis is that the tested gene set attains the same level of association with disease as equivalent random gene sets. In contrast, the self-contained test investigates associations of each tested gene set with the trait, and not with other gene sets, meaning that the null hypothesis in this case is that the genes in the gene set are not associated with the trait25,27. Therefore, for a competitive test, there should be data for the whole breadth of the genome, but this test cannot provide information regarding how strongly the gene set is associated with the trait28. We employ both methods for a comprehensive investigation into the neurobiological background of TS.

MAGMA on raw genotypes

We ran MAGMA26 on the individual-level genotype data using the aforementioned filtered gene set lists. MAGMA performs a three-step analytic process. First, it annotates the SNPs by assigning them to genes, based on their chromosomal location. Then it performs a gene prioritization step, which is used to perform the final gene set analysis step. We used a genomic window size of ±10 kb and the top 5 principal components as covariates to capture population structure. SNP-to-gene assignments were based on the NCBI 37.3 human gene reference build. The number of permutations required for the analysis was determined by MAGMA, using an adaptive permutation procedure leading to 11,263 permutations. MAGMA employs a family-wise error correction calculating a significance threshold of 0.00100496.

Set-based association (SBA) test

We conducted SBA tests on the raw individual genotype data, as described in PLINK25,29 and adapted in a later publication30. This test relies on the assignment of individual SNPs to a gene, based on their position, and thus to a pathway, according to the NCBI 37.3 human gene reference build. After single-marker association analysis, the top LD-independent SNPs from each set are retained and selected in order of decreasing statistical significance, and the mean of their association p-values is calculated. We permuted the case/control status, repeating the previous association and calculation steps described above, leading to the empirical p-value for each set. The absolute minimum number of permutations required for crossing the significance level is dictated by the number of gene sets tested. Testing for 45 gene sets requires at least 1000 permutations to produce significant findings. PLINK’s max(t) test recommends at least 64,000 permutations. We opted to increase the number of permutations to one million, the maximum that was computationally feasible, to maximize our confidence in the outcomes, given our large sample size.

We used logistic regression as the association model on the genotypes and the first five principal components as covariates on the genotype data to conduct the SBA test with the collected neural gene sets. Another repetition of this step was performed with a simple association test, to test for this method’s robustness to population structure. We proceeded to run the analysis on all samples, using all gene sets at a 10 kb genomic window size, the first five principal components as covariates, and one million permutations. Since the permutations were performed on the phenotypic status of the samples, and only served as a method of association of the trait with the gene sets, we also corrected the results by defining the significance threshold through Bonferroni correction at 1.1 × 10−3 (0.05/45).

Results

For the gene set association analysis, we ran PLINK’s self-contained set-based association method and MAGMA’s competitive association method, using the same 45 gene sets on the processed genotyped data of 3581 cases and 7682 ancestry-matched controls on a total of 236,248 SNPs. By performing both methods of analysis we aimed to obtain a global assessment of the gene sets’ relationship with TS.

MAGMA analysis identified one significant gene set (Fig. 1), cell adhesion and trans-synaptic signaling (CATS), which achieved a nominal p-value of 6.2 × 10−5 (permuted p-value of 0.0032). While the CATS gene set is comprised of 83 genes, MAGMA’s annotation step prioritized 72 of its genes for the gene set analysis. It involves 3290 variants that were reduced to 1627 independent variants in our data. Results were mainly driven by associations in the CDH26, CADM2, and OPCML genes as indicated by MAGMA gene-based analysis (Table 1). In the gene-based tests, CDH26 attained a p-value of 8.9526 × 10−6, CADM2 a p-value of 4.6253 × 10−4, and OPCML a p-value of 7.9851 × 10−4, neither crossing the genome-wide significance threshold for gene tests (2.574 × 10−6 calculated on 19,427 genes contained in the NCBI 37.3 version of RefGene).

Fig. 1: Results of gene set analysis as implemented by MAGMA.
figure 1

The gene set that crossed the significance threshold is depicted in red.

Table 1 Statistically significant result of MAGMA gene set analysis.

We next run SBA, which conducts an initial single-marker association step before performing permutations to calculate empirical p-values for the gene sets. This association step is performed on the total number of variants that are associated with the genes involved in the gene sets, leading to a subset of 25,630 variants in our data, which are then filtered based on their LD. Analysis identified three gene sets as significant (Table 2), the ligand-gated ion channel signaling (LICS) (P: 2.67 × 10−4), the lymphocytic (P: 3.5 × 10−4), and the cell adhesion and trans-synaptic signaling (CATS) (P: 1.07 × 10−3). Detailed results for all the tested gene sets are shown in Fig. 2.

Table 2 Statistically significant results of the SBA analysis.
Fig. 2: Results of gene set analysis as implemented by SBA.
figure 2

The gene sets that crossed the significance threshold are depicted in red.

The LICS gene set was the top-scoring gene set, including 38 genes and involving 683 variants, 66 of which were associated with TS. The gene set’s signal was primarily driven by variants residing in the genes of the γ-aminobutyric acid receptors GABRG1 and GABBR2, the HCN1 channel gene and the glutamate receptor gene GRIK4. This signal was driven primarily by an association with SNP rs9790873, which is an eQTL for HCN1 in tibial nerve, according to GTEx31. GABBR2 is represented by two top SNPs, that are LD-independent, and removing either of those SNPs from the gene set did not cause the gene set to drop under the significance threshold.

The lymphocytic gene set was the next top-scoring gene set, including 143 genes that translated to 799 variants in our data, with 50 of these variants associated with TS. Its signal was driven by a missense variant inside the FLT3 gene and an intergenic variant between NCR1 and NLRP7, followed by IL12A, HDAC9, CD180. The rs1933437 SNP is the top variant for FLT3, and is a possibly damaging missense variant32, located in the sixth exon of the FLT3 gene leading to a p.Thr227Met mutation. It is a very common variant and the sixth exon appears to be less expressed than downstream exons. Given the tissues in which this eQTL affects FLT3’s expression, we tested the lymphocytic gene set by removing FLT3 from it, to identify whether the lymphocytic gene set association was biased by the presence of FLT3. After removing FLT3, the lymphocytic gene set association statistic decreased slightly (P: 0.00012), driven mainly by NCR1/NRLP7.

The third significant gene set, CATS, consisted of 83 genes, including multiple large genes. CATS was identified by both SBA and MAGMA in our analyses, and both gene set approaches identified CDH26 as the gene with the lowest p-value. Both SBA and MAGMA also identified NCAM2, NTM, and ROBO2 as strongly associated with TS, with NTM represented by two LD-independent SNPs. CATS’s top SNP, rs1002762, resides in the CDH26 gene on chromosome 20, and is the top associated SNP in our data (P: 2.031 × 10−6) with an odds ratio of 1.178.

Notable results from the SBA also include the Astrocyte small GTPase mediated signaling (ASGMS) and the Astrocyte-neuron metabolic coupling (ANMC) gene sets, with a p-values slightly under the significance thresholds. These gene sets attained a p-value of 0.00137 and 0.001504, respectively.

Discussion

Seeking to elucidate the neurobiology of TS, we present here the largest study to date aiming to interrogate the involvement of sets of genes that are related to neuronal and glial function in TS. We analyzed data from our recently performed TS GWAS and conducted two distinct types of testing, a competitive, regression-based test (MAGMA) and a self-contained, p-value combining test (SBA). Self-contained tests investigate for associations with a phenotype, while competitive tests compare a specific gene set against randomly generated gene sets. We employed both methods to perform a comprehensive investigation of the biological background of TS.

A potential problem in pathway analysis is false SNP assignment to genes, which in turn may increase false results. In order to address this issue, most studies in the literature use short window sizes (10–20 kb) when assigning SNPs to genes. Here, we used a 10 kb window, paired with excessive permutations to avoid false assignments, that would introduce false positive results. There is evidence that long-range SNP effects could play a role, mostly associated with large insertion/deletion events that are not in the scope of this study and would likely hamper the analysis33.

MAGMA’s regression-based algorithm has been reported to account for gene size biases, as can be also discerned by the variable sizes of the top genes. MAGMA’s top prioritized gene, CDH26, is represented by 4 SNPs in our data, CADM2 by 42, while OPCML is represented by 210 SNPs, as it covers an extensive genomic region. We addressed such issues in SBA by setting a low r2 threshold and conditioning on any LD-independent SNPs that resided on the same gene.

The gene sets used in our study come from a line of function-based analyses, aiming to investigate neurobiological mechanisms in neuropsychiatric disorders. A previous pathway analysis using individual-level genotype data of the first wave TS GWAS identified genes involved in astrocytic-neuron metabolic coupling, implicating astrocytes in TS pathogenesis16. In this study, we took advantage of the increased sample size of the second wave TS GWAS and the mechanics of the two distinct methods to identify gene sets associated with TS that provide a novel insight into the pathogenesis of TS, and substantiate the role of neural processes in this neuropsychiatric disorder.

The ANMC gene set that contains genes involved in carbohydrate metabolism in astrocytes was the single identified gene set in the previous pathway analysis study on TS16, raising a hypothesis on a potential mechanism that involves altered metabolism of glycogen and glutamate/γ-aminobutyric acid in the astrocytes. In our study, the ANMC gene set scored slightly under the significance threshold.

Here, analyzing a much larger sample size we identified three sets of genes as significantly associated to the TS phenotype. Among them the LICS gene set, which involves genes implicated in ion channel signaling through γ-aminobutyric acid and glutamate. Several genes in the LICS gene set have been previously implicated in neuropsychiatric phenotypes. HCN1, a hyperpolarization-activated cation channel involved in native pacemaker currents in neurons and the heart, has been significantly associated with schizophrenia and autism34,35,36. GABRG1, an integral membrane protein that inhibits neurotransmission by binding to the benzodiazepine receptor, has yielded mild associations with general cognitive ability37 and epilepsy38, while GABBR2, a g-protein-coupled receptor that regulates neurotransmitter release, with schizophrenia39 and post-traumatic stress disorder40 in multiple studies. The GABA-ergic pathway has been previously implicated in TS, and recent advances showcased the possibility that a GABA-ergic transmission deficit can contribute toward TS symptoms41. GRIK4, encoding a glutamate-gated ionic channel, has shown associations with mathematical ability and educational attainment42 and weaker associations with attention-deficit hyperactivity disorder43. The γ-aminobutyric acid receptors and the HCN channel, are features of inhibitory interneurons44 and also identified in the brain transcriptome of individuals with TS45, adding to the evidence that the phenotype of TS could be influenced by an inhibitory circuit dysfunction, as has previously been proposed46.

Individuals with TS are reported to present elevated markers of immune activation45,47. In addition, a number of studies have implicated neuroimmune responses with the pathogenesis of TS48,49,50. We investigated neuroimmune interactions by interrogating association to a gene set designed by Goudriaan et al.23 to study enrichment in lymphocytic genes. Indeed, our analysis yielded a statistically significant signal. The FLT3 association coincides with the results of the second wave TS GWAS, in which FLT3 was the only genome-wide significant hit7. FLT3 and its ligand, FLT3LG, have a known role in cellular proliferation in leukemia, and have been found to be expressed in astrocytic tumors51. The rs1933437 variant in FLT3 is an eQTL in the brain cortex and the cerebellum31, and has also been implicated in the age at the onset of menarche52. Variants in FLT3 have attained genome-wide significance in a series of studies focusing on blood attributes in populations of varying ancestry, and our current insights into its role are mostly based on these associations with blood cell counts, serum protein levels, hypothyroidism, and autoimmune disorders52,53,54,55.

FLT3 could play a role in neuroinflammation as supported by its intriguing association with peripheral neuropathic pain. The inhibition of FLT3 is reported to alleviate peripheral neuropathic pain (PNP)56, a chronic neuroimmune condition that arises from aberrations in the dorsal root ganglia. Cytokines and their receptors have been at the epicenter of the neuroimmune interactions, with microglia contributing significantly to chronic phenotypes of such states57. FLT3 is a critical component for neuroimmune interactions, especially in the case of the development and sustenance of the PNP phenotype. Interestingly, pain follows sex-specific routes, with glia having a prominent role for pain propagation in males, while females involve adaptive immune cells instead58. These, paired with previous evidence of glial involvement in TS16, raise an interesting hypothesis for TS symptom sustenance, since FLT3 has been shown to be critical for the chronicity of neuronal dysregulations56.

Notably, FLT3 has a prominent role in the hematologic malignancies, with one-third of adult acute myeloid leukemia (AML) patients presenting with activating mutations in FLT3, and wild-type FLT3 being found overexpressed in hematologic malignancies. FLT3 is implicated in apoptotic mechanisms, with its mutations being associated with59 Warburg effect promotion, inhibition of ceramide-dependent mitophagy60, and induction of pro-survival signals, through downstream signaling cascades, including PI3K-Akt-mTOR, Ras/MAPK, and JAK-STAT. This mitochondrial role of FLT3 has been further reinforced by findings that associate it with increased post-transcriptional methylation of mitochondrial tRNAs in cancer61. As such, FLT3 is regarded a molecular target for therapeutic intervention62.

FLT3 is expressed in the cerebellum and whole blood, while FLT3’s top variant, rs1933437, is an eQTL for FLT3 on GTEx31 in various brain tissues, such as the cortex, the cerebellum, the hypothalamus, the frontal cortex (BA9), and non-brain tissues, such as the skin, the pancreas, and adipose tissues. In order to test the robustness of the lymphocytic association in our findings, we repeated the analysis after removing FLT3 from the lymphocytic gene set. The p-value of the gene set decreased, but still remained significant, due to the association in the NCR1/NLRP7 locus. Besides FLT3, the other genes included in this gene set are also quite intriguing to consider as potential candidates that could underlie the pathophysiology of TS. In the same vein with FLT3, common variants in NCR1 have also been significantly associated with blood protein levels63. HDAC9 has been significantly associated with androgenetic alopecia52,64, hair color52, and ischemic stroke65. These seem to follow previous knowledge, given that genes involved in ischemic stroke have been identified as a common component between TS and ADHD66, and that TS, similar to other neuropsychiatric disorders, demonstrates a distinct preference for males. CD180 has shown associations with general cognitive ability37.

The CATS gene set involves many cell adhesion molecules, with the top signals found in CDH26. CDH26 is a cadherin that regulates leukocyte migration, adhesion, and activation, especially in the case of allergic inflammation67. Cell adhesion molecules have been consistently implicated in phenotypes related to brain function, with the latest addition of the high confidence TS gene CELSR3, a flamingo cadherin, that was identified in a large scale de novo variant study for TS12. Their relation to TS has been well documented, with the notable examples of neurexins, contactins, neuroligins, and their associated proteins10,68,69,70. These genes were present in the CATS gene set but did not reach a level of significance in our analysis. This hints toward their possible involvement in TS mostly through rare variants10,68,69, a notion reinforced by findings in other neuropsychiatric disorders71,72.

Most of the genes contained in the identified gene sets in this study are involved in cognitive performance, mathematical ability, and educational attainment42. OPCML, CADM2, and ROBO2 have been implicated in neuromuscular and activity phenotypes, such as grip strength73, physical activity74, and body mass index52. ROBO2 has been associated with depression75, expressive vocabulary in infancy76, while CADM2 is associated to a multitude of phenotypes, including anxiety75, risk-taking behavior, and smoking77. NTM displays similar patterns of pleiotropy, associated with smoking52, myopia64, hair color78, anxiety75, asperger’s syndrome79, bipolar disorder with schizophrenia80, and eating disorders81. NCAM2 and NTM, similarly to the lymphocytic genes, have been significantly associated with blood protein levels82 and leukocyte count52, respectively. Many of these phenotypes are known TS comorbidities, presenting themselves commonly or less commonly in TS cases, and others are related to functions that get impaired in TS symptomatology.

The CATS gene set was identified in both methods indicating the involvement of cell adhesion molecules in transsynaptic signaling. Using genotypes with both methods as a means of identifying pathways instead of summary statistics, gave our study the edge of sample-specific linkage disequilibrium rather than relying on an abstract linkage disequilibrium pattern reference. Our current understanding for regional structures of the genome and the cis-effects of genomic organization will aid the refinement of these associations as well as help shape our understanding of the pleiotropic mechanisms in the identified loci potentially responsible for disease pathogenesis.

In conclusion, our analysis reinforces previous findings related to TS neurobiology while also providing novel insights: We provide further support for the role of FLT3 in TS, as well as the possibility for the involvement of the GABA-ergic biological pathway in TS pathogenesis. At the same time, our study highlights the potential role of glial-derived neuroimmunity in the neurobiology of TS opening up intriguing hypotheses regarding the potential for gene-environment interactions that may underlie this complex phenotype.