Introduction

An epileptic seizure is a paroxysm of symptoms and signs due to abnormally excessive or synchronous neuronal activity1. Seizures are classified based on their characteristics and electroencephalogram (EEG) as focal-onset seizures (which start in a specific brain region) and generalized-onset seizures (which rapidly engage bihemispheric networks)1,2. The utility of this seizure classification is that it categorizes epilepsy into syndromes and allows clinicians to draw inferences about disease etiology, trajectory, and response to medication. Clinical manifestations vary from whole-body convulsions with loss of consciousness (tonic-clonic seizures), to movements involving only part of the body with variable levels of consciousness (focal motor seizure), to a brief loss of awareness (absence seizure)1,2. Seizures can be provoked by head trauma, infection, or acute toxic-metabolic imbalance, or they can be spontaneous and unprovoked. A diagnosis of epilepsy requires either at least one unprovoked seizure with an enduring elevated risk of further seizures, or the electroclinical features of one of a few specific epilepsy syndromes that can be diagnosed without recurrent seizures1. Seizures and epilepsy are common in the general population. Neonatal seizures occur in 1.5% of neonates, febrile seizures in 2–4% of young children, and epilepsy in up to 1% of children and adolescents3. Seizures are common among individuals with neurodevelopmental disorders, affecting 21.5% of those with autism and intellectual disability and 8% with autism without intellectual disability4.

Copy number variants (CNVs), such as deletions and duplications, change the dosage of genomic segments and are established risk factors for various types of epilepsy5,6,7,8,9,10,11,12,13,14, seizures15, and neuropsychiatric disorders16,17,18,19. Large CNVs can affect multiple dosage-sensitive genes, leading to complex clinical presentations. To date, only one hypothesis-free genome-wide CNV association study (CNV-GWAS) has been reported for epilepsy20. This CNV-GWAS in 10,712 individuals with epilepsy and 6,746 controls identified three genome-wide significant CNVs20. High-resolution CNV screening has become routine in clinical molecular diagnostics, leading to greater detection of chromosomal abnormalities in patients21. Diagnostic CNVs can be identified in 1–4% of individuals with epilepsy and >10% of those with seizures and neurodevelopmental disorders13,20,21,22. However, the pleiotropy of pathogenic CNVs, partially driven by structural properties (size, fixed vs. variable breakpoints, number of affected genes), represents a significant challenge in the clinical interpretation of CNVs, limiting their utility for disorder classification, prognostication, and the development of precision medicine treatments that specifically target the critical pathogenic gene(s) altered by the CNV. The majority of pathogenic and likely pathogenic CNVs are greater than 1 megabase (Mb) in size, and it is often unclear which gene(s) or genomic element(s) affected by the CNV contribute to one or more disorders23,24. A well-powered seizure CNV discovery screen combined with detailed genotype-phenotype analyses could identify genomic segments that confer risk for seizures, identify clinical characteristics in affected patients and consequently guide genetic test interpretation.

Although many individuals with neuropsychiatric and developmental disorders have comorbid seizures, genome-wide CNV association analyses across epilepsy and seizures have yet to be reported. We hypothesized that genetic risk for seizures is shared between individuals with epilepsy diagnosed according to International League Against Epilepsy (ILAE) criteria1 and individuals with related neurological and neurodevelopmental disorders who also have seizures. Therefore, a joint analysis could add to the three epilepsy-associated CNV loci reported previously20. To explore this hypothesis, we performed a meta-analysis of CNV-GWASs comprising 26,699 individuals with diagnosed epilepsy or seizures and 492,324 controls. Since both definitions are based on the presence of seizures, we refer to individuals affected by either condition as individuals with seizures from here on. The effective sample size of this study (Neff = 101,302) provides adequate power to identify significant associations of risk CNVs that are present in the general healthy population and therefore do not exhibit complete penetrance. However, the analytic setup restricts the frequency in the general population to at most 1% for quality purposes. We assessed the pleiotropy of any identified seizure-associated CNV in subsequent meta-analyses of epilepsy and 238,161 independent individuals affected by a range of 23 neuropsychiatric disorders. Finally, using a subset of the seizure cohort comprising 10,880 individuals with epilepsy detailed using 214,203 Human Phenotype Ontology (HPO) annotations25, we evaluated the clinical features characterizing carriers of each seizure-associated CNV.
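The reported effective sample size can be reproduced from the case and control counts given above. The sketch below assumes the formula commonly used for unbalanced case-control designs, Neff = 4 / (1/Ncases + 1/Ncontrols); the text does not state which formula was applied, so this is an illustration rather than the study's documented procedure. The function name is hypothetical.

```python
def effective_sample_size(n_cases: int, n_controls: int) -> float:
    """Effective sample size of an unbalanced case-control design.

    Assumed formula (not stated in the text): 4 / (1/n_cases + 1/n_controls).
    It equals the total N of a balanced design with comparable power.
    """
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


# Counts taken from the text: 26,699 individuals with seizures, 492,324 controls.
n_eff = effective_sample_size(26_699, 492_324)
print(round(n_eff))  # 101302
```

Under this assumption the result matches the reported Neff = 101,302, which shows how strongly the smaller group (cases) limits power despite the large number of controls.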

Results

Discovery of 25 genome-wide significant seizure-associated CNV regions

We performed a meta-analysis of 16,109 individuals with epilepsy and 8545 population controls (the Epi25 Collaborative cohort) with 10,590 individuals with seizures (not explicitly meeting diagnostic criteria for epilepsy) and 483,779 population controls, derived from an aggregated CNV dataset of 17 cohorts (neuropsychiatric disorders cohort) (see all cohorts of this study in Supplementary Table 1). The genome was scanned using 267,237 genomic segments of 200 kb size in a 10 kb sliding window approach26. After applying Bonferroni correction of the threshold for a significant association in the meta-analysis and fine-mapping, we identified 25 loci associated with seizures at genome-wide significance (P ≤ 3.74 × 10−6). All 25 loci are shown in Fig. 1 and detailed in Table 1. The 25 identified loci included 15 deletion CNVs (size range: 230 kb to 5 Mb) and ten duplication CNVs (size range: 290 kb to 8.9 Mb). All the genome-wide associated deletions found in this study consisted of the loss of one copy, while all duplications consisted of the gain of one copy. Three of the 25 seizure-associated loci (15q11.2-q13.3 dup, 15q13.2-q13.3 del, 16p13.11 del) had previous genome-wide statistical support for an association with epilepsy from our previous study20, which included 40% of the individuals with seizures in this study. All other identified CNVs (22/25, 88%) represent new genome-wide significant loci for seizures, with 10/22 (45%) loci previously implicated in neurological and psychiatric disorders, 6/22 (27%) specifically in epilepsy by studies without genome-wide statistical support, 2/22 (9%) reported in individuals without neurological or psychiatric disorders, and 4/22 (18%) in regions not previously reported. All commonly reported disease phenotypes for the 25 identified seizure-associated loci are detailed in Table 2. Our meta-analysis in seizure disorders was likely underpowered to detect some CNVs previously implicated in epilepsy without genome-wide statistical support (e.g., 1q21.1 del/dup). Reciprocal CNVs, defined as deletions and duplications associated with seizures that involve overlapping genomic segments, were found at 15q11.2, 16p11.2, and 22q11.21. No overlap existed between the seizure-associated CNV regions identified in this study and the most recent SNP-based GWAS in epilepsy27.
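The genome scan described above tiles each chromosome with 200 kb segments advanced in 10 kb steps. A minimal sketch of that tiling is shown below. The half-open coordinate convention, the handling of chromosome ends (windows must fit entirely), and the function names are assumptions made for illustration; they are not taken from the study's pipeline.

```python
from typing import Iterator, Tuple

WINDOW_BP = 200_000  # segment size stated in the text
STEP_BP = 10_000     # sliding step stated in the text


def sliding_windows(chrom_length: int,
                    window: int = WINDOW_BP,
                    step: int = STEP_BP) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) windows lying entirely within the chromosome."""
    start = 0
    while start + window <= chrom_length:
        yield start, start + window
        start += step


def overlaps(cnv_start: int, cnv_end: int, win_start: int, win_end: int) -> bool:
    """True if two half-open intervals share at least one base."""
    return cnv_start < win_end and win_start < cnv_end


# Toy 1 Mb chromosome and a hypothetical 50 kb CNV at 300,000-350,000.
windows = list(sliding_windows(1_000_000))
hit = [w for w in windows if overlaps(300_000, 350_000, *w)]
print(len(windows), len(hit))  # 81 24
```

Because adjacent windows share 190 kb, a single CNV contributes to many neighbouring tests. This correlation is one reason why significant windows are subsequently fine-mapped into credible intervals rather than interpreted individually.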

Fig. 1: Genome-wide meta-analysis identifies 25 CNVs associated with seizure disorders.

Miami plot of the meta-analysis of the CNV genome-wide association analyses of (1) 16,109 individuals with clinically validated epilepsy vs. 8545 controls and (2) 10,590 individuals with seizure disorders vs. 483,779 controls. Dots represent -log10 of the meta-analysis P-values (PDEL and PDUP for deletions and duplications, respectively) of the cohort-specific Fisher exact tests for the enrichment of CNVs in cases vs. controls for each 200 kb sliding window. Genomic regions that surpassed the Bonferroni-corrected threshold for significance (red line, α = 3.74 × 10−6) were annotated with the genomic band containing the signal. Deletions (top) and duplications (mirrored) are shown.

Table 1 Genome-wide significantly associated CNV regions and credible intervals
Table 2 Known disease genes in the credible intervals of the seizure-associated CNV regions

Fine-mapping and candidate genes

Out of the three CNV regions with previous genome-wide statistical support, our fine-mapping approach narrowed down the critical seizure-relevant region for the known 15q11-q13 duplication to the imprinted promoter/exon 1 region of SNRPN (Table 2, Supplementary Fig. 1). The SNRPN promoter/exon 1 region was suggested to regulate the imprinting of the critical region for Prader-Willi syndrome28,29. Overexpression of SNRPN, corresponding to the seizure-associated duplication of the region, was found to cause abnormal neural development in cultured primary cortical neurons30. Conversely, SNRPN knockdown was found in the same study to also cause subtle neuronal abnormalities, in line with reports of short SNRPN deletions in Prader-Willi syndrome31. For the other two CNV regions with previous genome-wide statistical support, we identified several genes with a brain phenotype in the minimal credible intervals. The 15q13.2-q13.3 deletion credible interval includes the haploinsufficient gene OTUD7A, shown to cause abnormal development of cortical dendritic spines and dendrite outgrowth in Otud7aDEL/+ mice32, and KLF13, shown to cause a layer-specific decrease of cortical interneurons in Klf13DEL/+ mice33. The 16p13.11 deletion credible interval includes two haploinsufficient genes: MYH11, implicated in cerebrovascular disorders34,35 that are a risk factor for seizures36, and MARF1, involved in cortical neurogenesis37.

Out of the six seizure-associated CNV regions previously implicated in epilepsy without genome-wide statistical support, we mapped the credible intervals of the two seizure-associated deletions at 1p36 to the first and third known critical regions for seizures within the phenotype spectrum of the 1p36 deletion syndrome38. Known disease genes in the credible intervals at 1p36 are DVL1 (Robinow syndrome39), TMEM240 (Spinocerebellar ataxia 2140), and SKI (Shprintzen-Goldberg syndrome41). In the credible intervals of the remaining CNV regions, we identified the following known disease genes: (i) the haploinsufficient KIF26B gene (Pontocerebellar hypoplasia42) as the only gene affected by the 1q44 deletion, and (ii) PRRT2 (self-limited familial infantile epilepsy, paroxysmal dyskinesia43) and the haploinsufficient TAOK2 gene (Autism44) at the 16p11.2 BP4-BP5 deletion syndrome locus. Of note, single nucleotide variants in PRRT2 are among the most frequent findings in clinical genetic testing of epilepsy45.

Among the ten seizure-associated CNV regions previously reported in other neurological and psychiatric disorders, we identified one credible interval suggesting a different causal gene than previously reported: an interstitial 9q34.3 duplication not encompassing EHMT1, which is considered the causal gene based on one out of 22 reported 9q34.3 duplication carriers46. The top candidate gene within the credible interval identified by our meta-analysis is GRIN1, affected by 9q34.3 duplications in 21 of all reported carriers46. GRIN1 gain of function variants are known to cause a developmental epileptic encephalopathy, often with polymicrogyria47. In contrast, our fine-mapping analysis confirms TBX1 as the (known) causal gene for the 22q11.21 deletion/DiGeorge syndrome48. We also found LZTR1 (Noonan syndrome49) within the credible 22q11.21 deletion intervals. Other known disease genes in the credible intervals of the remaining CNV regions implicated in neurological and psychiatric disorders were: NPHP1 inside a 2q13 duplication (Autism and global developmental delay50,51), KANK1 (Cerebral palsy spastic quadriplegic 252) inside a small 9p24.3 DOCK8/KANK1 deletion, and NIPA1 (Autosomal dominant spastic paraplegia 653) inside the 15q11.2 BP1-BP2 deletion syndrome region.

Finally, we identified four novel CNV regions associated with seizures. Three out of four harbored known disease genes. The credible region of a non-canonical 16p13.3 duplication included STUB1. STUB1 gain of function was reported to cause early onset dementia syndrome54 and autosomal dominant ataxia with cognitive decline and autism55. The credible region of a non-canonical 17q21.31 deletion included BRCA1. BRCA1 mutations are well-known in cancer56, with BRCA1 as a possible mediator of glioma cell proliferation, migration, and glioma stem cell self-renewal57. The credible region of a novel 20q13.33 duplication included KCNQ2 and EEF1A2. KCNQ2 gain of function is known to cause neurodevelopmental disability and neonatal encephalopathy58,59. EEF1A2 gain of function was shown to cause neurodevelopmental disorders, including epilepsy and intellectual disability60.

Significantly enriched Gene ontology (GO) Biological Processes among all known brain-related disease genes in the credible intervals were: chordate embryonic development (GO:0043009 [http://amigo.geneontology.org/amigo/search/ontology?q=GO%3A0043009&searchtype=ontology]), sensory organ morphogenesis (GO:0090596 [http://amigo.geneontology.org/amigo/search/ontology?q=GO%3A0090596&searchtype=ontology]), mitotic G2 DNA damage checkpoint signaling (GO:0007095 [http://amigo.geneontology.org/amigo/search/ontology?q=GO%3A0007095&searchtype=ontology]), neural tube closure (GO:0001843 [http://amigo.geneontology.org/amigo/search/ontology?q=GO%3A0001843&searchtype=ontology]), negative regulation of Ras protein signal transduction (GO:0046580 [http://amigo.geneontology.org/amigo/search/ontology?q=GO%3A0046580&searchtype=ontology]), dendrite morphogenesis (GO:0048813 [http://amigo.geneontology.org/amigo/search/ontology?q=GO%3A0048813&searchtype=ontology]), and mitotic G2/M transition checkpoint (GO:0044818 [http://amigo.geneontology.org/amigo/search/ontology?q=GO%3A0044818&searchtype=ontology]). No GO Biological Process was significantly enriched when considering all genes inside all credible intervals, pointing to likely heterogeneous disease mechanisms of the 25 seizure-associated CNV regions. All credible intervals and known brain-related disease genes are detailed in Table 2, additional candidate genes of lower confidence are detailed in Supplementary Data 1, and all genes inside the credible intervals are detailed in Supplementary Data 2.

Most of the 25 identified risk CNVs are pleiotropic

We performed 23 meta-analyses of epilepsy with 23 other neuropsychiatric disorders (listed in Supplementary Table 2) in an additional 238,161 individuals with neuropsychiatric disorders and 492,324 controls to explore pleiotropy of the 25 identified CNVs. Of the 25 seizure-associated CNVs, 24 were significantly associated with a neuropsychiatric disorder in at least one of the 23 meta-analyses. The number of neuropsychiatric disorders with which a significant association was found and their greatest odds ratios are reported in Table 1. More than half (60%) of all CNVs were highly pleiotropic and showed significant associations in >10 epilepsy/neuropsychiatric disorder meta-analyses. The most frequently co-associated phenotype was “Neurodevelopmental abnormality” (HP:0012759 [https://hpo.jax.org/app/browse/term/HP:0012759]; associated with 36% of all seizure-associated CNVs).

Characterization of the clinical subphenotypes enriched in the carriers of each seizure-associated CNV in epilepsy patients with deep phenotypes

We performed phenome-wide association analyses for each of the 33 credible intervals identified across the 25 CNV regions to characterize the high-resolution clinical manifestations associated with each CNV. This analysis was performed on a subset of the Epi25 Collaborative cohort (Phenomic cohort, Supplementary Table 1) comprising 10,880 individuals with non-acquired epilepsy and deep phenotypic data (the clinical presentation of this cohort of 10,880 individuals and the frequencies of selected common and characteristic epilepsy phenotypes are provided in Supplementary Table 3). In the Phenomic cohort, 562 individuals (5.2%) carried at least one seizure-associated credible interval (N = 498 / 4.6% carried one credible interval, N = 64 / 0.6% carried 2–5 credible intervals). The most common credible interval (deletion at 2p21-p16.3) was carried by 114 (1.0%) individuals, and 18 credible intervals were found in at least 0.1% of the cohort (≥11 carriers). One CNV was not found (deletion at 9p24.3, containing a single credible interval). Across the 32 detected credible intervals and 1667 annotated HPO concepts, we identified 622 nominally significant associations (two-sided Fisher’s exact test, Supplementary Data 3). Given the large number of associations tested and that HPO annotations describing the same clinical feature at different levels of precision are highly correlated, we applied the minP step-down procedure to aid interpretation61, yielding 19 associations robust to multiple testing within each genetically defined group (minP-adjusted P < 0.05, Table 3, Figs. 2, 3, and Supplementary Fig. 2A–E).
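Each CNV-HPO association above rests on a two-sided Fisher's exact test of a 2×2 table (carrier status by phenotype annotation). The sketch below is a standard-library illustration of that test. It is not the study's code, the function name is hypothetical, and the example table is a toy dataset rather than cohort data. The two-sided P-value sums the probabilities of all tables that are no more likely than the observed one, which is the usual convention.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: carriers / non-carriers; columns: with / without the phenotype.
    Returns (sample odds ratio, two-sided P-value).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x phenotype-positive carriers.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    odds = (a * d) / (b * c) if b * c else float("inf")
    return odds, min(1.0, p)


# Toy table (not study data): 3 of 4 carriers vs. 1 of 4 non-carriers annotated.
odds_ratio, p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(odds_ratio, round(p_value, 4))  # 9.0 0.4857
```

With 32 credible intervals and 1667 HPO concepts, tens of thousands of such tables are tested, which is why nominal P-values alone are labelled suggestive and a resampling-based adjustment such as minP is applied on top.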

Table 3 Significant individual CNV-HPO associations
Fig. 2: Genotype-first phenomic analysis in 10,880 individuals with detailed clinical data.

For each CNV, the proportion of carriers and non-carriers annotated with each HPO concept is plotted. Those above the diagonal were enriched among carriers, and those below were depleted. Odds ratios are represented by dot size. The selected phenotypes labeled were prioritized according to statistical evidence and clinical breadth. Full results for all associations reaching unadjusted P < 0.05 are provided in Supplementary Data 3. SUDEP sudden unexpected death in epilepsy, CNS central nervous system, EEG electroencephalogram.

Fig. 3: Summary clinical signatures of CNVs in a deeply phenotyped epilepsy cohort.

The percentage of carriers of the CNV with each broad phenotype is shown by the height of bars arranged on a polar axis, with two-sided 95% confidence interval error bars for these percentages derived from the binomial distribution using stats::binom.test(). For reference, dots indicate the percentage of the entire Phenomic cohort of 10,880 people with each broad phenotype (representing the prior probability of a person having the phenotype without genetic stratification). The binomial distribution two-sided 95% confidence intervals for a cohort size of 10,880 are no wider than 1.9% (not shown for clarity). “Craniofacial or skeletal dysmorphism” includes individuals with either “Abnormality of the head [HP:0000234]” (which excludes isolated brain structural abnormalities) or “Abnormal skeletal morphology [HP:0011842]”. “Motor, movement or muscular disorder” includes individuals with any of “Abnormal central motor function [HP:0011442]”, “Abnormality of movement [HP:0100022]” or “Abnormality of the musculature [HP:0003011]”, but not “Motor delay [HP:0001270]”, which is included in “Neurodevelopmental abnormality”. While “Neurodevelopmental abnormality” includes those with “Intellectual disability”, the latter is shown additionally as it is a neurodevelopmental outcome with particularly important socioeconomic consequences. EEG electroencephalogram. Further CNV profiles are shown in Supplementary Fig. 2.
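The statement that cohort-level confidence intervals are no wider than 1.9% can be checked approximately. The sketch below uses the normal (Wald) approximation, which is an assumption made for simplicity: the caption's intervals come from the exact binomial method in stats::binom.test(), so the exact widths are marginally larger than the figure computed here. The interval is widest when the proportion is 0.5.

```python
from math import sqrt

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def max_wald_ci_width(n: int, z: float = Z_95) -> float:
    """Upper bound on the width of a Wald 95% CI for a proportion.

    The width 2 * z * sqrt(p * (1 - p) / n) is maximal at p = 0.5.
    """
    return 2.0 * z * sqrt(0.25 / n)


width = max_wald_ci_width(10_880)
print(f"{100 * width:.2f}%")  # 1.88%
```

By contrast, the same bound for a group of 25 carriers is roughly 39%, which is why error bars are drawn for the per-CNV bars but omitted for the cohort-wide reference dots.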

Carriers of deletions at 1p36.33 [0.91–1.51 Mb] (N = 25, 0.23% of the Phenomic cohort), 1p36.33 [2.02–2.49 Mb] (N = 17, 0.16%), or 15q12-q13.1 (N = 4, 0.037%), and carriers of duplications at 15q11.2-q13.3 (N = 46, 0.42%) were enriched with clinical features suggestive of developmental and epileptic encephalopathies, such as epileptic spasms and tonic seizures, epileptic encephalopathy, and other neurodevelopmental disorders, sudden unexpected death in epilepsy, and morphological abnormalities62. Features characterizing genetic generalized epilepsy were associated with deletions at 2p21-p16.3 (N = 114, 1.05%, generalized tonic-clonic and absence seizures), 15q11.2 (N = 56, 0.52%, eyelid myoclonia and absence seizures), 16p13.11 (N = 42, 0.39%, generalized tonic-clonic seizures), 15q13.2-q13.3 (N = 24, 0.22%, absence seizures) or 22q11.21 [20.65–21.54 Mb] (N = 6, 0.055%, juvenile myoclonic epilepsy-like features). Duplications at 16p11.2 (N = 8, 0.074%) were associated with non-epileptic seizures comorbid with epilepsy (OR = 81.5, unadjusted P = 4.82 × 10−4, minP-adjusted P = 0.0297), and showed a nonsignificant greater frequency of microcephaly (OR = 31.5, unadjusted P = 3.62 × 10−2, minP-adjusted P = 0.92) that replicates the mirror microcephaly/macrocephaly phenotype of the reciprocal 16p11.2 CNVs63.

We interrogated the phenotypic annotations of CNV carriers regarding the candidate genes prioritized in our fine-mapping analysis. MSH2 was prioritized as the candidate gene for the most common deletion in the Phenomic cohort (2p21-p16.3). Heterozygous loss of function variants of the haploinsufficient gene MSH2 cause Lynch syndrome 164, and complete knockout of the mouse ortholog Msh2 in Ccm1+/- mice causes multiple cavernoma through a presumed second hit65. We found that carriers had a nonsignificantly greater frequency of neoplasms (OR = 2.35, unadjusted P = 2.49 × 10−2, minP-adjusted P = 1.00) and cerebral cavernomata (OR = 5.23, unadjusted P = 6.58 × 10−4, minP-adjusted P = 0.157) than non-carriers. Carriers of the 1p36.33 [2.02–2.49 Mb] deletion overlapping the gene SKI had features (hypotonia, talipes equinovarus, abnormalities of the globe and nose, osteoporosis, global developmental delay, and Chiari malformation) concordant with the Shprintzen-Goldberg craniosynostosis syndrome caused by SKI41. All 15 individuals with duplication of 9q34.3 had focal-onset seizures that were rarely drug-resistant, and none was annotated with a neurodevelopmental disorder or polymicrogyria despite the presence of GRIN1, which can cause polymicrogyria when affected by gain-of-function variants47. Sixteen of 24 individuals carrying deletions at 15q13.3 [31.06–32.51 Mb] had generalized absence seizures (OR = 10.5, unadjusted P = 3.70 × 10−8, minP-adjusted P = 1 × 10−5), in line with the primary seizure type reported in carriers of the 15q13.3 deletion66. Finding generalized myoclonic seizures in half of the carriers of the 22q11.2 [19.67–19.96 Mb] deletion further confirmed TBX167, the known causal gene for the 22q11.21 deletion/DiGeorge syndrome48. Features suggestive of juvenile myoclonic epilepsy were also found among six people carrying deletions overlapping with the second credible interval at 22q11.2 [20.65–21.54 Mb], which spans the Noonan syndrome 10 locus containing LZTR1, in which a single individual was reported with seizures49. However, none of these six individuals had annotations beyond seizures and electroencephalography phenotypes that would support a multisystemic syndrome.

Finally, clinicians may want to know the frequency of broad clinical features among carriers of the CNV identified in their patients to improve the interpretation of its clinical relevance and to facilitate genetically stratified prognostication. Therefore, we prioritized 17 common, conceptually broad, and important epilepsy manifestations and comorbidities for visualization, including the co-occurrence of generalized-onset and focal-onset seizures that characterizes the combined generalized and focal epilepsy type62 (Fig. 3 and Supplementary Fig. 3A–E). The most common CNV, deletion at 2p21-p16.3, appeared to modestly increase the likelihood of a carrier having generalized epilepsy. However, a few CNVs had a profile dominated by core electroclinical features of generalized (for example, deletions at 15q13.2–15q13.3) or focal epilepsy (duplications at 9q34.3 [139.89–140.12 Mb]), with comorbid features being rare. Conversely, carriers of other CNVs had relatively high frequencies of neurodevelopmental disorders, epileptic spasms, and drug resistance suggestive of developmental and epileptic encephalopathy (deletions at 1p36.33). However, no CNV was found exclusively in people with a particular seizure type, and carriers of some CNVs appeared to have broad clinical features at frequencies indistinguishable from the cohort’s baseline (duplications at 19p13.3), suggesting some generic contribution to epilepsy risk across epilepsy types.

Discussion

In this study, we leveraged a substantial increase in sample size to identify novel seizure-associated CNVs when jointly analyzing 26,699 individuals with various types of seizure disorders against 492,324 population controls. We identified 22 novel loci with genome-wide significance for seizure disorders. In addition, all three loci previously reported as epilepsy-associated at the genome-wide level maintained genome-wide significance for seizure disorders in our meta-analysis, which included the epilepsy cohort from the previous study20. Of the 25 seizure-associated loci, 16 were previously implicated in neurological and psychiatric disorders, including epilepsy. Five were flanked by known segmental duplications (SDs) or low copy number repeats (LCRs). Of note, our fine-mapping analysis confirmed the first and third known critical regions for seizures within the phenotype spectrum of the 1p36 deletion syndrome38, confirmed TBX1 as the (known) causal gene for the 22q11.21 deletion/DiGeorge syndrome48, and suggested the SNRPN promoter/exon 1 region as the causal element for seizures within the larger BP2-BP3 15q11.2-q13 duplication region. However, our study design did not allow us to assess whether the imprinting status of the duplicated region itself plays an additional role besides the previously suggested role of the SNRPN promoter/exon 1 region in regulating the imprinting of the Prader-Willi critical region. Future studies that also include genomic screens of parents will shed light on this open question.

In a high-resolution phenomic analysis in a subset of 10,880 individuals from our cohort with epilepsy (from the Epi25 cohort), we identified 622 suggestive and 19 significant clinical associations informative for epileptologists among CNV carriers. This observation indicates that beyond contributing to the generic risk of seizures, several CNVs contribute to specific epilepsy types. Carriers of some CNVs tended to have features typical of developmental and epileptic encephalopathies with neurodevelopmental and non-seizure phenotypes. Conversely, carriers of others had phenotypes restricted to the core epileptic features of seizures and electroencephalographic abnormalities (both generalized and focal). Interestingly, reciprocal CNVs involving 22q11.21 seemed to produce opposite epilepsy types, with deletion and duplication carriers tending to have generalized and focal epilepsies, respectively. Dose-dependent effects of KLHL22 on DEPDC5 degradation are a possible explanation68. Overall, the high degree of pleiotropy among seizure-associated CNVs implies that these CNVs likely impair neurodevelopmental processes rather generically and contribute to the broad spectrum of neurodevelopmental disorders. According to the oligo-/polygenic inheritance model, CNVs may interact with the genetic background or environmental factors to generate the final disease phenotype. Interaction between CNVs and the polygenic background was recently demonstrated in carriers of the schizophrenia-associated 22q11.2 deletion69. Support for an oligogenic-CNV disorder model was also recently published70.

Genome-wide genetic screening for pathogenic CNVs is recommended as a first-tier approach for the postnatal evaluation of individuals with intellectual disability, developmental delay, autism spectrum disorder, or multiple congenital anomalies, and for the prenatal evaluation of fetuses with structural anomalies observed by ultrasound71,72,73. It has previously been shown that CNVs confer significant risk towards epilepsy1,2,4,5,6,7,8,10,13,74, particularly for individuals with comorbid neurodevelopmental disorders such as intellectual disability21,74,75,76. In contrast to single nucleotide polymorphism (SNP) GWASs for epilepsy or seizures, where the risk conferred by identified variants is small (OR < 2)77,78, the effect sizes of the 25 CNVs identified in this study are large (median OR = 11, range 2–53). Our high-resolution phenomic analysis of 10,880 individuals with epilepsy grouped by CNV carrier status illustrates the seizures, EEG and brain imaging findings, and neurodevelopmental and other co-morbidities associated with each CNV. This genotype-first approach complements the traditional single-phenotype, case-control paradigm by taking a simultaneous phenome-wide perspective in individuals deeply phenotyped according to standardized protocols before CNV discovery or genetic association tests. We found phenotypic evidence supporting associations between CNVs, broad markers of epilepsy types, and fine-grained phenotypes. For each CNV, the high-resolution phenotype associations from the HPO analysis, which an epileptologist can recognize, together with the disease risk estimates from the meta-analysis, can enhance the interpretation of clinical relevance and pathogenicity following the American College of Medical Genetics and Genomics copy number variant interpretation guidelines24.

Our study has several limitations. First, many of the patients with seizures included in this study have comorbid neurological and psychiatric disorders. Therefore, some of the identified CNV loci may be associated with other clinical phenotypes present in a high percentage of all cases. Second, we did not detect robust associations with two important outcomes in our HPO analysis, refractory drug response and sudden unexpected death. Sudden unexpected death in epilepsy is poorly suited to cross-sectional studies: it was annotated to only 4 of 10,880 individuals, far fewer cases than expected to occur with follow-up of this cohort of individuals requiring tertiary center care79. This emphasizes the open-world interpretation required for our results: in any study that is cross-sectional and of a disorder that has inherently variable phenotyping depth (epilepsy presentations can often be classified only incompletely)1,62, and which is characterized by some phenotypes that are age-dependent (such as some seizure types, autism, and intellectual disability), one should rarely assume that the absence of an annotation can be interpreted as the absence of that phenotype over the lifetime of the carrier. Thus, the proportion of individuals annotated with a phenotype is likely lower than the actual proportion manifesting it over their lifetimes80. Third, in contrast to conventional SNP-based GWASs, CNV-GWASs have major challenges in identifying the causal gene(s) impacted by the CNV. Among the 25 identified CNVs, deletions ranged from 230 kb to 5 Mb and duplications from 290 kb to 9 Mb, affecting 14.2 genes on average. CNV breakpoints in the current study are estimated from genotyped SNPs around the actual breakpoint. These breakpoint estimates are limited by the resolution of the genotyping platform used to call the CNVs. In fact, microarrays have many technical limitations, such as poor breakpoint resolution and limited sensitivity for small CNVs81. 
Newer technologies like whole-genome sequencing (WGS) will enable the assessment of a more comprehensive array of rare variants, including balanced rearrangements, small (exonic) CNVs82, short tandem repeats, and other structural variants83. However, some genomic regions harbor complex deletion/duplication/inversion rearrangements (e.g., 22q11.2184, 15q11.285) that can even show population stratification (e.g., 16p11.286). More accurate and complete (pangenome) references will be needed to determine the exact breakpoints of such complex rearrangements87,88, even in the case of sequencing-based CNV discovery. Lastly, we performed joint epilepsy/seizures and cross-disorder meta-analyses in individuals with minimal clinical information. Future studies with access to rich clinical metadata, such as electronic health records, will likely identify additional seizure-associated CNVs. It is important to consider the inclusion criteria for this cohort and the definition of cases and controls when interpreting associations and their relevance to a patient. Our phenomic analysis was performed using the years 1–3 data of the Epi25 Collaborative, whose participants were predominantly recruited from academic epilepsy centers and of European ancestry (92.9%, see Online Methods). Additionally, we screened cases to exclude those with brain trauma, meningitis, or encephalitis. Thus, our clinical associations should be considered most valid in individuals of European ancestry with likely genetic or unexplained epilepsies attending specialist epilepsy centers. Future analyses of data from subsequent years of Epi25 will provide results more applicable to other populations.

Large-scale collaborations that enable the aggregation of massive datasets have greatly advanced epilepsy research and the discovery of genetic factors through GWASs. Here, we have extended this framework to CNV discovery by meta-analyzing epilepsy and seizure disorders, followed by additional meta-analyses in neuropsychiatric disorders and traits to explore pleiotropy. We also identified fine-grained genotype-phenotype associations and clinical profiles for each CNV. Our results will help refine promising candidate CNVs associated with specific epilepsy types and extend their clinical value. We are confident that applying this framework to even larger datasets has the potential to advance the discovery of all clinically relevant risk loci, ultra-rare high-risk CNVs missed by this study, and the underlying genes or functional elements.

Methods

Study cohorts

Each center’s ethics committees/institutional review boards approved data collection and use. For the Epi25 cohort, patients or their legal guardians provided signed informed consent/assent according to local IRB requirements; as samples had been collected over 20 years in some centers, forms reflected standards at the time of collection. For Epi25 Consortium samples collected after 25th January 2015, forms required specific language according to the NIH Genomic Data Sharing Policy.

Individuals with clinically defined epilepsy - Epi25 Collaborative

Individuals with ILAE-defined epilepsy (N = 16,109) were collected through the Epi25 Collaborative. The epilepsy diagnosis was performed according to clinical criteria (clinical interview, neurological examination, EEG, imaging data), following International League Against Epilepsy (ILAE) classifications89. All cohorts are detailed in Supplementary Table 1. All individuals of the Epi25 Collaborative cohort were selected to be of principal component analysis (PCA)-defined European ancestry. Ancestry-matched population controls (N = 8545) for the Epi25 arm of the study were recruited through (1) the Epi25 Collaborative, (2) a Broad Institute project on inflammatory bowel disease without reported epilepsy (part of the IBD Genetics Collaborative, IBDGC), (3) healthy individuals from the Genetics of Personality Collaborative (GPC), and (4) the THL Institute for Health and Welfare (subsample of the FINRISK study)90. Genotyping for all cases and controls was performed on the same genotyping array (Illumina Infinium Global Screening Array, GSA-MD v1.0) and at the same center (Broad Institute). For a detailed description, see ref. 20.

CNV calling and quality control - Epi25 Collaborative

We restricted our analysis to autosomal CNVs due to a higher quality of calls and followed the quality control (QC) pipeline developed in our previous study20. In detail, QC was performed in two major steps: (1) pre-CNV calling QC and (2) post-CNV calling QC. For pre-CNV calling QC, we excluded samples with a call rate <0.96 or discordant sex status. To select individuals of European ancestry, we filtered autosomal SNPs for low genotyping rate (<0.98), a high difference in the SNP minor allele frequency between cases and controls (>0.05), and deviation from Hardy-Weinberg equilibrium (HWE, P ≤ 0.001), and pruned the remaining SNPs for linkage disequilibrium (--indep-pairwise 200 100 0.2) using PLINK v1.991. We then performed a principal component analysis (PCA) of the Epi25 cases and controls using PLINK v1.991 and GCTA92. European individuals were defined as individuals clustering with the 1000 Genomes Project93 European samples. We created GC wave-adjusted LRR (Log-R ratio) intensity files for all samples using PennCNV, generated a custom population B-allele frequency file, and employed PennCNV’s CNV calling algorithms2,94 to detect CNVs in our dataset.
The post-CNV calling QC included the following steps: (1) CNV calls of the same type (deletion or duplication) were merged if the number of SNP/intensity markers between them was <20% of the total number when both segments were combined; (2) CNVs supported by <20 markers, <20 kb long, and with a SNP density <0.0001 were excluded from subsequent analyses; (3) CNVs that overlapped other CNVs in ≥1% of all samples within the Epi25 dataset were excluded to remove potential platform-specific artifacts; (4) CNVs with >50% overlap with telomeric, centromeric, and immunoglobulin regions of the hg19 reference assembly were excluded; (5) CNVs with ≥50% overlap with reported common CNVs (allele frequency >1%) in two independent CNV reference catalogs (DGV Gold Standard Dataset95; DECIPHER Population Copy-Number Variation Frequencies96) were excluded. Finally, the probe-level intensity plots of all CNVs supporting the seizure-associated regions (Table 1) were visually inspected to exclude any remaining artifacts. The DGV Gold Standard and DECIPHER Population frequencies of the remaining CNVs are given in Supplementary Table 4.
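The merging rule in step (1) can be illustrated with a minimal Python sketch. It assumes that "the total number when both segments were combined" means the markers in both calls plus the markers in the gap between them; the function name and the example counts are hypothetical and are not taken from the study pipeline.

```python
def should_merge(markers_a, markers_gap, markers_b, max_gap_fraction=0.20):
    """Decide whether two adjacent same-type CNV calls should be merged.

    Assumption: the combined total is the number of markers in both calls
    plus the markers lying in the gap between them.
    """
    total = markers_a + markers_gap + markers_b
    return markers_gap / total < max_gap_fraction


# Hypothetical example: two calls of 40 and 50 markers separated by 10 markers.
merge_close_calls = should_merge(40, 10, 50)    # gap is 10% of the combined total
merge_distant_calls = should_merge(10, 10, 10)  # gap is about 33% of the combined total
```

Under this reading, the first pair would be merged and the second would remain as two separate calls.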

Individuals with seizures or neuropsychiatric phenotypes - neuropsychiatric disorders cohort

A large CNV dataset from individuals with a range of neuropsychiatric disorders (including seizure disorders) was aggregated from 17 different sources by Collins et al.97. The contributors of each cohort provided the specific clinical phenotypes. The aggregated individuals were grouped into 54 partially overlapping disease phenotypes standardized through the Human Phenotype Ontology98. The 54 different phenotypes of Collins et al.97 were obtained through a recursive hierarchical clustering that defined a minimal set of nonredundant primary phenotypes, each including >300 samples in at least three independent cohorts and >3000 samples in total across all cohorts, and having less than 80% sample overlap with any other phenotype. Of the 54 phenotypes, we selected only neurological and psychiatric HPO-based phenotypes (N = 23, excluding Seizures, Supplementary Table 2). The architecture of these HPO-based phenotypes allows the identification of associations at different levels, from broad to narrow phenotypes, providing the opportunity to distinguish between pleiotropic and specific associations. This data set also included the Epi25 cohort from our previous CNV GWAS study20. This previous (outdated) Epi25 cohort was excluded from the neuropsychiatric cohort for cross-disorder meta-analyses in the present work. All the considered cohorts are listed in Supplementary Table 1. This aggregated CNV dataset comprised 248,751 individuals affected by at least one of 24 neuropsychiatric disorders, including 10,590 individuals with seizures, and 483,779 population controls.

Quality control - neuropsychiatric disorders cohort

The CNV harmonization procedure for the neuropsychiatric cohort is described in the Supplementary Materials of Collins et al.97 and included the following steps: (1) CNV calls of the same type (deletion or duplication) were merged if their breakpoints were within ±25% of the size of their corresponding original CNV calls to avoid over-segmentation of large CNV calls; (2) CNVs not mapped to autosomes from the primary hg19 assembly were excluded; (3) only CNVs ≥100 kb and ≤20 Mb in size were considered; (4) CNVs that matched reported common CNVs (allele frequency >1%) in three independent CNV reference catalogs derived from genome sequencing (Abel et al.99; Collins et al.100; Sudmant et al.81) were excluded; (5) CNVs that overlapped other CNVs in ≥1% of samples within the same dataset or in any of the other array CNV datasets were excluded to remove potential platform-specific artifacts; (6) all CNVs with ≥30% overlap with somatic hypermutable sites, segmental duplications, simple/low-complexity/satellite repeats, or N-masked bases of the hg19 reference assembly were excluded.

Genome-wide association analysis

We performed segment-based CNV burden analyses to identify genomic regions with a significant increase of CNVs in epilepsy cases compared to controls, separated by CNV type (deletion or duplication). We adopted a sliding window approach as introduced by Collins et al.26. The sliding window model allowed association testing of all autosomes through 267,237 sliding windows characterized by a window size of 200 kb and a step size of 10 kb, corresponding to 13,339.6 non-overlapping windows. Each of these windows was required to have a low overlap (<30%) with hypermutable sites, segmental duplications, simple/low-complexity/satellite repeats, and N-masked regions. For each of the genomic regions, we counted the number of overlapping CNVs separately for cases and controls for each CNV type (deletion or duplication). We required an overlap between the CNV and the genomic window of ≥10% to reveal the potential burden of small deletions or duplications (size ≥ 20 kb). We used the one-sided Fisher test as the test statistic for the CNVs collapsed for each segment. Case/control CNV counts and the Fisher tests were performed using the CNV docker available at https://hub.docker.com/r/talkowski/rcnv and custom Python (version 3.7.9) and R (version 3.6.1) scripts. The same procedure was applied to the cohorts of the neuropsychiatric disorder dataset, as detailed in Collins et al.26.
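As an illustration only, the window construction, the ≥10% overlap rule, and the one-sided Fisher test can be sketched in Python as follows. This is not the rCNV pipeline used in the study: the function names, the reading of "overlap" as the fraction of the window covered by the CNV, and the toy chromosome length and carrier counts are all assumptions made for the example.

```python
from math import comb


def sliding_windows(chrom_length, size=200_000, step=10_000):
    """Yield (start, end) for every window that fits inside the chromosome."""
    start = 0
    while start + size <= chrom_length:
        yield start, start + size
        start += step


def overlap_fraction(cnv, window):
    """Fraction of the window covered by the CNV (assumed overlap definition)."""
    shared = min(cnv[1], window[1]) - max(cnv[0], window[0])
    return max(shared, 0) / (window[1] - window[0])


def fisher_one_sided(case_hits, n_cases, ctrl_hits, n_ctrls):
    """One-sided Fisher exact P-value for enrichment of carriers among cases.

    Computed as the upper tail of the hypergeometric distribution.
    """
    total_hits = case_hits + ctrl_hits
    denominator = comb(n_cases + n_ctrls, total_hits)
    upper = min(total_hits, n_cases)
    numerator = sum(
        comb(n_cases, k) * comb(n_ctrls, total_hits - k)
        for k in range(case_hits, upper + 1)
    )
    return numerator / denominator


# Toy example: a hypothetical 1 Mb chromosome and a 20 kb deletion.
windows = list(sliding_windows(1_000_000))
deletion = (100_000, 120_000)
counted = [w for w in windows if overlap_fraction(deletion, w) >= 0.10]

# Hypothetical counts: 3 carriers among 10 cases, 0 among 10 controls.
p_value = fisher_one_sided(case_hits=3, n_cases=10, ctrl_hits=0, n_ctrls=10)
```

With a 200 kb window, a 20 kb CNV lying fully inside the window covers exactly 10% of it, which is why the ≥10% threshold matches the minimum CNV size of 20 kb.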

Meta-analysis and fine-mapping

Fixed-effects meta-analyses were performed using the metafor R (version 3.6.1) package with an empirical continuity correction101 and a saddlepoint re-approximation of the null distribution used for inference. The meta-analysis procedure is detailed in Collins et al.26. We meta-analyzed the effect sizes from 7 GWAS derived from the 17 cohorts of the neuropsychiatric disorder dataset with each segment-based P-value of the Epi25 dataset. The threshold for genome-wide significance was set to α = 3.74 × 10⁻⁶ after Bonferroni correction for multiple testing, corresponding to the number of independent, non-overlapping 200 kb windows, calculated by merging all overlapping windows and dividing the sum of their sizes by 200 kb (effective N = 13,339.6 independent windows). To account for possible cohort-specific biases, we required each segment to fulfill the following additional criteria: (1) at least two cohorts featuring nominally significant P-values (P < 0.05) for the given segment, and (2) a meta-analysis P < 0.05 after excluding the single most significant cohort. We then used a Bayesian algorithm102 to identify the minimal credible interval(s) that contained the causal element(s) or genes with 95% confidence, as in Collins et al.97. Finally, we explored the known biological function of all genes within the credible intervals and performed pathway analyses using Enrichr103,104 (https://maayanlab.cloud/Enrichr/). All resources used to investigate the knowledge basis of all seizure-associated CNV regions are described in Supplementary Table 5.
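The arithmetic behind the significance threshold and the two cohort-robustness criteria can be sketched as follows. This is a simplified illustration under stated assumptions, not the published code: the function names and the toy windows are hypothetical, and the leave-one-out meta-analysis P-value is taken as an input rather than computed.

```python
def effective_window_count(windows, size=200_000):
    """Merge overlapping (chrom, start, end) windows and divide the merged length by the window size."""
    merged_length = 0
    current = None  # (chrom, start, end) of the interval being extended
    for chrom, start, end in sorted(windows):
        if current is not None and chrom == current[0] and start <= current[2]:
            current = (chrom, current[1], max(current[2], end))
        else:
            if current is not None:
                merged_length += current[2] - current[1]
            current = (chrom, start, end)
    if current is not None:
        merged_length += current[2] - current[1]
    return merged_length / size


def passes_cohort_checks(cohort_p_values, leave_one_out_p, nominal=0.05):
    """Apply the two additional criteria described in the text.

    leave_one_out_p is the meta-analysis P-value after removing the single
    most significant cohort (assumed to be computed elsewhere).
    """
    n_nominal = sum(p < nominal for p in cohort_p_values)
    return n_nominal >= 2 and leave_one_out_p < nominal


# Bonferroni threshold for the effective number of windows reported in the text.
ALPHA = 0.05 / 13_339.6  # approximately 3.748e-6

# Toy example: two overlapping windows and one isolated window on "chr1".
toy_windows = [("chr1", 0, 200_000), ("chr1", 10_000, 210_000), ("chr1", 500_000, 700_000)]
toy_effective_n = effective_window_count(toy_windows)
```

In the toy example the merged windows span 410 kb, giving an effective count of 2.05 windows instead of 3.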

Detailed HPO characterization of Epi25 participants

To identify phenotypic associations with each of the CNVs within a cohort of individuals with epilepsy, we translated clinical data from years 1–3 of the deeply phenotyped Epi25 Collaborative international cohort into Human Phenotype Ontology (HPO, version released 2022-02-14) concepts, following our optimization of the HPO for epilepsy phenotypes105. We selected only individuals with CNV data and sufficiently detailed clinical data (as of 2022-01-25) to confirm the presence of seizures or epileptic encephalopathy with continuous spike-and-wave in sleep (EE-SWAS, an epilepsy syndrome in which overt clinical seizures may not always be observed). Categorical clinical data were mapped to HPO concepts using a data dictionary. Free text data were annotated with HPO terms manually (D.L.S. under the supervision of I.H. and R.H.T.)25. Quantitative data related to the gestational age, weight, and head circumference at birth were categorized to match HPO definitions using sex-stratified distributions from the INTERGROWTH-21st Project using the R growthstandards package (version 0.1.5)106.

We inferred all HPO concepts applicable to each individual from those translated from the clinical data by propagation, following the is_a relationships between HPO concepts as previously described107, using the R ontologyIndex package (version 2.7)108. We excluded HPO terms that carried no information in the context of this cohort (those that were annotated ubiquitously) and modified the relationships of others, tailoring them to this analysis (Supplementary Table 6). Phenotypes were annotated as being explicitly present or not, without annotating any phenotypes as being explicitly absent. Taking this open-world perspective is conservative, meaning that the proportion of individuals in a group annotated with a particular phenotype should be considered a lower limit. It still allows statistical testing of phenotypic associations, and it mitigates the risk of explicitly annotating a phenotype as absent when it was present but not recorded or when the individual will manifest it at some point in the future80.
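The propagation step can be illustrated with a short Python sketch. The study used the R ontologyIndex package; the code below is only a generic ancestor-closure example, and the three-term parent map is a toy fragment built from the seizure terms named in the phenome-wide association section, not the full HPO.

```python
def propagate(annotated_terms, parents):
    """Return the annotated terms together with all of their is_a ancestors."""
    inferred = set()
    stack = list(annotated_terms)
    while stack:
        term = stack.pop()
        if term not in inferred:
            inferred.add(term)
            stack.extend(parents.get(term, ()))
    return inferred


# Toy is_a fragment (child -> parents); illustrative only.
TOY_PARENTS = {
    "HP:0011147": ["HP:0002121"],  # Typical absence seizure
    "HP:0002121": ["HP:0002197"],  # Generalized non-motor (absence) seizure
    "HP:0002197": [],              # Generalized-onset seizure
}

# An individual annotated only with the most specific term.
individual_terms = propagate({"HP:0011147"}, TOY_PARENTS)
```

An individual annotated with Typical absence seizure is thereby also counted for both broader seizure concepts, whereas nothing is inferred in the opposite direction, consistent with the open-world perspective.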

After excluding individuals with markers of acquired epilepsy that are unlikely to be part of the phenotype, such as significant brain trauma, encephalitis, or meningitis, 10,880 individuals from the genomic analysis had adequate phenotypic data available for analysis. Of these, 10,106 individuals are of European ancestry, 602 of East Asian ancestry, and 172 of African ancestry, according to PCA. After propagation to infer generic phenotypic descriptors from specific ones, this cohort had 214,203 informative annotations (median = 17 per individual, range = 1–128), spanning a repertoire of 1667 phenotypic concepts. The frequency of annotation of all 1667 phenotypes is available in Supplementary Data 4.

Phenome-wide association analysis of CNVs

All association analyses and phenomic visualizations were performed in R. Associations between CNVs and HPO concepts were calculated using Fisher’s exact test (function fisher.test from the stats package). The tested phenotypes were all 1667 HPO terms translated from clinical data that were informative (not ubiquitous) and are detailed in Supplementary Table 3. While this was a descriptive analysis, given the large number of tests performed ((29 groups of multiple individuals + 2 groups of a single individual) × 1667 HPO concepts = 51,677), we sought to aid identification of the most robust associations. Bonferroni’s single-step and Holm’s step-down adjustments are overly conservative given the dependence structure of propagated HPO annotations. For example, after full harmonization, annotations of Typical absence seizure [HP:0011147], Generalized non-motor (absence) seizure [HP:0002121], and Generalized-onset seizure [HP:0002197] will be highly correlated because an individual cannot have the first without the second or the second without the last as a result of their is_a relationships in the HPO. Therefore, we applied the minP step-down procedure, which uses a permutation-based approach to control the family-wise error rate61. We selected 100,000 randomly generated groups of individuals from the Epi25 phenomic analysis cohort of size N, where N is the number of carriers of each CNV. Then, for each of these groups, we calculated the two-sided Fisher’s exact test P-values for every one of the 1667 HPO concepts. We used the adj_Wstep function from the NRejections package (version 1.2.0) in R to perform the step-down procedure. This generated P-values corrected for the correlation-adjusted number of tested HPO annotations. We did not adjust P-values across CNVs because we were interested only in identifying those associations that were most robust in this descriptive analysis.
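For readers unfamiliar with the minP step-down adjustment, the following Python sketch shows a generic Westfall-Young style implementation. It is not the NRejections code used in the study; the function name and the small observed and resampled P-value tables are hypothetical, with each resampled row standing in for the P-values obtained from one randomly generated group.

```python
def minp_step_down(observed, resampled):
    """Generic minP step-down adjustment.

    observed:  list of m observed P-values.
    resampled: list of rows, each holding m P-values from one random group.
    Returns adjusted P-values in the original order.
    """
    m = len(observed)
    order = sorted(range(m), key=lambda j: observed[j])  # most significant first
    counts = [0] * m
    for row in resampled:
        running_min = 1.0
        # Successive minima, starting from the least significant hypothesis.
        for rank in range(m - 1, -1, -1):
            j = order[rank]
            running_min = min(running_min, row[j])
            if running_min <= observed[j]:
                counts[rank] += 1
    adjusted = [0.0] * m
    previous = 0.0
    for rank, j in enumerate(order):
        # Enforce monotonicity along the ordered hypotheses.
        previous = max(previous, counts[rank] / len(resampled))
        adjusted[j] = previous
    return adjusted


# Hypothetical toy input: 3 phenotypes, 4 random groups.
toy_observed = [0.01, 0.5, 0.03]
toy_resampled = [
    [0.2, 0.9, 0.4],
    [0.005, 0.6, 0.8],
    [0.3, 0.02, 0.7],
    [0.9, 0.9, 0.9],
]
toy_adjusted = minp_step_down(toy_observed, toy_resampled)
```

Because each resampled row preserves the correlation between phenotypes, highly dependent annotations (such as a term and its ancestors) are penalized less than they would be under a Bonferroni adjustment.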

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.