Rare copy number variation in posttraumatic stress disorder

Posttraumatic stress disorder (PTSD) is a heritable (h2 = 24–71%) psychiatric illness. Copy number variation (CNV) is a form of rare genetic variation that has been implicated in the etiology of psychiatric disorders, but no large-scale investigation of CNV in PTSD has been performed. We present an association study of CNV burden and PTSD symptoms in a sample of 114,383 participants (13,036 cases and 101,347 controls) of European ancestry. CNVs were called using two calling algorithms and intersected to a consensus set. Quality control was performed to remove strong outlier samples. CNVs were examined for association with PTSD within each cohort using linear or logistic regression analysis adjusted for population structure and CNV quality metrics, then inverse variance weighted meta-analyzed across cohorts. We examined the genome-wide total span of CNVs, enrichment of CNVs within specified gene-sets, and CNVs overlapping individual genes and implicated neurodevelopmental regions. The total distance covered by deletions crossing over known neurodevelopmental CNV regions was significant (beta = 0.029, SE = 0.005, P = 6.3 × 10−8). The genome-wide neurodevelopmental CNV burden identified explains 0.034% of the variation in PTSD symptoms. The 15q11.2 BP1-BP2 microdeletion region was significantly associated with PTSD (beta = 0.0206, SE = 0.0056, P = 0.0002). No individual significant genes interrupted by CNV were identified. 22 gene pathways related to the function of the nervous system and brain were significant in pathway analysis (FDR q < 0.05), but these associations were not significant once NDD regions were removed. A larger sample size, better detection methods, and annotated resources of CNV are needed to explore this relationship further.


INTRODUCTION
Posttraumatic stress disorder (PTSD) has a substantial genetic component [1]. Recent large investigations of PTSD genetics have focused on common genetic variation [2,3], but rare and structural forms of genetic variation are hypothesized to be important contributors to the development of psychiatric disorders [4]. Rare and structural variation have not received substantial empirical study in the context of PTSD [5]. However, these forms of variation have been examined more thoroughly in association with other psychiatric disorders, where many investigations have specifically focused on copy number variants (CNVs) [6]. CNV associations have been identified for attention-deficit/ hyperactivity disorder (ADHD) [7], autism spectrum disorder (ASD) [8], depression [9,10], obsessive-compulsive disorder [11], and schizophrenia [12]. Many of the identified psychiatric associations involved specific CNVs that have been implicated in neurodevelopmental disorders (NDD) [9,10,13], but also the cumulative burden of CNVs across the genome and enrichment over specific pathways related to the brain and development of the nervous system [12]. Largely owing to lack of available data, there has been no major reported investigation of CNVs and PTSD. However, the recent availability of large sample size PTSD genetic data [2] and available techniques to leverage this data to identify CNVs [14], means that it is now possible to investigate the association between PTSD and CNV burden with an unprecedented level of discovery power.
We present an association study between CNVs and PTSD, conducted in a sample of 114,383 (13,036 cases and 101,347 controls) European ancestry participants from the Psychiatric Genomics Consortium-PTSD [2,15]. We detected rare (<1% population frequency) CNVs using algorithms [16][17][18] applied to the SNP genotyping array intensity data. Following this, we examined the impact of CNV on PTSD on multiple scales: genomewide CNV burden, enrichment over 46 neuropsychiatric gene-sets [15], CNV burden on individual genes, and CNV carrier status over 53 previously implicated NDD CNV regions [9]. We conclude by comparing the risk contribution from CNVs to the contribution of common variant polygenic risk scores (PRSs).

Participants and phenotyping
The study sample consisted of 114,383 (13,036 cases and 101,347 controls) participants across 20 cohorts from the Psychiatric Genomics Consortium -PTSD freeze 2 data collection. The Psychiatric Genomics Consortium for PTSD is a large scale international effort to investigate genomic contributions to PTSD via meta-analysis of diverse cohorts [2]. For a given PGC-PTSD freeze 2 cohort to be included in this investigation, genotype intensity data had to be available, so that CNV calling could be performed. To reduce the potential for population stratification, we only included subjects of genetically determined [2] European ancestry, the largest homogeneous subset of the data. Within each cohort, participants were assessed for PTSD using either clinical assessment, clinician administered inventory, or self-reported inventory (Supplementary Table 1). Methods of PTSD assessment varied across cohort, and included the BSSS, CAPS, DEQ, IES, NSA, NWS, PCL, PSS, SCID, TSQ, and WMH-CIDI. All cohorts provided a PTSD case/control status variable as determined from standard criteria. Where applicable, PTSD symptom scores were computed for each inventory following inventory specific protocols for symptom scoring. All participants provided written informed consent, and studies were approved by the relevant institutional review boards and the University of California San Diego Human Research Protection Program.

CNV detection
DNA was extracted from blood samples. All details regarding DNA extraction and genotyping processes have been published [2]. Participants were genotyped using Illumina arrays (Supplementary Table 1), with the exception that the UK Biobank (UKBB) cohort, which used Axiom genotyping arrays (ThermoFisher). Illumina genotype platform data was self-clustered in Genome-Studio 2.0 and exported as intensity data inputs for CNV callers (SNP name, chromosome, position, allele 1, allele 2, B allele frequency, log R ratio, X, and Y). Affymetrix platform genotype data clustering methods have been described previously [9], and log R ratio and B allele frequency data were downloaded directly from the UKBB database. For Illumina data, CNVs were called using PennCNV [17] and iPattern [16]. For Affymetrix data, CNVs were called using PennCNV and QuantiSNP [18]. For PennCNV calling, the population frequency of B allele files were generated using the data itself. Waviness correction was applied using a GC content model file generated from UCSC gc_model data (https://genome.ucsc.edu/cgi-bin/hgTables). For the Hidden Markov Model input of PennCNV, the pre-supplied files were used: hhall.hmm for Illumina data and affygw6.hmm for the UKBB data (https://penncnv.openbioinformatics.org/en/latest/user-guide/input/#hmmfile). iPattern calls were made using the default program settings, in batches of up to 196 samples. Batches were selected such that samples within a batch were genotyped on the same plate or genotyped at approximately the same time. QuantiSNP calls were made with 10 iterations of the EM algorithm, where the characteristic length used to calculate transition probabilities was set to 2,000,000. GC based correction was performed using UCSC gc_model files.

CNV quality control
CNV were quality controlled according to the PGC CNV calling pipeline [12]. To ensure that the analysis included a reliable set of calls, CNV calls produced by the different calling algorithms were intersected to produce a consensus set. CNVs called as gain by one method and loss by the other were also excluded from further analyses. Fragmented large CNVs in a locus were annealed if the gap length between them was less than 30% of the overall length of the annealed CNV. CNV quality metrics calculated by PennCNV were used to perform sample QC. Subjects were removed if their values for SD of log R ratio, B allele frequency, or waviness were > = Q3 + 3IQR, if >20% of any chromosome was copy number variant (aneuploidy), or if they had excessive CNV count (≥Q3 + 3IQR CNVs) or KB burden (≥Q3 + 3 IQR megabases). Participants who failed standard genotype QC were removed: sample missingness rates > 2%, excess heterozygosity, mismatch between self-reported sex and genetically determined sex, π relatedness coefficient > 0.2. We removed CNVs for any of the following reasons: 50% overlap with centromeres, telomere, immunoglobulin or T-cell receptor loci, >50% overlap with known segmental duplications, CNV frequency >1% (measured within the data) in cases and controls and <10 kb in CNV length or intersecting <10 probes.

CNV burden calculation
CNV burden was measured and evaluated for association with PTSD in multiple ways: The cumulative burden of CNVs was calculated as the genome-wide total distance (in megabases) spanned by CNVs. For each of the 53 NDD CNV regions, NDD CNV carrier status was determined as having at least 50% of the NDD CNV region overlapped by CNV. As a sensitivity analysis, two different overlap criteria (>0% or 100% overlap) were also evaluated. For gene-level CNV burden, first gene positions (GRCh37 human genome build) were downloaded from the UCSC table browser (https://genome.ucsc.edu/cgi-bin/hgTables). Genes were filtered to protein coding genes, based on having an "NM_" accession prefix in the National Center for Biotechnology Information reference sequence database [19]. For genes with multiple isoforms, the minimum start and the maximum end positions were used. For each CNV, the CNV was mapped to all genes it overlapped by at least one base pair. The CNV burden variable was then calculated for each gene, coded 1 if the subject carried a CNV that mapped onto the gene, and 0 otherwise. For gene-set analysis, a gene-set burden variable was calculated for each set tested, coded as the number of genes within the set overlapped by the CNVs. The gene-set analysis included 53 gene-sets, consisting of 23 gene-sets related to neurofunction or nervous system, 6 brain expression from BrainSpan consortium and 7 mouse phenotype negative control gene-sets from previous neurological disorders studies [12,20], a set of loss-of-function intolerant genes as defined by gnomAD v2.0 [21], and 16 brain-expressed gene-sets from human neocortex scRNA data [22]. A list of genes belonging to each set is included in Supplementary Table 2.

Statistical analyses
A two stage meta-analytic approach was conducted, where analyses were performed within each cohort separately then results were combined via meta-analysis. As all subjects belonging to a given cohort were genotyped using the same platform, this analysis was effectively performed stratified by platform, thus accounting for potential confounding due to CNV calling across platforms. Within each cohort, the association between PTSD and CNVs was tested using a regression model of PTSD as predicted by the CNV variable, five principal components calculated from genotype call data using Eigenstrat 6.0.2 [2] [23], and the log R ratio standard deviation sample quality metric from PennCNV. For the gene-set analyses, in order to follow the enrichment test model outlined by Raychaudhuri et al. [24] analyses also contained predictors for genome-wide total CNV count and genome-wide average length of CNVs. Cohorts with continuous PTSD symptom measures were analyzed using linear regression and cohorts with only case/control phenotypes were analyzed using logistic regression. Results across cohorts were meta-analyzed using fixed effects inverse variance weighted meta-analysis in the metafor [25] R package. For the meta-analysis, to account for the different PTSD measure scales used across cohorts, PTSD measures were scaled from 0 to 1 according to the theoretical range of scores of the assessment method (i.e., 0 = no PTSD symptoms, 1 = theoretical maximum possible PTSD symptoms), and case/ control estimates were interpreted as being the observed, censored variable for a latent symptom measure variable. Statistical significance was declared based on Benjamini-Hochberg false discovery rate (FDR) A.X. Maihofer et al.
q value < 0.05 calculated within a family of tests. To enhance interpretability of results, we also provide odds ratio effect estimates, via analyzing cohorts with continuous data using an ordinal logistic regression. For this analysis, odds ratio estimates were directly meta-analyzed across studies (i.e., not rescaled) using inverse-variance weighted meta-analysis. The statistical inferences made in this manuscript are however based only on the linear regression based results. To examine if outliers strongly contributed to the results of analyses of the 16p11.2 deletion and 2q13 deletion CNVs, linear regression was also performed using heteroskedasticity consistent (HC3) standard errors [26].
We estimated PRS for PTSD in all participants. SNP weights were obtained from the Million Veteran Program PTSD GWAS [3] of European ancestry participants, with weights adjusted using PRS-CS [27] under default parameters, with 1000 Genomes Phase 3 European data [28] used to model linkage disequilibrium. SNPs were filtered to common (minor allele frequency > 1%), strand unambiguous variants. PRS were computed as the weighted sum of risk alleles at each markers using the -score option in PLINK [29]. PRS were standardized to mean zero and unit standard deviation, such that the effects reported refer to PTSD risk relative to every unit standard deviation increase in PRS. The proportion of variance in PTSD explained by PRS and CNV was estimated as the difference in model r-squared values between a baseline model that included all relevant covariates and the model with additional terms for PRS and CNV. Standard errors for the proportion of variance explained were calculated using the formulae from Cohen et al. [30].

RESULTS
The PTSD CNV meta-analysis included 114,383 participants (13,036 cases and 101,347 controls) of European ancestry across 20 cohorts (Supplementary Table 1, Table 1). The method of PTSD assessment varied across cohorts (11 distinct methods), with most participants being assessed via a version of the PCL (N = 106,353). The majority of subjects (N = 113,320, 99%) were analyzed using PTSD symptom scores, the remaining subjects were analyzed using case/control status. 15  The median length of CNVs was 122,756 BP (range=10,000 to 9,911,819 BP) ( Supplementary Fig. 1). 60.1% of participants were carriers of at least one CNV (Table 1). Among CNV carriers, the average total span of CNV carried was 0.32 megabases (SD = 0.35), and the average of within subject average CNV lengths was 0.23 megabases (SD = 0.26).

Gene and gene-set analyses
We examined CNV association on the level of protein coding genes. 2880 genes harbored CNV with at least 0.01% frequency. We found that no gene was significant after multiple comparisons correction for the number of genes, in any strata (overall CNV, duplications, or deletions) (Supplementary Table 5). Following this we examined if CNV burden association with PTSD was enriched in any of 46 different gene-sets related to the brain and nervous system and 7 control gene-sets of non-brain tissue types. No control gene-set was significant. In contrast, 22 out of 46 sets related to the brain and nervous system were enriched in deletions (FDR q < 0.05) (Supplementary Table 6). Many of the top ranked genes in the significant sets overlapped with NDD CNV regions (Supplementary Table 7). As a sensitivity analysis, we removed CNVs overlapping or nearby (within 1 million base pairs) the NDD CNV regions (Supplementary Table 8), finding that no gene-set remained significant after this adjustment (all FDR q > 0.05).

DISCUSSION
We identified an association between the cumulative burden of CNVs and PTSD, which was largely driven by CNVs overlapping previously implicated NDD CNV regions. Two recent studies of CNVs in major depression [9,10] also reported associations with NDD CNV burden, with effect sizes comparable to ours. The modest to moderate effect sizes observed are consistent with PTSD and MDD being disorders with less severity of cognitive impairment, comparatively moderate heritability and a larger  environmental component. In terms of how CNV burden modifies depression risk, Kendall et al. [9] suggested that the majority of the total effect came from the direct effects of CNVs, with some evidence of additional mediated effects stemming from sociodemographic risk factors including physical health, smoking, alcohol consumption, educational attainment, and social deprivation. As PTSD has similar risk factors [31], NDD CNVs may influence PTSD risk via the same mediated mechanisms. We propose that some of the psychiatric and neurodevelopmental consequences of CNVs may also increase PTSD risk, as they represent PTSD risk factors [32] [33].
In examining the individual NDD CNVs, we observed a significant association of PTSD with the 15q11.2 BP1-BP2 microdeletion, one of the most frequently occurring pathogenic CNVs identified in humans [34]. This CNV is associated with alterations in brain morphology and cognition [35]. There is a wide variety of possible clinical manifestations, including developmental delays, intellectual disability, as well as behavioral and psychiatric problems, including ADHD, ASD and schizophrenia [36]. Under a less strict definition of NDD carrier status (>0% overlap with NDD CNV region), the 22q11.2 duplication region and 8p23.1 deletion regions were significant. The 22q11.2 duplication has a variety of deleterious impacts [37], but generally they are less severe than those observed in the 22q11.2 deletion [38]. The 8p23.1 deletion is associated with developmental delays, hyperactivity, and impulsivity [39]. Rather than any specific functional aspects of these CNVs having led to the significant associations that we observed, we suspect that their relatively high frequencies in the data made them among the most statistically powered to identify.
Pathway specific enrichment of brain regions and neurodevelopmental gene-sets has consistently been observed in genetic studies of psychiatric disorders [3,[40][41][42]. We have identified significant associations with several biological pathways related to the development of the brain and nervous system. Our pathway analysis was not significant once we removed the CNV overlapping NDD regions, possibly suggesting an outsized or central role of genes in NDD regions relative to other genes within the pathways. Genes in NDD regions are known affect the development of the brain and nervous system, likely through the disruption of core molecular pathways [43] [44].
The regions we have identified as significant in CNV analyses have not been implicated in GWAS of PTSD. These regions may represent a distinct element of the genetic contribution to PTSD risk that is not readily identified by common variant analyses, suggesting that rare variation analysis complements common variant analysis, as has been hypothesized for psychiatric phenotypes [4]. The effects of implicated CNVs were modest in magnitude, albeit higher than reported common variant effects, consistent with the hypothesis that rare variants have stronger effects than common ones [45].
In terms of population risk prediction, due to the limited number of CNV carriers, CNV burden predicted substantially less total variation in PTSD than PRS. The utility of determining carrier status, rather than population level prediction, is that CNV carriers may be a subset of individuals for whom a tailored health management strategy [46] applies. Indeed, CNV carrier status has been proposed as a tool in clinical decision making for psychiatric disorders, albeit one that will first require expansion of the clinical knowledge base of CNVs [47]. But it is unclear how much this will apply directly to PTSD, as we did not identify any highly penetrant CNVs.

Limitations
We focused only on rare (<1% frequency) CNVs larger than 10 kilobases in length due to the detection limits of array based CNV calling. However, small CNVs may have clinical importance [48,49]. Future investigation of the relationship between small CNVs and PTSD will likely require sequencing data, as the dense genotyping allows for the determination of CNV at a higher resolution than SNP genotyping arrays [50]. Thus we expect that CNV investigations will emerge as sequencing data becomes available from biobank resources [51]. We were unable to assess the impact of de novo CNV specifically, which would require caseparent trio data to identify. Yet, de novo variation is an important form of risk to investigate, as it occurs more often in cases than controls for ADHD, ASD, and schizophrenia [52]. PTSD genetic studies usually do not gather parent genotype data, implying that new data would need to be gathered in order to study this. We note that several of the cohorts investigated were from specially selected populations. The UKBB is known to be healthier than the general population of the United Kingdom [53]. As well, we analyzed several military populations, where good physical and mental health are required for enlistment. Due to carriers not having been selected for health reasons consequential to their carrier status, our study may have incorrectly estimated (or outright not detected) some effects of CNV on PTSD. Indeed, this may be why we specifically identified the 15q11.2 BP1-BP2 deletion and 22q11.2 duplication: As these CNVs have relatively milder impacts compared to some CNVs [54] [38], more seemingly unaffected carriers would exist in the investigated cohorts. We did not identify any particular genes where the presence of CNVs had a significant association with PTSD. The limited statistical power of low frequency variation [55] perhaps inhibited our ability to detect these genes. Therefore, we hypothesize that specific gene associations will emerge given greater sample sizes or analytic techniques more suited for this form of data, especially as we had positively identified specific gene-sets. We only tested for enrichment of gene sets related to the brain and nervous system, however, CNV may act on other relevant pathways; CNV are thought to have widespread phenotypic effects, such as on the immune system [56], which is also deeply implicated in PTSD development [57]. We did not examine non-European ancestry populations owing to insufficient sample sizes, but there is a clear need to include them in genetic research studies [58]. Collection of such samples is an ongoing aim of the PGC-PTSD [2].

CONCLUSIONS
We have performed, to our knowledge, the largest (N = 114,383 participants) investigation of the influence of CNV burden on PTSD risk, and furthermore, are the first to identify significant associations. Risk was enriched in regions that crossed over known NDD regions. In particular, we have implicated the 15q11.2 BP1-BP2 microdeletion. Larger sample size data, better detection methods, and annotated resources of CNV are necessary to explore these relationships further.

CODE AVAILABILITY
Code is available from https://github.com/nievergeltlab/cnv_freeze1 or by request to the corresponding author.