Introduction

Neuroticism is one of the main human personality traits1. It characterizes the disposition to experience negative affects, including anger, anxiety, self-consciousness, irritability, emotional instability, and depression1. Persons with elevated levels of neuroticism respond poorly to environmental stress, interpret ordinary situations as threatening, and can experience minor frustrations as hopelessly overwhelming1,2. Neuroticism is a trait, which is relatively stable across the life span3, and a heritable personality trait4, which is an important risk factor for psychiatric disorders5,6. The strong genetic correlation between neuroticism and mental health7,8,9 suggests that neuroticism may represent an intermediate phenotype, which is influenced by risk genes more directly than psychiatric diseases10. This implies that exploring the genetic contribution to differences in neuroticism can help understand the genetic architecture of psychiatric disorders.

Until recently, just a few GWAS loci for neuroticism were identified11,12,13. Several possible candidate genes have been suggested at these loci, including GRIK3, KLHL2, CRHR1, MAPT, CELF4, CADM2, LINGO2 and EP30014. Some of these genes have been reported to be associated with depression15,16, autism17,18, Parkinson’s disease19,20, schizophrenia21,22 and other psychiatric disorders.

Great progress in the genetic dissection of neuroticism was made in 2018, when 17016 and 1168 independent SNPs associated with neuroticism were identified using large samples of 449,484 and 329,821 people, respectively. These SNPs marked 157 loci, 64 of which were identified in both studies. The majority of the lead SNPs were located in intronic and intergenic regions, and only a few SNPs were coding. With the help of positional mapping, Nagel et al.16 identified 283 genes potentially influencing neuroticism and located at independently associated loci. The list of genes for neuroticism was expanded using genome-wide gene-based association and expression quantitative trait locus (eQTL) analyses, and chromatin interaction mapping. In total, 599 genes were included in this list and 50 of them were identified by all four methods. It was shown that these genes were predominantly expressed in six brain tissue types and were associated with seven Gene Ontology (GO) gene sets including neurogenesis, neuron differentiation, behavioral response to cocaine, and axon part16.

In our study, we focused on further identification of genes that are associated with neuroticism and on their detailed functional analysis. We concentrated on the genes whose variants are directly responsible for differences in neuroticism level between individuals. Protein-coding variants can modify the structure of the corresponding proteins due to amino acid substitutions or alternative splicing. The variants in the non-coding intragenic regions can regulate transcription and translation of these genes, protein complex formation or posttranslational modifications23,24. Such modifications may tilt the physiological balance from healthy to diseased state, resulting, for example, in bipolar affective disorder or Alzheimer's disease25. It has also been shown that the exome harbors > 98% of known pathogenic Mendelian variants26. Therefore, the prior probability of rare variants being causal is expected to be higher for exomic than intergenic variants. This prior probability, in turn, is known to dramatically affect the false positive report probability in association analysis27. Altogether, within-gene association signals are more interpretable than between-gene signals and less probable to be false-positives.

Using summary statistics obtained from UK Biobank data, we performed a gene-based association analysis investigating several sets of genetic variants that differ by their protein-coding properties. We used a procedure that we called ‘polygene pruning’ to guard against the influence of strong GWAS signals outside the gene. As a result, we have identified 190 genes associated with neuroticism, both known and new. We performed a bioinformatics analysis to compare the biological functions of the known genes confirmed and non-confirmed in our study, and the biological functions of the known and new genes, and to check if the identified genes can be considered as true gene candidates for neuroticism.

Results

The gene-based analysis of neuroticism

The strategy of our study and the main results are shown in Fig. 1. The Manhattan plot for the association signals obtained before and after polygene pruning is presented in Fig. 2a. The genes identified under the different scenarios after polygene pruning are shown in Supplementary Tables 14. The full list of 190 genes identified under at least one scenario is presented in Supplementary Table 5.

Figure 1
figure 1

Workflow schematic. GWAS summary statistics and correlations between genotypes were used as input data. Each set of SNPs (all, non-coding, exonic and nonsynonymous) was analyzed separately. The first step of our study is the gene-based association analyses performed using the SKAT-O, PCA, ACAT-V methods and their results in combination. Next, to guard against the influence of strong GWAS signals outside the gene, we performed polygene pruning for genes with a p-value < 2.5 × 10–5 obtained at least by one of the gene-based methods. After pruning, we repeated the gene-based analysis using the remaining SNPs. The genes identified using each set of SNPs were subdivided into known and new. All known genes confirmed in our study were compared with the rest of the known genes for their functional properties. We also compared the new and known genes identified in our study. For all identified genes, we performed an enrichment analysis; for the genes identified using protein-coding SNPs, a functional analysis of protein-coding variants was performed; for the genes identified using non-coding SNPs, SMR/HEIDI analysis was performed. Bold blue arrows point to the gene groups being compared and gene groups being under in silico analyses.

Figure 2
figure 2

The results of gene-based association analysis. (a) Manhattan plot showing the –log10 transformed p-value of each gene. (b) Proportion of genes identified by gene-based association analysis only (1), both gene-based association analysis and position mapping (2) and positional mapping only (3) among confirmed (light gray) and non-confirmed (dark gray) genes. (c) Venn diagram showing the overlap of known genes identified using different sets of SNPs. (d) Venn diagram showing the overlap of new genes identified using different sets of SNPs. (e) Proportion of genes identified using non-coding (dark gray) and protein-coding SNP sets (light gray) in the new and known genes.

Comparison with known genes

From among 599 previously suggested neuroticism genes from Supplementary Table 15 in a work by Nagel et al.16, we took a subset including protein-coding genes whose effect on neuroticism can be explained by the SNPs located within these genes. Such genes had been previously identified by gene-based association analysis or/and positional mapping. We added two genes identified by eQTL analysis, PLCL2 and ARHGAP15, because their expression was controlled by SNPs located within them. We defined this subset of 432 genes as a list of known genes (Supplementary Table 6).

From among 432 known genes, 152 overlapped with genes identified in our study and 280 did not. From among 280 non-overlapping genes, 190 showed the gene-based association with neuroticism before polygene pruning and lost association after it. From among the overlapping genes, two thirds were included in the list of the known genes by both gene-based analysis and positional mapping, while the proportion of non-overlapping genes was one-third (Fig. 2b). The reverse was observed for the proportion of genes included in the list by positional mapping (Fig. 2b).

Then we compared the known and new genes identified in our study. Among 152 known genes, eight were identified using protein-coding SNPs; 140, using non-coding variants (three were identified using both coding and non-coding SNP sets); and seven genes using only the all-variants scenario (Figs. 1 and 2c,e). From among 38 new genes, 15 were identified using protein-coding SNP sets and 24, using non-coding variants (one gene was identified using both coding and non-coding SNP sets) (Figs. 1 and 2d,e).

We have identified 23 genes using protein-coding SNP sets. Three of them showed association when only nonsynonymous SNPs were used. Nine were associated when using both nonsynonymous and exonic variation, with seven of them having the lowest p-values when using nonsynonymous variation only (Supplementary Tables 3 and 4).

Enrichment analysis of 190 identified genes

Analysis of differentially expressed genes showed a strong enrichment of genes expressed in the different brain structures (Fig. 3a and Supplementary Table 7). The most significant enrichment was demonstrated for the following brain tissues: the hypothalamus, cerebellar hemisphere, cerebellum, frontal cortex (BA9), cortex, anterior cingulate cortex (BA24) and nucleus accumbens.

Figure 3
figure 3

Enrichment of 190 genes identified by gene-based association analysis in different gene sets. (a) Tissue enrichment for differential expression. (b) Enrichment in GO biological processes. (c) Enrichment in GO cell components. (d) Representation of genes in the GWAS catalog.

To assess whether the identified genes converge on shared biological functions or pathways, we used FUMA for an enrichment test in the GO biological processes and GO cell components. The gene sets with a q-value ≤ 0.05 are shown in Supplementary Table 8. We detected 58 significantly enriched GO terms: 36 in GO biological processes and 22 in GO cellular components. The top 15 results are shown in Fig. 3b and c, respectively. Among the biological processes with the most significant enrichment were synapse organization, cell morphogenesis involved in neuron differentiation, synapse assembly, neuron migration, regulation of synapse structure or activity, axon development, neuron development and neurogenesis. For cellular component categories, the most significant findings were those for postsynapse, synapse, neuron-to-neuron synapse, synapse part, neuron part and neuron projection.

We estimated the overrepresentation of the identified genes in the GWAS catalog. The most represented gene sets were associated with psychiatric, cognitive and behavioral traits such as neuroticism, general factor of neuroticism, autism spectrum disorder, schizophrenia, cognitive ability, mood instability, feeling worry, and depression (Fig. 3d).

Functional analysis of protein coding variants

For 23 genes associated with neuroticism when using protein-coding SNP sets, we selected 76 SNPs with a p-value < 0.05 (Supplementary Table 9). The full description of the effect predicted for SNPs by VEP and FATHMM-XF is presented in Supplementary Tables 10 and 11. Eighteen SNPs in 16 genes were identified as potentially pathogenic variants in accordance with at least one algorithm (Supplementary Table 12). One variant, rs9267835, was defined as pathogenic by all four algorithms.

SMR and HEIDI analyses of non-coding variants

SMR analysis followed by the HEIDI test provided strong evidence for the pleiotropy of the SNPs within fourteen genes with the expression of 25 unique genes in two tissues: whole blood and peripheral blood (Supplementary Table 13). For eight genes, we detected significant pleiotropic signals with the expression of more than one gene at the same locus. For the SNPs within six genes tagged by intronic variants rs75614054, rs11975, rs10896636, rs10005233, rs240769 and rs3996325, we found evidence of pleiotropy with the expression of the same genes where these SNPs were located (PTCH1, FAM120A, ZDHHC5, SNCA, ASCC3 and MAD1L1, respectively). For the SNPs in the other eight genes, we found evidence of pleiotropy with the transcription of adjacent genes.

Discussion

In the present study, we have proposed and applied a new approach for detection of genes associated with neuroticism. We have identified 190 genes associated with neuroticism due to the effect of within-gene variants rather than strong GWAS signals outside the gene. Thirty eight of these genes were new.

Our approach is different from those previously applied to study neuroticism in three distinctive ways. First, we used gene-based association analysis to identify neuroticism genes. Due to the simultaneous consideration of a set of genetic variants within a gene, gene-based association analysis has increased power for identification of common and rare genetic variants28,29. Given that, we have identified 38 new neuroticism genes.

Secondly, we used several different sets of SNPs representing different hypotheses of the genetic variants function mediating the association. This allowed us to distinguish two slightly overlapping groups among the associated genes. One group included the genes whose association is the consequence of the changes in their regulatory properties. The other group included the genes whose association is the consequence of the changes in the structure of the proteins these genes encode.

Finally, we used the polygene pruning procedure. The aim of using this procedure was to reduce the influence of strong association signals located outside the gene. Initially, we identified 459 genes and only 190 remained in the list of the neuroticism genes after polygene pruning. Therefore, we can conclude that the high proportion of gene-based association signals initially obtained in our study was inflated by strong GWAS signals outside the genes. We suggested that this conclusion can be applied, to some extent, to the 432 known neuroticism genes. If so, the genes that were confirmed in our study can be expected to differ in their functions from the genes that were not confirmed. We compared the enrichment of these two groups of genes in GO biological processes and cellular components and found differences between these groups. For GO biological processes, the top results for the confirmed genes included the processes occurring in the nervous system (Supplementary Table 8), while for the non-confirmed ones, the top results were obtained for less specialized processes, such as protein DNA complex subunit organization, chromatin assembly or disassembly, nucleosome organization, cellular protein containing complex assembly chromosome and chromatin organization (Supplementary Table 14). Similar results were obtained using GO cellular components (Supplementary Tables 8 and 14).

The simplest way to refine the initial list of genes would be to restrict it to the genes that are closest to the top GWAS signals. The closest genes are usually believed to have higher chance of being causal (though not guaranteed to be so)30. The question arises: how does the list of genes remaining after polygene pruning relate to the list of genes that are closest to the top signal at each locus defined by Nagel et al.16? The proportion of such closest genes among the genes identified in our study increased from 0.24 to 0.43 after polygene pruning. About 73% of the closest genes successfully passed the procedure. This proportion of the remaining, more distant genes was 31%. These results indirectly confirm the effectiveness of polygene pruning and have led us to conclude that the known genes confirmed in our study are more appropriate candidate genes than the non-confirmed ones.

Among the genes we have identified were both known and new. These two groups differed in the proportion of genes identified using protein-coding SNPs: 39% (15/38) and 5% (8/152) for the new and the known genes, respectively. This difference can be explained by a high impact of GWAS in the search for common genetic variants. The common variants identified by this method are more frequent at intergenic and intronic loci than at protein-coding ones31. Due to it, only a low proportion of the known genes was confirmed in our study using coding SNPs. Four (CRHR1, NOTCH4, NOS1 and AGBL1) of eight such genes included common variants with a p-value < 5 × 10–8. Only three (C12orf49, RPP21 and TRIM39-RPP21) of the 15 new genes included such SNPs. The same situation was observed for the genes identified using non-coding SNPs: 72 of 137 known genes and only one of 23 new genes included significantly associated common variants. This confirms that a significant number of the known genes was identified due to strong association signals of common variants. The proportion of the new genes with such variants is rather low. The majority of the new genes have been mentioned as genes associated with intelligence, education attainment, cognitive performance, well-being spectrum, alcohol dependence, worry, anorexia nervosa, autism spectrum disorder, schizophrenia, depression and Parkinson’s disease (Supplementary Table 15). Taking into account a high genetic correlation of neuroticism with mental health8,16,32,33, the new genes identified in our study appear to be reasonable candidates for neuroticism genes. Therefore, both known and new genes identified in our study can be considered as appropriate candidate genes.

For the genes identified in this study, we obtained additional information. For all these genes, we confirmed the independence of the association signal from the effect of the outside SNPs. We found 23 genes associated with neuroticism when using protein-coding SNPs. Three of them, ANKS1B, DEPTOR and PANK4, were identified using only the set of nonsynonymous variants; and nine genes, using both exonic and nonsynonymous variants. Interestingly, for seven of these nine genes, C12orf49, Me3, MUC22, MYO15A, RPP21, SLC25A37 and TRIM39-RPP21, the association signals obtained using nonsynonymous variants were higher than using all exonic variants. The VEP and FATHMM-XF methods detected 18 SNPs in 16 genes as potentially pathogenic variants in accordance with at least one algorithm (Supplementary Table 12). For three of these SNPs (rs10507274, rs35755513, and rs6986), an association with neuroticism had been detected previously8,16,33,34,35. Several variants have been described as being associated with different behavioral or psychiatric traits such as smoking initiation36, worry16, irritable mood33, feeling guilty33 and depression16 (Supplementary Table 16). These results can explain the association of some of the identified genes with neuroticism by the contribution of the potentially pathogenic variants. For 21 of 23 genes identified using protein-coding SNPs, we found indications of their association with behavioral, cognitive, social or psychiatric traits in the GWAS catalog (Supplementary Table 17). Seven genes, CYP21A1, MUC22, NOTCH4, RPP21, TAP2, TRIM31 and TRIM39-RPP21, were located at the depression loci described by Wray et al.37 (Supplementary Table 18).

Among 167 genes associated with neuroticism using non-coding SNPs, we have found six genes (PTCH1, FAM120A, ZDHHC5, SNCA, ASCC3 and MAD1L1), whose within-gene variation may affect the expression of the same genes. For the other eight genes (MAPT, YLPM1, STAG1, ARHGAP27, SETD1A, CLUH, ZCCHC14 and GGT7), we showed the pleiotropy of their within-gene variants with the transcription of the adjacent genes (Supplementary Table 13).

In conclusion, we have identified 190 genes associated with neuroticism, of which 38 are new. We have demonstrated that the genes confirmed in our study are more appropriate gene candidates for neuroticism than the non-confirmed known genes. The majority of the new genes are known as associated with behavioral, cognitive, social and psychiatric traits. A substantial number of the new genes is represented by the genes identified using the protein-coding SNPs. Many of these genes include potentially pathogenic variants.

Materials and methods

Summary statistics

For analysis of neuroticism, we used freely available summary statistics (p-values, betas) instead of individual phenotypes and genotypes. Summary statistics were obtained for 10,847,151 SNPs from a sample of 380,506 participants of the UK Biobank project (https://ctg.cncr.nl/software/summary_statistics). SNPs with MAF < 0.001 and INFO < 0.8 were excluded. Neuroticism levels were measured using the Eysenck Personality Questionnaire, Revised Short Form38, consisting of 12 dichotomous items (0, 1). The quantitative trait was defined as a sum of 12 items (for details, see Nagel et al.33).

Gene-based association analysis

The gene-based association analysis was performed using three methods: SKAT-O39, PCA40 and ACAT-V41 implemented in the sumFREGAT package42. Then the results of these methods were combined using ACAT-O41. See Supplementary Methods for a brief description of the methods.

Correlations between the genotypes of variants within each gene were estimated using the genotypes of about 265 thousand UK Biobank participants.

The results of the genome-wide gene-based association analysis are freely available in the ZENODO database (https://zenodo.org/record/3888340#.XuDTQGlS_IU).

Regions of interest

SNPs were annotated to genes based on dbSNP version 135 SNP locations and NCBI 37.3 gene definitions. We restricted our gene-based analysis to protein-coding genes. SNPs were annotated to a gene if they were located between its transcription start and stop sites. We used 1000 Genomes functional annotations for genetic variants to determine nonsynonymous, coding and non-coding variants.

We considered four scenarios differing from one another by a set of SNPs whose combined effect was tested by the gene-based association analysis:

  1. 1.

    all SNPs located between the transcription start and stop positions of a gene;

  2. 2.

    SNPs defined as non-coding (introns, 5′UTR and 3′UTR);

  3. 3.

    SNPs defined as exonic (coding);

  4. 4.

    nonsynonymous substitutions.

Two last sets belong to the protein-coding sets.

Polygene pruning

The purpose of this procedure was to reduce the influence of association signals located outside the region of interest by excluding SNPs that are in high LD with more significant SNPs outside the region. Here we call this exclusion procedure ‘polygene pruning’ to distinguish it from the classical LD pruning and to emphasize the polygenic nature of association signals of excluded SNPs. See Supplementary Methods for a detailed description of the procedure.

We performed polygene pruning only for genes with a p-value < 2.5 × 10–5 obtained by at least one of the gene-based methods.

After polygene pruning and re-analysis, genes were considered as identified at a p-value < 2.5 × 10–5 in a combined ACAT-O test.

Coding variant effect prediction

For this analysis, we considered only protein-coding SNPs with a p-value < 0.05. We used the FATHMM-XF method43 and three prediction methods (SIFT, PolyPhen and CADD) implemented in the Ensembl Variant Effect Predictor (VEP) tool44 to predict the impact of coding SNPs. We defined the status of a variant as pathogenic if at least one of the methods indicated this status: the SIFT p-value < 0.05 (the corresponding qualitative evaluation is “deleterious/deleterious low confidence”), the PolyPhen p-value > 0.5 (“possibly/probably damaging”), the FATHMM-XF p-value > 0.5 (“pathogenic”) and the CADD phred-like score > 20.

Tissue specificity and gene set enrichment analysis

Gene set enrichment and tissue analyses were carried out with GENE2FUNC implemented in FUMA45 (http://fuma.ctglab.nl/). We used genes identified in the gene-based analysis after polygene pruning as input data.

Enrichment of the identified genes in biological pathways and functional categories was tested against the gene sets obtained from the Gene Ontology (GO) Biological Process and GO Cellular Component. In addition, we tested sets of GWAS catalog-reported genes. Hypergeometric tests were performed to check if the genes of interest were overrepresented in any of the pre-defined gene sets. In this type of analysis, we used FDR (Benjamini–Hochberg method) for multiple testing correction as was recommended by FUMA. Statistical significance was determined at a q-value < 0.05.

The GTEx v8 54 tissue type data set was used for tissue specificity analysis. A set of the input genes was tested against each of the sets of differentially expressed genes using a hypergeometric test and Bonferroni multiple testing correction. Statistical significance was determined at an adjusted p-value < 0.05.

SMR/HEIDI analysis

Summary data-based Mendelian Randomization (SMR) analysis followed by the Heterogeneity in Dependent Instruments (HEIDI) test46 was used to study potential pleiotropic effects of genes identified using non-coding variants on gene expression levels in different tissues.

SMR analysis provides evidence for pleiotropy (the same locus is associated with two or more traits). It cannot define whether traits in a pair are affected by the same underlying causal variant, but this is accomplished by a HEIDI test, which distinguishes pleiotropy from linkage disequilibrium. It should be noted that SMR/HEIDI analysis does not identify which allele is causal, nor can it distinguish pleiotropy from causation.

Summary statistics for gene expression levels were obtained from Westra Blood eQTL (peripheral blood, http://cnsgenomics.com/software/smr/#eQTLsummarydata)47 and the GTEx version 7 database (53 tissues, https://gtexportal.org)48. We used only blood and tissues related to the brain (14 tissues from GTEx v7). The full list of tissues used in SMR/HEIDI analysis can be found in Supplementary Table 19.

We used the SMR/HEIDI test46 as implemented in the GWAS-MAP platform49. See Supplementary Methods for details.