## Introduction

Major depressive disorder (MDD) is a chronic illness that affects 350 million people worldwide according to an estimate by the World Health Organization (WHO); it is characterized by depressed mood, diminished interest, impaired cognitive function, and somatic symptoms, such as disturbed sleep or appetite. The aetiology of MDD is multifactorial, with a heritability estimated to be approximately 35%1,2. It is generally recognized that MDD is a common illness involving multiple common genetic variants with small to moderate effect sizes3. Indeed, several large cohort-based genome-wide association studies (GWASs) in recent years have identified signals which shed new light on our understanding of MDD, for example, implicating the presynaptic protein piccolo and the alpha-1 subunit of a voltage-dependent calcium channel in the pathogenesis of MDD4,5 and indicating shared genetic risk for MDD, bipolar disorder and schizophrenia6,7. However, current studies still fall far short of accounting for all of the genetic variation in MDD with robust replicated findings. One possible reason is that the majority of these studies chose a dichotomous phenotype such as diagnosis as their outcome measure, with the currently limited understanding of the disorder leading to heterogeneity in diagnostic ascertainment8. Of interest, using a polygenic risk score (PRS) for schizophrenia (SCZ), Whalley et al. (2016) identified a subgroup of patients with MDD that had a higher polygenic risk of SCZ than others; this subgroup of MDD patients also showed an attenuated level of distress and neuroticism9. Instead of a case-control design, some studies choose quantitative traits (QTs) related to illness to increase the power of the analysis. Quantitative variables have a higher information content than categorical variables; association studies using QTs can therefore increase the statistical power four to eight-fold, with a resultant proportional reduction of the required sample size10. For example, one study used hippocampal atrophy measured by MRI as a QT for Alzheimer’s disease in a GWAS of only moderate sample size and nonetheless identified novel candidate loci attaining genome-wide significance11,12.

Different kinds of studies have long indicated that anhedonia is a fundamental feature of MDD13,14. DSM-IV-TR defines anhedonia as diminished interest or pleasure in response to stimuli that were previously perceived as rewarding during a premorbid state15. Moreover, anhedonia has been shown to be able to predict a longer time to remission and fewer depression-free days16,17. Specifically, using the same dataset as in our present study, Uher et al. showed that out of the six disease dimensions (mood, anxiety, pessimism, interest-activity, sleep, and appetite), the interest-activity dimension (anhedonia) at baseline was the only dimension able to predict poor treatment outcome in the later time points17. Both twin and family studies demonstrate that 44% of anhedonia is attributable to genetic factors, especially additive genetic effects, and first-degree relatives of patients with MDD display anhedonia-related phenotypes when compared to controls18,19. Although different threads of evidence have validated anhedonia as a QT of MDD, no genetic or genomic study has yet been carried out to identify candidate loci associated with this key feature of MDD.

Our study used a dimensional score of anhedonia to conduct a GWAS and to estimate the heritability of this phenotype accounted for by common variants, aiming to shed new light on our understanding of MDD.

## Materials and methods

### Patient recruitment

Seven hundred and ninety-six people (296 males, 500 females) with unipolar depression of at least moderate severity according to ICD-10 (International Classification of Diseases, 10th revision, Mental and Behavioural Disorders, Research Criteria) and DSM-IV (Diagnostic and Statistical Manual of Mental Disorders, fourth edition) criteria20 were recruited from eight European countries in the GENDEP project21. All patients were of European ancestry without a family history of schizophrenia, bipolar disorder or a current dependency on alcohol or drugs. For further details about the GENDEP project, see Uher et al.21,22.

### Phenotype definition

Uher et al. conducted factor analysis of depression severity data generated from three measures: the Montgomery-Asberg Depression Scale (MADRS); the Hamilton Depression Rating Scale (HDRS) and the Beck Depression Inventory (BDI). Although these measures had previously been widely used in studies of depression, prior to GENDEP, no study had used all three measures simultaneously. Six dimensions with continuous factor scores representing the different aspects of the psychopathology of depression were extracted from initial questionnaire estimates23. Of these six dimensions, the interest-activity score at baseline (which had higher information loadings from items in the three measures relevant to anhedonia, such as “inability to feel”, “lassitude” in the MADRS; “sexual interest” in the HAMD-17; “enjoyment” and “interest in people” in the BDI) was found to significantly predict response to treatment with the antidepressants used in the study17. In this analysis, we used the baseline interest-activity score as our outcome measure for GWAS.

### DNA extraction and genotyping

DNA was extracted from blood samples collected in ethylenediaminetetraacetic acid (EDTA) tubes using standard procedures24; genotyping was performed in the Centre National de Génotypage using the Illumina Human610-quad bead chip (Illumina) as described25.

### Genotype quality control and population stratification analysis

Standard steps were taken for quality control of genomic data in PLINK 1.0926 and data were excluded on failure to pass the following thresholds: consistency of gender information between genomic data and demographic data, SNP genotyping rate ≥ 95%, individual genotyping rate ≥ 97%, Hardy–Weinberg equilibrium test (P ≥ 0.001), minor allele frequency (MAF) ≥ 0.01. Furthermore, using both PLINK 1.09 and KING27, pairwise identity-by-state (IBS) was calculated and outliers or subjects showing unknown familial relationship with others (proportion IBD > 0.05) were subsequently excluded26,28.
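
The per-SNP part of these thresholds can be illustrated with a minimal Python sketch. It is not the PLINK implementation: the function name `snp_qc` and the genotype coding (0/1/2 allele counts, `None` for missing) are assumptions made for illustration, and the Hardy–Weinberg check uses a simple chi-square approximation rather than an exact test.

```python
import math


def snp_qc(genotypes, min_call_rate=0.95, min_maf=0.01, min_hwe_p=0.001):
    """Check one SNP against call-rate, MAF and HWE thresholds.

    genotypes: allele counts (0, 1 or 2) per individual, None if missing.
    Returns (passes, stats). Hypothetical helper, not part of PLINK.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return False, {"call_rate": 0.0, "maf": 0.0, "hwe_p": 1.0}

    call_rate = n / len(genotypes)
    observed = [called.count(0), called.count(1), called.count(2)]
    p = (observed[1] + 2 * observed[2]) / (2 * n)  # frequency of the counted allele
    maf = min(p, 1 - p)

    if maf == 0:
        hwe_p = 1.0  # monomorphic SNP: HWE test undefined, the MAF filter removes it
    else:
        expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
        chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        hwe_p = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail, 1 df

    passes = call_rate >= min_call_rate and maf >= min_maf and hwe_p >= min_hwe_p
    return passes, {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p}
```

The individual-level genotyping rate, sex check and relatedness filters listed above would need separate steps and are not covered by this sketch.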

### Population stratification analysis

Population stratification analysis was conducted using EIGENSTRAT29, which employs principal component analysis (PCA) to capture hidden population structure in genomic data. Prior to the analysis, data were pruned to make sure adjacent SNPs were in no more than weak linkage disequilibrium (LD) with each other (PLINK command: --indep-pairwise 50 10 0.5)30. This generated 20 principal components (PCs) which were controlled for as covariates in the subsequent association analysis.

### Imputation of missing genotypes using the 1000 Genomes dataset

Following quality control steps, imputation was carried out on genomic data. We employed IMPUTE2 + SHAPEIT2 to impute using the 1000 Genomes phase 3 dataset as the reference dataset31,32. Before imputation, the physical position of SNPs was updated using UCSC Liftover tool (https://genome.ucsc.edu/)33 to the haploid human genome build 19 (hg19). Following imputation, the same quality control steps were used to clean the resultant imputed data.

### GWAS using a linear mixed model (LMM)

In order to test for genotype–phenotype association while controlling for potential confounding factors such as population structure, family structure, and cryptic relatedness simultaneously, we used factored spectrally transformed LMM (FaST-LMM) for our association study34. In brief, the LMM log likelihood of the phenotype data, y (dimension n × 1; n denoting the cohort size), given fixed effects X (dimension n × d; d denoting the number of fixed effects in a single model, including the offset, the covariates, and the SNP to be tested), can be written as

$$LL\left( \delta_e^2,\, \delta_g^2,\, \boldsymbol{\beta} \right) = \log N\left( \boldsymbol{y} \mid \boldsymbol{X}\boldsymbol{\beta};\; \delta_g^2 \boldsymbol{K} + \delta_e^2 \boldsymbol{I} \right) \tag{1}$$

where N(r|m; Σ) denotes a normal distribution of variable r with mean m and covariance matrix Σ; K (dimension n × n) is the genetic similarity matrix; I is the identity matrix; $$\delta_e^2$$ is the magnitude of the residual variance; $$\delta_g^2$$ is the magnitude of the genetic variance; and β (dimension d × 1) denotes the weight of the fixed effects.

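The “Fa” in FaST-LMM stands for factorization. Let $$\boldsymbol{K} = \boldsymbol{U}\boldsymbol{S}\boldsymbol{U}^{T}$$ be the spectral decomposition of the genetic similarity matrix, where S is the diagonal matrix of eigenvalues, and let $$\delta = \delta_e^2/\delta_g^2$$. Once the data are rotated by $$\boldsymbol{U}^{T}$$, the covariance matrix of the normal distribution becomes proportional to the diagonal matrix S + δI, so the log likelihood can be rewritten as a sum over n terms. Factorization dramatically increases the size of datasets that can be analyzed with LMM, and additionally enhances the speed and feasibility of the analysis.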
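
To make the sum over n terms concrete, the rotated likelihood can be written out as follows. This is a sketch under two assumptions that are standard for FaST-LMM but not spelled out in Eq. (1): the spectral decomposition $$\boldsymbol{K} = \boldsymbol{U}\boldsymbol{S}\boldsymbol{U}^{T}$$ (U orthogonal, S diagonal) and the variance ratio $$\delta = \delta_e^2/\delta_g^2$$. Substituting both into Eq. (1) gives

$$LL\left( \delta,\, \delta_g^2,\, \boldsymbol{\beta} \right) = -\frac{1}{2}\left[ n\log\left( 2\pi\delta_g^2 \right) + \sum_{i=1}^{n}\log\left( [\boldsymbol{S}]_{ii} + \delta \right) + \frac{1}{\delta_g^2}\sum_{i=1}^{n}\frac{\left( [\boldsymbol{U}^{T}\boldsymbol{y}]_i - [\boldsymbol{U}^{T}\boldsymbol{X}]_{i:}\,\boldsymbol{\beta} \right)^2}{[\boldsymbol{S}]_{ii} + \delta} \right]$$

Each summand involves a single rotated observation, so once the decomposition has been computed the likelihood can be evaluated without inverting an n × n matrix for every tested SNP.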

In our analysis, we chose the continuous interest-activity score as our outcome measure, controlling for gender, age, years of education, recruitment centres and the first 20 PCs from EIGENSTRAT as covariates.

### Replication analysis using STAR*D

Following our initial findings, we used data from the Sequenced Treatment Alternatives to Relieve Depression Study (STAR*D) to replicate our primary results. Detailed information about STAR*D, including its demographic characteristics and genomic profile, has been previously described35,36,37. In brief, 1351 patients with MDD were recruited, with the phenotype defined as the sum of items with corresponding content in the baseline HAMD-17, QIDS-SR, QIDS-C and the research outcome assessor-rated 30-item Inventory of Depressive Symptomatology17, and with a genomic profile including 7,405,247 SNPs after quality control and imputation38. Further, a linear model in PLINK 1.09 was chosen, with age, gender, years of education, recruitment centre, and the first four population PCs included as covariates.

### Gene-based and pathway analysis

Emerging evidence has suggested that disease- or trait-associated genetic variants identified from GWASs tend to be enriched in genic regions including multiple associated variants at a single locus39,40. Therefore, we used fastBAT (a fast and flexible set-Based Association Test) with the P values from the LMM analysis for gene-set testing41, in order to discover genes associated with the interest-activity score based on the aggregated effect of a set of SNPs (e.g., SNPs within or close to a gene). The resulting gene-level P values were adjusted using Bonferroni correction (0.05/22484).
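
The Bonferroni arithmetic can be checked with a short Python sketch. The function name `bonferroni_threshold` is a hypothetical helper; the inputs (α = 0.05 and 22484 tests) are taken from the text, and the example P value is the one reported for KITLG in the Results.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Gene-level threshold for 22484 tested genes at alpha = 0.05.
gene_threshold = bonferroni_threshold(0.05, 22484)

# A gene-level P value of 3.09e-5 is compared against the threshold.
kitlg_significant = 3.09e-5 < gene_threshold
```

The threshold comes out at roughly 2.2 × 10⁻⁶, so a gene-level P value in the 10⁻⁵ range does not pass it.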

### Biological interpretation, heritability, and genetic correlation estimates

In order to further understand the resultant signals and their associations with the interest-activity score, we chose loci with an association P value less than 1 × 10⁻⁵ and used DEPICT (Data-driven Expression Prioritized Integration for Complex Traits) to accomplish gene prioritization and tissue/cell type enrichment analysis with a false discovery rate (FDR) set as 1%42. Recent studies have shown that mutation-intolerant genes which are presumed to hold critical biological functions are enriched in rare variants in psychiatric disorders such as autism and intellectual disability (ID)43,44; this pattern also extends to both rare and common variants for schizophrenia45. To test whether it also holds for common variants in our MDD-related phenotype, we investigated the enrichment of genes harboring SNPs attaining an association P value ≤ 10⁻⁵ in the set of loss-of-function (LoF) intolerant genes characterized by the Exome Aggregation Consortium (ExAC), setting the constraint metric pLI ≥ 0.9 (probability of being LoF intolerant) according to their recommendation46.
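
The text does not state which statistical test was used for the enrichment in the LoF-intolerant set. One common choice is a one-sided hypergeometric test, sketched below in Python. The function name, the choice of background (22484 genes, borrowed from the gene-based analysis) and the example counts of 40 selected genes with 14 in the intolerant set are all hypothetical; only the size of the intolerant set (3203 genes) comes from the Results.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_set, n_selected, n_overlap):
    """One-sided hypergeometric P value, P(X >= n_overlap).

    n_background: genes in the background
    n_in_set:     background genes that belong to the gene set
    n_selected:   genes selected by the GWAS threshold
    n_overlap:    selected genes that belong to the gene set
    """
    upper = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts for illustration only.
example_p = hypergeom_enrichment_p(22484, 3203, 40, 14)
```

A small P value would indicate that the selected genes overlap the intolerant set more often than expected when drawing genes at random from the background.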

Furthermore, aiming to estimate the phenotypic variance explained by common SNPs ($$h_g^2$$) in our sample and to explore traits which shared a common genetic effect with the interest-activity score, we chose LDSc (LD score regression)47 from LD Hub, a centralized database of summary-level GWAS results for 177 diseases/traits from different publicly available resources/consortia and a web interface that automates the LD score regression analysis pipeline for estimation of $$h_g^2$$ and of the genetic correlation between the target phenotype and multiple traits48.

### Association analysis with longitudinal change of anhedonia following treatment with antidepressants

The baseline interest-activity score from the study by Uher et al.17, chosen as the primary outcome measure in our GWAS, was found to significantly predict treatment response in both GENDEP and STAR*D. In order to investigate the potential association between the SNPs associated with the baseline interest-activity score and longitudinal change in the score, we summed across all the associated SNPs to calculate an unweighted PRS for each individual, then conducted association analysis between this PRS and the interest-activity score from week 1 to week 10 using a LMM. The fixed effects of the model included our predictor (PRS) and covariates (age, quadratic effect of age, gender, baseline interest-activity score, and recruitment centre), while the random effects included a random intercept and a random time effect (slope). The PRS was generated using PLINK26 and the above-mentioned association analysis was implemented using the package “nlme” in an R environment49.
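
As an illustration of what an unweighted PRS amounts to, the following Python sketch sums allele counts across a list of SNPs for one individual. It assumes that genotypes are coded as counts (0, 1 or 2) of the allele being scored and that missing genotypes are simply skipped; the function name and the example genotypes are hypothetical, and PLINK's own handling of missing data and scaling may differ.

```python
def unweighted_prs(allele_counts, snp_ids):
    """Sum allele counts over snp_ids; every SNP has weight 1.

    allele_counts: dict mapping SNP id -> 0, 1, 2 or None (missing).
    """
    total = 0
    for snp in snp_ids:
        count = allele_counts.get(snp)
        if count is not None:  # assumption: missing genotypes are skipped
            total += count
    return total


# Three of the reported loci, with made-up genotypes for one patient.
associated_snps = ["rs9392549", "rs831431", "rs10498321"]
patient = {"rs9392549": 1, "rs831431": 0, "rs10498321": 2}
score = unweighted_prs(patient, associated_snps)
```

In a weighted PRS each count would instead be multiplied by the SNP's effect size estimate before summing.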

## Results

### Demographic characteristics and genome-wide association analysis

#### Demographic characteristics

Of 796 people with genomic data, 759 had a baseline interest-activity score derived from factor analysis (286 males, 473 females). The mean age was 42.05 (11.59), mean years of education 12.31 (3.12), mean baseline MADRS 28.90 (6.77), mean baseline HDRS 21.88 (5.24), and mean baseline BDI was 28.10 (9.76).

#### Genome-wide association analysis

After imputation and quality control, 1,313,135 SNPs (of which 789,990 were imputed with high-quality imputation, i.e., info > 0.6, LD pruning at R2 < 0.5) in 760 individuals remained in the present analysis and as shown in Fig. 1, all study subjects were of European ancestry with no gross population stratification.

Association analysis of interest-activity scores using LMM identified 18 SNPs that passed genome-wide significance (5 × 10⁻⁸) when including gender, age, years of education and 20 PCs of the population structure from EIGENSTRAT as covariates. The top SNP from the analysis, rs9392549, in an intronic region of PRPF4B (pre-mRNA processing factor 4B) located on chromosome 6, had a P value of 2.07 × 10⁻⁹ (Fig. 2). Table 1 summarizes the top signals from the association analysis and Fig. 3 displays this as a circularized Manhattan plot. The genomic inflation factor (λ) was calculated as an index of any potential confounding effect in the analysis, and the results were consistent with potential confounding effects having been adequately controlled (λ = 0.9958, Fig. 4).
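
For readers unfamiliar with λ, the sketch below shows one common way to compute it from a list of association P values: convert each P value to a 1-df chi-square statistic and divide the observed median by the median expected under the null. The function name is an assumption, and the exact procedure used in this study is not described in the text.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Genomic inflation factor from two-sided P values in (0, 1]."""
    nd = NormalDist()
    # A 1-df chi-square statistic is the square of the matching z score.
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in p_values]
    expected_median = nd.inv_cdf(0.75) ** 2  # median of chi-square(1), about 0.455
    return median(chi2) / expected_median


# P values spread evenly over (0, 1) mimic a null distribution.
uniform_p = [i / 1000 for i in range(1, 1000)]
lambda_null = genomic_inflation(uniform_p)
```

Values close to 1 indicate little inflation of the test statistics, while values well above 1 suggest residual confounding or polygenicity.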

The replication analysis using the STAR*D dataset indicated that none of the associated SNPs found in the GENDEP dataset were replicated at a Bonferroni-adjusted significance level (0.03/18 = 0.0016); however, two of them, the top signal (rs9392549) and rs118190482 located in the intronic region of STAB2 (in LD with rs831431, R2 = 0.5), were nominally significant (P = 0.03 and 0.046 respectively, in Table 1).

### Gene-based and gene prioritization analysis

Gene-based association analysis indicated no gene was associated at gene-level significance (P value = 2 × 10⁻⁶). The gene with the strongest signal from the analysis was KITLG on chromosome 12 (KIT ligand, P value = 3.09 × 10⁻⁵).

Using DEPICT, one SNP, rs1001415, which is intronic in EFCAB2 (EF-hand calcium binding domain 2) on chromosome 1, was prioritized owing to sharing more similar biological functions with other associated loci, although the P value was only at a trend level (nominal P = 0.09). Interestingly, Westra et al. reported that rs1001415 is in high LD with a cis eQTL SNP (rs4658697) in an intronic region of a transcript (NM_001143943.1) of EFCAB250. Furthermore, gene-set analysis found that one gene ontology term, axon cargo transport (GO:0008088), was over-represented among the associated loci from our association analysis, with a nominal P value of 1.15 × 10⁻⁵. Cell/tissue annotation analysis showed that our associated loci were highly annotated in the MeSH first term of “hypothalamus” and the MeSH second term of “nervous system” (nominal P = 0.004). Although some results generated from DEPICT showed nominal significance, they failed to pass the FDR threshold. Nevertheless, our target genes were significantly enriched in the gene set (3203 genes) characterized by ExAC as mutation intolerant (P = 0.001).

### Heritability estimation and genetic correlation analysis

Estimation of $$h_g^2$$ showed that 69% of the phenotypic variance of the interest-activity dimension in our sample could be explained by common SNPs ($$h_g^2$$ = 0.69 ± 0.88). As shown in Table S1 and Figure S1, the genetics of the interest-activity score was highly positively correlated with Parkinson’s disease (PD) (rg = 0.83, se = 1.14), and with Alzheimer’s disease (rg = 0.43, se = 0.32). Moreover, its genetics was negatively correlated with that of the gray matter volume of nucleus accumbens (rg = −0.6492, se = 0.84), eczema (rg = −0.41, se = 0.44) and with subjective well-being (rg = −0.32, se = 0.47). This is consistent with a pleiotropic effect. However, the results should be interpreted with caution given that none of the P values generated from our genetic correlation analyses reached the significance threshold of 0.05.
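
The caution about significance can be made concrete with a simple normal-theory check, in which an estimate divided by its standard error is treated as a z score. The Python sketch below applies this to the reported PD correlation and to the heritability estimate; the function name is hypothetical, and the assumption that the reported ± value is a standard error and that the P values were derived this way is ours.

```python
from statistics import NormalDist


def wald_p(estimate, se):
    """Two-sided P value for estimate / se under a standard normal."""
    z = estimate / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_pd = wald_p(0.83, 1.14)  # genetic correlation with PD
p_h2 = wald_p(0.69, 0.88)  # SNP heritability of the interest-activity score
```

Both estimates are smaller than their standard errors, so neither P value comes close to 0.05 under this approximation.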

### Association analysis between the PRS and longitudinal change of anhedonia up to ten weeks following antidepressant treatment

The association analysis showed that the PRS calculated based on the GWAS of baseline interest-activity score was significantly associated with longitudinal change of anhedonia following antidepressant treatment (β = 1.73, P = 0.0023). In order to evaluate whether the top hit (rs9392549) from the GWAS of baseline interest-activity score solely drove the identified association, we conducted a secondary analysis using the same model conditioning on rs9392549; the association between the PRS and longitudinal change of anhedonia remained significant (β = 1.64, P = 0.0091).

## Discussion

To the best of our knowledge, this is the first genome-wide association analysis of anhedonia in patients with MDD. We used a LMM to conduct the association analysis, which identified 18 SNPs of genome-wide significance, with the most significant being rs9392549 in an intronic region of PRPF4B on chromosome 6 (P = 2.07 × 10⁻⁹). Although no gene was significant on gene-set testing, gene prioritization analysis highlighted one intronic SNP (rs1001415) in EFCAB2 at a trend level (P = 0.09), and the associated loci showed enrichment for a particular gene ontology term, axon cargo transport (GO:0008088). Furthermore, using LD score regression, we estimated that 69% of the variance in our phenotype was explained by common SNPs, and the markers associated with anhedonia were positively correlated with PD and with Alzheimer’s disease, while being negatively correlated with nucleus accumbens gray matter volume.

The use of a LMM for the genome-wide association analysis is in contrast to the classic general linear model (GLM) in how population stratification or other sample structure issues are addressed. Such confounding factors are detected and addressed in GLM by using genomic control51, ancestry inference (analysis of population structure)52,53,54 and PCA29,55. However, these strategies fail to account for sample features such as family structure or cryptic relatedness; for population stratification owing to ancient population divergence, methods like genomic control are relatively weak56. Linear mixed modeling by contrast fits population structure as a fixed effect and a similarity matrix between individuals as the variance-covariance structure of the random effect57; such a method has been shown to yield a more conservative λGC compared to other approaches57,58. Using a similar statistical model, the CONVERGE consortium conducted a genome-wide association analysis in a large cohort of Chinese female patients with severe MDD, with two significant loci being identified and replicated in different samples59. These two loci (rs12415800 and rs35936514 on chromosome 10), however, were not replicated in our study given the lower frequency of these variants in the European population.

One intronic SNP (rs9392549) in PRPF4B yielded the lowest P value in association with anhedonia (P = 2.07 × 10⁻⁹, replicated P = 0.03). PRPF4B, pre-mRNA processing factor 4 homolog B, is a kinase involved in mRNA splicing and in biological pathways such as inositol phosphate metabolism60. Patients with MDD have been shown to have alterations in mRNA splicing, especially in that of neurotransmitter receptors61,62. For instance, in suicide victims with a history of major depression, adenosine-to-inosine RNA editing within the coding sequence of the serotonin 2C receptor (5-HT2C) pre-mRNA was significantly decreased and this effect was reversed by treatment with the antidepressant fluoxetine63. Additionally, inositol phosphate has been repeatedly implicated in the pathophysiology of affective disorders including MDD, with potential new treatments arising64,65,66. For example, a double-blind, controlled clinical trial in MDD indicated that the overall improvement in scores on the Hamilton Depression Rating Scale was significantly greater for inositol than for placebo after 4 weeks of treatment67.

One of the two associated loci that were replicated at nominal significance, rs831431 (P = 1.92 × 10⁻⁸, replicated P = 0.046), is a brain eQTL located in the intronic region of STAB2, which encodes stabilin 2. Stabilin 2 plays a critical role in angiogenesis68. According to BRAINEAC69, rs831431 significantly affects the expression of one of STAB2’s transcripts (tID = 3429159), especially in the thalamus (eQTL P = 0.01). Although the precise role of STAB2 in the pathogenesis of MDD or anhedonia still remains unclear, it could be hypothesized that deficits in neuroplasticity, potentially mediated by abnormal angiogenesis, lead to dysfunction in pleasure-rewarding circuitry. This could occur in a temporally specific manner, analogous to the time-dependent gene expression that is commonly seen in genes related to neurodevelopment70.

Of the other associated loci, rs10498321 is in an intronic region of NPAS3. NPAS3, neuronal PAS domain protein 3, is a brain-enriched transcription factor; deficits in its expression can impair neurogenesis, especially in the hippocampus71. To date, NPAS3 has been mainly studied in schizophrenia and bipolar disorder72,73,74, and schizophrenia, especially with negative symptomatology, is another condition in which anhedonia may be a common feature; to our knowledge, this is the first report of an association between NPAS3 and an MDD-related phenotype. Intriguingly, one of the top signals (rs7973260)75 identified in a GWAS of depressive symptoms in a large cohort from the UK Biobank lies 18 kb downstream of rs650466, quasi-replicating the current finding and highlighting the potential importance of this genomic region in understanding the biological mechanism of MDD.

Given the modest replication using STAR*D, we carried out a genetic correlation analysis between STAR*D and GENDEP by executing the “sumsum” command in PRSice76, which takes respective summary statistics as input. The result displayed in Figure S2 indicated that although the two datasets were significantly correlated with each other at multiple P-value thresholds (PT at 0.04, 0.05, 0.2, 0.3, and 0.5), the variance explained by each other (R2) was relatively small, which may at least partly explain the relatively weak replication signal in STAR*D. Although it has been widely thought that QTs underpinning the symptomatology of psychiatric disorders could increase the power of the identification of risk variants, the way in which QTs are established has been inconsistent. Of note, the QT of anhedonia was defined in contrasting ways in GENDEP and STAR*D owing to differential measures available. While our study provides an alternative approach for GWAS with limited sample size, it points to the importance of future efforts to validate different measures of QTs along the lines of the RDoC strategy77.

In our gene prioritization analysis, one intronic SNP (rs1001415) in EFCAB2 was prioritized because it shared more similar biological functions with other associated loci. EFCAB2, EF-hand calcium binding domain 2, is located in the smallest overlapping region (SOR) at 1q44 together with three other genes: HNRNPU, FAM36A and NCRNA00201. Patients with microdeletions of this region display ID and seizures78,79, which implies a role in neurodevelopment and cognitive function. Of note, rs1001415 is in high LD with one cis eQTL SNP (rs4658697); therefore, we suggest that future studies could use rs1001415 as a proxy for rs4658697 for the expression of EFCAB2. In addition, one gene set (GO:0008088, axon cargo transport) was over-represented among our associated markers. It is therefore possible that dysfunctional axon cargo transport affected by our identified genes in brain regions relevant for reward circuitry80,81 may be associated with impaired neurotransmitter release (dopamine, etc.), putatively leading to anhedonic symptoms.

Although the cross-phenotype LD score regression failed to generate a genetic correlation with a significant P value, it provided a trend worth further elaboration. Specifically, anhedonia in our study was positively correlated with PD (rg = 0.8). In fact, anhedonia independent of clinical diagnosis and PD are both dopamine-dependent processes and anhedonia is one of the most commonly observed non-motor symptoms in PD82,83. Moreover, anhedonia was negatively correlated with nucleus accumbens gray matter volume (rg = −0.6). The accumbens is a key structure in the reward circuit; structural and functional changes in the accumbens have been repeatedly implicated in substance abuse-related and MDD-related anhedonia84,85. Nonetheless, any inference from our current findings should be made with the caveat that due to the lack of statistical significance, potential type I error (false positive error) cannot be excluded.

Furthermore, the significant association detected between our PRS and the longitudinal change in anhedonia is of interest in that it appears to offer insight not only into the polygenic underpinnings of anhedonia but also into its change during treatment. This preliminary association analysis of the PRS generated by our association findings illustrates the potential of applying such a polygenic profile to better our prediction of treatment response. This could be further tested in the response to treatment of other disorders in which anhedonia is also a feature.

## Strengths and limitations

Strengths of our study include the LMM which controls for confounding factors such as population stratification and cryptic relatedness in a perhaps more robust manner than GLM. However, there are limitations. Firstly, the sample size for our study is modest. Generally, the majority of power calculations used for GWAS employ a case-control design; the use of an endophenotype such as anhedonia, a QT of complex disease biologically hypothesized to be closer to underlying genetic variation, should increase the power of association10. Many approaches for linear mixed modeling of GWAS are computationally challenging, which makes such methodology less popular for GWAS of large sample sizes. Our study provided another new association strategy for GWAS of modest sample sizes, although replication of significant signals in a larger independent sample is required.

Secondly, 16 out of 18 SNPs identified in the association analysis have a MAF lower than 0.03. The MAF distribution of our genomic data indicated that 67% of alleles fall into the interval between 0.01 and 0.05 (Figure S3). Enrichment of signals in the lower bound of the MAF spectrum is methodologically recognized; we are aware that given the sample size, these associations may be false positives (a “winner’s curse”), as the number of individuals with a minor allele is very limited.
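
A back-of-the-envelope calculation shows how few carriers such variants have in a sample of this size. The Python sketch below assumes Hardy–Weinberg proportions and uses the 759 individuals with a baseline score and a MAF of 0.03 as inputs; the function name is hypothetical.

```python
def expected_carriers(n_individuals, maf):
    """Expected number of individuals with at least one minor allele under HWE."""
    return n_individuals * (1 - (1 - maf) ** 2)


carriers = expected_carriers(759, 0.03)
```

Under these assumptions only about 45 individuals would be expected to carry a minor allele at MAF 0.03, and fewer still at lower frequencies.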

Thirdly, not all patients were drug-free at the time of recruitment (baseline); some medications, such as antidepressants86,87,88 or benzodiazepines89, might have affected patients’ anhedonia level at baseline.

## Conclusion

In summary, this first GWAS of anhedonia in MDD identified a number of SNPs attaining genome-wide significance. The top hits include loci such as NPAS3, which has been associated with schizophrenia, another condition in which anhedonia may be a prominent feature. It is therefore possible that our findings are relevant not only for anhedonia in MDD, but also for anhedonia in other neuropsychiatric conditions. Consistent with this, cross-phenotype correlation analysis gave suggestive signals for PD and nucleus accumbens size. We suggest that further genetic exploration of anhedonia in MDD and other disorders could be a new and productive avenue that could lead to new treatments for this disabling feature of many neuropsychiatric conditions.