Major depressive disorder (MDD) is a common, disabling and frequently recurrent mood disorder that affects up to one in six individuals worldwide during their lifetime.1 A number of antidepressant treatments exist, but due to modest rates of remission and substantial rates of recurrence, there is a pressing demand for new interventions with better efficacy. Identification of specific genetic risk factors could help to elucidate the neurobiological basis of MDD, which would facilitate the development of novel treatment and possibly even prevention strategies.2

To date, efforts to identify specific susceptibility genes in MDD have had limited success. Linkage studies have suggested several regions in the genome that may harbor susceptibility genes for MDD but generally without replication.3 Of note, two recent linkage studies4, 5 have reported genome-wide significant linkage to chromosome 3p26-3p25, with a peak signal near the metabotropic glutamate receptor 7 gene GRM7. Still, due to the small sample sizes and the discrepancy in the phenotype classification in the two studies, larger family-based samples will be needed to confirm this finding.6 Genome-wide association studies (GWAS) of MDD have likewise reported largely inconclusive results.2, 7, 8, 9, 10, 11, 12, 13, 14 The pathogenesis of MDD is likely to be characterized by extensive etiological heterogeneity,11, 15 complex interaction of genetic and non-genetic factors,16 as well as substantial polygenic contribution of individually weak effects,12, 17 all of which are difficult to capture with single-SNP (single nucleotide polymorphism)-based association tests typically used in current GWAS analyses.

Here we utilize a multi-locus, gene-set-based association analysis—also known as pathway analysis18 or functional gene group analysis19—that uses previous biological knowledge of molecular function and cellular processes to facilitate the genetic dissection of MDD. A number of statistical strategies and methods exist for conducting gene-set-based analysis,18, 20, 21 but the basic goal of the proposed methods is to look for sets of biologically related genetic variants that, in aggregate, show enriched association with a target phenotype beyond what would be expected by chance. Typically, well-established canonical pathways or functional annotations of genes and proteins are used to define these sets of biologically related genetic variants. Of note, recent applications of gene-set-based GWAS analysis have provided new biological insights into the genetic basis of several major psychiatric disorders, including schizophrenia,21, 22 bipolar disorder20 and autism.23, 24

Gene-set-based association analysis has three major advantages that complement standard single-locus-based GWAS analysis. First, gene-set-based analysis has improved statistical power to detect subtle and collective effects of multiple genetic variants over single-locus-based tests under various polygenic disease models.25 This is particularly noteworthy given that recent GWAS studies have indicated substantial polygenic contribution of thousands of common genetic variants to major psychiatric disorders, including MDD.17, 26 We presume that such polygenic risk variants underlying MDD would involve dysfunction of specific molecular networks or cellular pathways, as appears to be the case in other complex disorders.21, 23 Secondly, gene-set-based analysis is robust to the extensive genetic heterogeneity of complex disorders.20 Different combinations of genetic variants may alter the function or the activity levels of a given pathway or a molecular mechanism, making it difficult to replicate associations with specific genes or variants across multiple studies. Dysfunction at the level of biological pathways or molecular functions might be detected more consistently regardless of heterogeneity in individual susceptibility variants or risk genes. Finally, gene-set-based analysis allows direct insight into disease mechanism and biology by translating a list of statistically ascertained disease-associated genomic regions into sets of cellular functions or molecular pathways that may underlie disease pathogenesis.

In this work, we applied an integrative gene-set-based GWAS analysis on MDD that utilizes statistical text-mining analysis27 to prioritize the list of pathways and functional gene sets to be tested for aggregated association with MDD. We focused on biological mechanisms indexed by previous MDD candidate genes selected from biological studies of the pathogenesis of MDD. Although these genes have not individually demonstrated genome-wide association with MDD at a single variant/gene level,28 they have been implicated in the etiology of MDD in a variety of biological and pharmacological studies. As such, they may index biological pathways that are relevant to genetic risk. Application of the statistical text-mining tool enabled us to elucidate the key biological relationships among the MDD candidate genes, which were then used to compile a list of relevant Gene Ontology29 (GO) gene sets. We then tested the enrichment of these gene sets in a meta-analysis of our MDD GWAS data sets. Such prioritized gene-set-based analysis thus avoids the pitfall of reduced statistical power entailed by examining the full range (tens of thousands) of GO gene sets.20, 30

The primary goals of this analysis, thus, were: (i) to prioritize biological pathways or molecular functions (that is, gene sets) that might be involved in the pathogenesis of MDD using text-mining analysis; and (ii) to test enriched association of the compiled gene sets using a meta-analysis of three large MDD GWAS studies (4346 cases and 4430 controls in total).

Materials and methods

Study samples

NESDA/NTR. The NESDA/NTR sample consisted of 1738 MDD cases, mainly from the Netherlands Study of Depression and Anxiety, and 1802 controls mainly from the Netherlands Twin Registry.11 Cases met criteria for a lifetime diagnosis of DSM-IV (Diagnostic and Statistical Manual of Mental Disorders, fourth edition) MDD, assessed with the Composite International Diagnostic Interview (CIDI), and did not have a primary diagnosis of schizophrenia or schizoaffective disorder, obsessive compulsive disorder, bipolar disorder or severe substance use dependence. Controls had no lifetime diagnosis of MDD or anxiety disorder and were also assessed to carry a low genetic liability for neuroticism, anxiety and depressive symptoms based on a factor score derived from longitudinal phenotyping measures. Genotype data from the Perlegen platform were obtained from dbGaP (, which consisted of 435,291 SNPs. All subjects were of self-reported western European ancestry.

QIMR. The Queensland Institute of Medical Research (QIMR) sample consisted of 1450 cases and 1703 controls from the Australian Twin Registry. The sample represents the QIMR cases and controls included in the analyses of the Psychiatric GWAS Consortium for MDD13 and overlaps considerably with the QIMR samples of the MDD2000+ study.12 MDD cases and controls were identified through psychiatric questionnaires administered to adult twins and their families recruited through the Australian Twin Registry. All cases met DSM-IV lifetime criteria for MDD based on the shortened CIDI or the Semi-Structured Assessment for the Genetics of Alcoholism-OZ interview instrument. The interviews were administered by trained telephone interviewers, closely supervised by a clinical psychologist. In cases where more than one family member met the criteria for being included in the control group, the individual with lowest neuroticism score was selected. Samples were genotyped as part of multiple QIMR projects31 on Illumina platforms (San Diego, CA, USA; 317K, CNV370-Quadv3 and Human610-Quad). A set of 270,799 genotyped SNPs common across the three Illumina platforms that survived quality control (QC) was carried forward to the imputation phase.

STAR*D. The Sequenced Treatment Alternatives to Relieve Depression (STAR*D) was a multisite, randomized clinical trial of outpatients with MDD that enrolled 4041 participants.10 Blood samples were collected from 1953 patients and 1204 psychiatrically screened controls from the National Institute of Mental Health (NIMH)-Genetics Repository.10 Cases met DSM-IV criteria for MDD and had a Hamilton Depression Rating Scale score of at least 14. Controls were screened for a range of psychiatric disorders based on DSM IV criteria and were excluded if they reported a previous diagnosis of schizophrenia, psychosis or bipolar disorder, or met criteria for a history of major depressive episodes. We included 1500 cases who were self-identified Caucasians. After quality-control procedures, genotyping data included 284,889 SNPs for 1223 MDD patients and 970 controls common across three Affymetrix (Santa Clara, CA, USA) genotyping platforms (Affymetrix Genome-Wide Human SNP 6.0, Affymetrix Genome-Wide Human SNP 5.0 and Affymetrix GeneChip Human Mapping 500K Array).

Data QC and imputation

QC procedures for the individual data sets are described in the original GWAS studies.10, 11, 12 In brief, subjects were excluded if they exhibit missing call rates >0.05, inconsistency in a genotype-based gender with a self-reported one, incompatible levels of autosomal heterozygosity (|F inbreeding coefficient estimates| >0.05), high identity by descent estimates with other subjects (PI_HAT >0.25) or are population outliers based on five nearest neighboring clustering on information by similarity (|Z-score| >3σ). All study subjects were of self-reported non-Hispanic European ancestry. SNPs were excluded if they exhibit missing rates >0.05, minor allele frequencies (MAF) <0.01 and the Hardy–Weinberg equilibrium deviation P<1 × 10−6 in controls. SNPs were further excluded if they show significantly different missingness between cases and controls (P<1 × 10−6) or non-random missingness based on genotypes (P<1 × 10−10). All QC steps were conducted using PLINK.32 To remove potential biases from discrepancies of SNP coverage across the three samples, we imputed non-genotyped SNPs using BEAGLE33 version 3.11 and Hapmap34 II CEU data (release 23, forward strand), except for the QIMR data set, which was imputed using MACH.35 Further details on the imputation procedure for the QIMR data set are given here.31 After QC, STAR*D, QIMR and NESDA/NTR include 1223/970, 1450/1703, and 1673/1757 cases/controls, respectively. Imputation was conducted within each sample. After filtering out SNPs with imputation quality scores <0.8, 1,930,980 SNPs remained across the three data sets. Our data analysis was based on 1,386,571 autosomal SNPs that exist in all of the three data sets.

Primary association analysis

PLINK32 was used to conduct single-SNP-based logistic regression of imputed allele dosage within individual samples (that is, NESDA/NTR, QIMR and STAR*D). To control for potential population sub-stratification, 10 multi-dimensional scaling analysis components were calculated using PLINK. The first four multi-dimensional scaling components showed inflated association with genetic data and thus were included as covariates in regression analysis. Meta-analysis was conducted using a fixed effect model on the dosage analysis results from the three data sets using PLINK.

Candidate gene set compilation

To compile a list of gene sets that are potentially most relevant to MDD pathogenesis, we analyzed a list of well-known MDD candidate genes using the text-mining analysis method GRAIL (Gene Relationships Among Implicated Loci).27 The detailed gene-set-compilation procedure is as follows: First, we assembled the list of 188 MDD candidate genes examined in five previous GWAS studies of MDD.2, 7, 10, 12, 28 The authors compiled their MDD candidate genes through literature search either because the genes have been implicated in susceptibility to MDD due to their biological function and/or association with MDD in previous candidate gene association studies. The list was fed to the GRAIL analysis to identify functional keywords that index the common biological relationships among the MDD candidate genes. The GRAIL analysis parameters were set as follows: genome assembly: release 22/hg18; HapMap population: CEU; functional data-source: PubMed text (December 2006); gene-size correction: on; gene list: all human genes within database. Basically, GRAIL quantified the functional relationships between the input candidate genes using a text-based similarity metric and extracted a list of functional keywords that occur significantly more often than would be expected by chance in the text literature describing the functionally related candidate genes. We used each of the identified keywords as a search term to retrieve relevant candidate gene sets from the GO resource (

Interval-based gene-set-enrichment analysis

To quantify the statistical significance of aggregated MDD association signals for candidate gene sets, we used a GWAS-specific set-based association analysis tool, INRICH36 ( The INRICH analysis procedure comprises three major steps: (i) linkage disequilibrium (LD)-based interval data generation to identify unique regions of association; (ii) empirical enrichment calculation using an interval-based permutation strategy; and (iii) second-step permutation for multiple testing correction at the gene-set level. First, PLINK LD-based clumping was used to generate a list of LD-independent-associated genomic regions from the meta-analysis results (clump-p1=0.005; clump-p2=0.05; clump-r2=0.5; clump-kb=250). Next, the enrichment statistics of each target gene set was calculated as a P-value that indicates the probability of observing a given number of overlaps between the MDD-associated genomic intervals and the reference genes mapped to the examined gene set under the null hypothesis of no-disease association. We defined gene regions as 20 kb up/downstream of the RefSeq transcription starting/ending sites for 17,529 genes on autosomal chromosomes based on the Human Genome Browser build hg18 (Note that recent expression quantitative trait loci studies have shown that 95% of common genetic variation affecting transcript expression resides within 20 kb of the transcription start and end sites of genes37). Finally, resampling-based second-step permutation was conducted to adjust the pathway-level empirical P-values for testing multiple candidate gene sets. Figure 1 illustrates our data analysis procedure.

Figure 1
figure 1

Multi-locus association analysis procedure for major depressive disorder (MDD). Our analysis consisted of two major steps: (1) candidate gene-set compilation and (2) enrichment association analysis. The gene-set-compilation step aims to prioritize groups of biologically relevant genes that are most relevant to MDD pathogenesis. First, a list of widely studied MDD candidate genes was assembled from previous GWAS (genome-wide association studies) studies. The text-mining tool GRAIL (Gene Relationships Among Implicated Loci) was used to identify biological relationships between the selected genes. We used the GRAIL biological keywords to select relevant gene sets from the Gene Ontology database. The second enrichment analysis step aims to identify gene sets that show aggregated association with major depression than would be expected by chance. In order to control for different gene sizes, SNP density and linkage disequilibrium (LD) across genomic region and LD-independent MDD association regions were generated using PLINK. Enrichment of the candidate gene sets was examined using GWAS-specific multi-locus association analysis tool, INRICH, based on the LD-independent regions. The second-step permutation strategy in INRICH was conducted for correcting the testing of multiple gene sets that are not independent of each other.


Primary association analysis

No SNP demonstrated genome-wide significant association with MDD in a single-SNP-based meta-analysis of NESDA/NTR, QIMR and STAR*D. Supplementary Figures S1 and S2 show the Manhattan and quantile-quantile plots for the meta-analysis of these data sets, respectively. The genomic control inflation factor λ (that is, the ratio of the observed median χ2 to that expected by chance) was 1.046, while λ1000 was 1.011 (that is, λ rescaled to a sample size of 1000 cases and 1000 controls38), indicating negligible inflation of association test statistics due to potential confounding factors.

Candidate gene-set compilation

Supplementary Table S3.1 summarizes the list of 188 MDD candidate genes assembled from five previous MDD studies.2, 7, 10, 12, 28 Using these genes as input, GRAIL quantified their functional relationships using a text-based similarity measure (Supplementary Tables S3.2 and S3.3), and identified 11 keywords that are commonly associated with the MDD candidate genes in the literature: ‘serotonin’, ‘dopamine’, ‘NMDA (N-methyl-D-aspartate)’, ‘glutamate’, ‘neuron’, ‘GABA’, ‘adrenergic’, ‘agonist’, ‘cGMP’, ‘synaptic’ and ‘phosphodiesterase’. Table 1 lists these 11 keywords and their associated MDD candidate genes. Using each of these keywords as a search term, we identified 178 GO terms (that is, gene sets) with at least three human reference genes in the domains of biological process, molecular function and cellular complex (Supplementary Table S3.4).

Table 1 List of text-mining-based biological keywords representing MDD candidate genes

Gene-set enrichment analysis

PLINK LD-clumping of the MDD meta-analysis results (4346 cases and 4430 controls in total) yielded 1477 genomic regions with independent association signals; 539 intervals among them overlap with annotated genic regions according to the NCBI Entrez RefSeq Gene database (hg18, Homo Sapiens). After gene-set-based association analysis and multiple testing correction, only one of the 178 target gene sets, the glutamatergic synaptic transmission set, remained significant (GO:0035249; empirical enrichment P=6.90 × 10–4; corrected enrichment P=2.89 × 10–2). This glutamatergic synaptic transmission set comprised of 16 genes, of which 6 genes include LD-independent MDD-associated genomic regions at a P-value<5 × 10–3. Varying the regulatory genic regions from 0kb, 20kb, 50kb to 100kb of up/downstream of transcription starting/ending sites resulted in similar enrichment results (data not shown). Table 2 summarizes the set-based association analysis results for the top 10 gene sets (see Supplementary Table S4.1 for the top 30 gene sets).

Table 2 Gene-set-based association analysis results of the top 10 gene sets for major depressive disorder


In recent years, genome-wide association analysis has transformed our understanding of the genetic etiology of various common and complex human disorders, but progress has been less rapid for studies of MDD. Along with several practical issues (for example, insufficient sample size, substantial phenotypic heterogeneity or large environmental factors), we presume that common genetic variants confer individually very small effects on risk of MDD, and therefore, standard single-locus-based association analysis is likely to have limited power to dissect the pathogenesis of MDD. Using a text-mining-based statistical analysis and the GO resource, we compiled a collection of 178 GO sets that represent potential biomolecular mechanisms underlying MDD pathogenesis. Multi-locus enrichment analysis on these gene sets uniquely implicated genes involved in glutamatergic synaptic neurotransmission (enrichment P=6.9 × 10–4).

A pathophysiological role of the glutamate system in MDD has been suggested consistently in pre-clinical and clinical studies. The glutamatergic system is a critical mediator of stress responses through regulating the hypothalamic–pituitary–adrenal axis function.39, 40, 41 In animal models of depression, alterations in glutamatergic neurotransmission proteins induce depressive-like behaviors.42, 43 Abnormal activity of the glutamatergic system has been attributed to impairments in synaptic and neural plasticity often observed in patients with severe or recurrent mood disorders.44 Post-mortem MDD studies have reported dysregulation of glutamate levels and glutamate signaling genes, as well.45, 46, 47 Perhaps the most convincing evidence comes from studies suggesting rapid antidepressant effects of glutamatergic interventions, including riluzole48 and, more recently, intravenous ketamine.49, 50, 51

Our analysis supports these previous findings by demonstrating the statistical enrichment of glutamatergic synaptic genes in a meta-analysis of over 8700 MDD cases/controls. The glutamatergic synaptic transmission gene set (GO:0035249) includes 16 genes (Table 3), of which six genes, SLC1A4, CACNA1A, GRM8, PARK2, UNC13A and SHC3, showed nominal association with MDD in our meta-analysis (P<5e-03). Two genes GRM8 and SHC3 have previously been implicated in depression-related phenotypes. The metabotropic glutamate receptor GRM8 inhibits presynaptic glutamate release.52 Nominal association was reported for GRM8 with trait depression in a recent GWAS meta-analysis of European and US samples.53 Multiple studies reported that GRM8 knock-out mice exhibit anxiety-related phenotypes.54, 55, 56 SHC3 is a signaling adaptor involved in the signal transduction pathways in neurons. It has been implicated in modulating hippocampal synaptic plasticity underlying learning and memory57 and associated with nicotine dependence.58

Table 3 MDD association results of 16 genes involved in glutamatergic synaptic transmission GO gene set

Some limitations should be noted when interpreting the current findings. First, although our analysis indicates statistically enriched association of glutamatergic synaptic genes, we were unable to determine how these genes confer risk of MDD. Functional analysis of glutamatergic genes in MDD is necessary to decipher the underlying disease biology. Secondly, our analysis resorts to predefined GO sets that are indexed by biological keywords representing previous MDD candidate genes. This type of targeted subset analysis increases statistical power to detect true biological gene sets if true sets are included in the examined list. However, other disease gene sets not related to previous MDD candidate genes could have been missed in our analysis. Another caveat is that by using established GO sets, previously unknown relationships among genes could not be examined. Lastly, the present work defines regulatory SNP regions to 20kb up/downstream of transcription starting/ending sites. Most common genetic variations affecting gene expression are known to reside close to transcription starting and ending sites, but the optimal boundaries for defining regulatory regions likely vary by genes. Nevertheless, we observed similar results for a range of gene region windows (0–100 kb up/downstream of coding regions). As regulatory regions are further refined, additional findings may emerge. For example, incorporating known expression quantitative trait loci or methylation quantitative trait loci data is a promising way to assign non-genic, functional SNPs to gene loci.

In conclusion, the present work provides novel evidence for a potential etiological role of glutamate-mediated synaptic neurotransmission in MDD. Further characterization of the molecular and cellular mechanisms of this glutamatergic synaptic system in the brain will be imperative to understand the disease biology of MDD. Given increasing evidence that modulators of NMDA receptor function has rapid antidepressant effects, glutamatergic systems also merit further investigation as therapeutic targets.