## Introduction

Major depressive disorder (MDD) is a complex disorder that is characterized by persistent dysphoria and is often accompanied by considerable morbidity1,2,3 and mortality2,4. Because MDD has a lifetime prevalence of almost 15%5, tends to start early in life6, and is often chronic7,8, it is the leading contributor to disability worldwide9,10. In comparison with other (psychiatric) disorders, discerning the biological basis of MDD has been difficult. Only very recently, a number of genetic variants were identified and replicated11,12. However, these variants had small effect sizes and explained only a small proportion of the disease risk.

DNA methylation is an epigenetic modification that provides stability and diversity to the cellular phenotype. Because methylation is dynamic in nature and can be altered by environmental factors, it can potentially account for key clinical features of MDD such as its episodic nature or mediate the effects of environmental risk factors such as stress13,14,15. Therefore, methylome-wide association studies (MWAS), which test a genome-wide set of methylation sites for association with an outcome of interest, are promising complements to genome-wide association studies (GWAS) of genetic variants. Of particular interest are methylation sites (CpGs) that are created or destroyed by single-nucleotide polymorphisms (SNPs). These sites, commonly referred to as CpG-SNPs, may show variation in both methylation and sequence, and may therefore convey information beyond either of the two data types alone. However, methylation-dependent association signals in CpG-SNPs are not captured by GWAS and are very poorly captured by a regular MWAS. Therefore, a specific CpG-SNP analysis is needed to detect these signals.

Regular GWAS studies detect differences in allele frequencies between cases and controls. In contrast, a CpG-SNP analysis tests whether groups of cases and controls with the same genotype show differences in methylation at these sites. Thus, these two analyses capture different signals. Similarly, while a regular MWAS detects differences in methylation it does not account for differences in genotype and will therefore often lack the statistical power to detect association signals for CpG-SNPs. A CpG-SNP MWAS remedies this by including information on the actual genotypes of each subject.

Furthermore, the link between sequence variation and methylation levels at these sites may allow CpG-SNPs to function as important cis-regulatory polymorphisms that connect genetic variation to variation in methylation. For example, the alleles present and the methylation levels observed at a specific CpG-SNP have been associated with a variety of regulatory functions16,17,18,19. In addition, in a high-density analysis of methylation quantitative trait locus (meQTL), CpG-SNPs were involved in the majority of all identified meQTLs20.

To study whether CpG-SNPs contribute to MDD disease risk, we used a sequencing-based approach that provides nearly complete coverage of all CpGs21,22, including close to 1 million CpG-SNPs. To further explore the MWAS findings and their potential relevance for MDD, we also tested for their overlap with results from recent GWASs.

## Materials and methods

### Description of the NESDA sample

DNA from blood was obtained from 1200 individuals from the Netherlands Study of Depression and Anxiety (NESDA). MDD was diagnosed using the DSM-IV-based Composite International Diagnostic Interview (CIDI version 2.1) that was administered by specially trained research staff23. In addition to a current MDD diagnosis, cases had a score >14 on the IDS-SR3024, a 30-item self-report measure of depression symptoms. Controls had no lifetime psychiatric disorders and an IDS-SR30 score <14. The sample selection was further based on the availability of good-quality GWAS genotype information from a previous investigation25 (for a summary description, see the Supplementary Note). For further details about NESDA, and demographic and clinical characteristics of participants used for the present study, see Table S1. The study was approved by the ethical committees of all participating locations, and participants provided written informed consent.

### Assaying the methylome with MBD-Seq

We assayed the methylome using an optimized protocol for methyl-CpG binding domain sequencing (MBD-Seq) that provides almost complete coverage of all CpGs in the genome21. In short, we used ultrasonication to shear genomic DNA into fragments of, on average, 150 bp, followed by enrichment with MethylMiner (Invitrogen) to capture the methylated fraction of the genome. The captured fragments were eluted and used to create a barcoded sequencing library for each methylation capture. Labeled sequencing fragment libraries were pooled in equal molarities and sequenced on a NextSeq500 instrument (Illumina). To ensure consistency in the sample preparation, MethylMiner captures and library constructions were both performed using Biomek NxP robotics (Beckman Coulter). Samples were processed in a randomized order, and all laboratory procedures were performed blind to any phenotype information. The sequence reads were aligned to the human reference genome (hg19/GRCh37) using Bowtie226.

### Data processing and quality control

Quality control and data processing (Supplementary Note) were performed using our RaMWAS Bioconductor package, which is specifically designed for large-scale methylation studies. After rigorous quality control of samples, reads, and CpGs, 1132 subjects (320 controls and 812 cases) with an average of 48.7 million reads per sample (81.9% of all reads) remained. For each of these individuals, our dataset included high-quality methylation information for 21,869,561 commonly methylated CpGs27. Among these, 970,414 were common CpG-SNPs (CpGs created/destroyed by SNPs with minor allele frequency > 10%) that were used for MWAS. To identify the CpG-SNPs, we used directly genotyped and imputed genotype information (Supplement) from the NESDA participants. The imputed SNPs were filtered by imputation R² ≥ 0.9 and minor allele frequency ≥ 0.1 in cases and controls. Finally, an in silico experiment described elsewhere28 was used to remove CpG-SNPs in loci showing alignment problems.
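To make the CpG-SNP definition and the SNP filters concrete, the following is a minimal Python sketch, not the RaMWAS implementation. It assumes a biallelic SNP described on the forward strand by its reference allele, alternative allele, and the two flanking reference bases. The function names (`has_cpg`, `classify_cpg_snp`, `passes_filters`) are hypothetical, and the thresholds are the ones quoted in the text.

```python
def has_cpg(prev_base: str, allele: str, next_base: str) -> bool:
    """Return True if `allele` forms a CpG dinucleotide with a neighbor.

    On the forward strand a CpG is a C immediately followed by a G, so the
    allele can take part either as the C (followed by G) or as the G
    (preceded by C).
    """
    return (allele == "C" and next_base == "G") or (
        prev_base == "C" and allele == "G"
    )


def classify_cpg_snp(prev_base: str, ref: str, alt: str, next_base: str) -> str:
    """Classify a SNP as creating or destroying a CpG relative to the reference."""
    ref_cpg = has_cpg(prev_base, ref, next_base)
    alt_cpg = has_cpg(prev_base, alt, next_base)
    if ref_cpg and not alt_cpg:
        return "destroys"
    if alt_cpg and not ref_cpg:
        return "creates"
    return "neither"  # no CpG with either allele, or a CpG with both


def passes_filters(imputation_r2: float, maf_cases: float, maf_controls: float,
                   min_r2: float = 0.9, min_maf: float = 0.1) -> bool:
    """Apply the imputation-quality and allele-frequency thresholds from the text."""
    return (imputation_r2 >= min_r2
            and maf_cases >= min_maf
            and maf_controls >= min_maf)
```

For instance, a C>T SNP in the reference context A[C]G removes the CpG, so `classify_cpg_snp("A", "C", "T", "G")` returns `"destroys"`.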

### MWAS of CpG-SNPs

To test for association between the methylation level at each CpG-SNP and MDD, we performed a regression analysis with four sets of covariates. First, we regressed out 19 assay-related variables (i.e., potential technical artifacts) including the quantity of methylation-enriched DNA captured, sample batches, and peak location22. Second, we regressed out the demographic variables age and sex. Third, to avoid confounding due to cell-type heterogeneity, we regressed out blood cell type proportions as estimated from the methylation data29 using MBD-Seq “reference methylomes” we generated after isolating all common cell types in blood30. Fourth, principal component analysis was used to capture any remaining unmeasured source of variation. Specifically, using a scree test we selected the first principal component.

The MWAS was performed by fitting the following regression equation:31

$$Y = b_0 + b_1\,\text{CpG-SNP} + b_2\left( \text{CpG-SNP} \times \mathrm{MDD} \right) + b_3\,\mathrm{MDD} + b_4X_1 + \ldots + b_kX_k + E,$$

where Y are the CpG scores, b0 is the intercept of the regression line, b4, …, bk are the effects of the covariates, and E are the residual effects. The CpG-SNP is coded as 0, 1, and 2, which corresponds to having 0, 1, or 2 copies of the SNP allele that creates/destroys the CpG relative to the reference genome. MDD is coded 0 for controls and 1 for cases. Figure 1 describes nine scenarios for how the regression lines change with alterations of the b2 and b3 parameters when b1 is equal to 1. A non-zero value of parameter b1 indicates that the site is methylated, with the amount of methylation being proportional to the number of CpGs (i.e., the site has a methylation quantitative trait locus or “meQTL effect”). Parameter b2 estimates the case-control difference at the CpG-SNP site that is proportional to the number of CpGs (i.e., the “CpG-SNP dose effect”). Parameter b3 captures case-control differences from nearby sites and thus does not depend on the number of CpG-creating alleles of the SNP (i.e., a “local effect”). MBD-Seq assays the methylation of regions that are about the size of the sequenced fragments (~150 bp). Therefore, part of the differences observed at the CpG-SNP may reflect the effects of nearby CpGs, resulting in non-zero values of b3. In the overall association (i.e., “CpG-SNP MWAS”) we tested the null hypothesis, H0: b2 = b3 = 0.
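The interplay of the dose effect (b2) and the local effect (b3) can be made explicit with a small Python sketch. It evaluates only the fixed part of the regression equation and leaves out the covariates and the residual. The function names and the parameter values in the usage note are illustrative assumptions, not estimates from this study.

```python
def expected_score(n_alleles: int, mdd: int,
                   b0: float, b1: float, b2: float, b3: float) -> float:
    """Expected CpG score for a genotype (0, 1, 2) and MDD status (0 or 1).

    Covariate terms and the residual E of the full model are omitted.
    """
    return b0 + b1 * n_alleles + b2 * (n_alleles * mdd) + b3 * mdd


def case_control_difference(n_alleles: int, b2: float, b3: float) -> float:
    """Difference in expected score between cases and controls.

    Subtracting the control line from the case line cancels b0 and b1,
    leaving the local effect plus the dose effect scaled by genotype.
    """
    return b3 + b2 * n_alleles


if __name__ == "__main__":
    # Hypothetical values: negative dose effect, positive local effect.
    for g in (0, 1, 2):
        print(g, case_control_difference(g, b2=-0.25, b3=0.5))
```

With these hypothetical values the case-control difference is 0.5, 0.25, and 0.0 for genotypes 0, 1, and 2, which shows how a local effect can offset a dose effect of opposite sign.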

### Permutation of CpG-SNP MWAS to study the null distribution

We used permutations to test whether the lambda observed for the MDD CpG-SNP MWAS was caused by associations with the outcome variable or by a test statistic distribution that did not follow the theoretical null. Using exactly the same dataset, we performed MWAS for 100 permutations of the MDD outcome variable and recorded the lambdas. Next, the observed association P-values from the MDD CpG-SNP MWAS were corrected for the average permutation-obtained lambda.

### Replication of cumulative MWAS signals by resampling

To study the significance of the cumulative MWAS signals, we used the “ramwas7riskScoreCV” function in RaMWAS. Specifically, the function uses elastic nets32,33,34 as implemented in the R glmnet package. Elastic nets are akin to multiple regression analysis but are suitable for our scenario, where the number of predictors is much larger than the number of observations. Elastic nets were fitted by setting the alpha parameter to zero (i.e., ridge regression that retains all predictive sites in the model). To avoid overfitting, k-fold cross-validation was used35. That is, the sample was randomly partitioned into k = 10 equal-sized subsamples. Of the k subsamples, k−1 were used as a “training” set to fit the elastic net and obtain weights for each predictive methylation site. The estimated weights were then used in the remaining “test” set to predict the outcome from the methylation data. By alternating the subjects used in the training and test sets, predictions were obtained for all subjects in the study. RaMWAS repeats the entire cycle of CpG-SNP selection through MWAS followed by estimation of prediction weights using elastic nets for each of the k folds. Because neither the selection of CpG-SNPs nor the estimation of their weights is affected by the participants in the test set, we obtain unbiased predictions of the outcome for each subject. Furthermore, the score of CpG-SNPs is to a large extent determined by the number of CpGs. To capture only effects associated with MDD, we removed the effect of the number of CpGs from the methylation score prior to conducting the “in sample” replication. By testing whether these methylation predictions are significantly correlated with actual MDD status, we performed an “in sample” replication of the cumulative MWAS signal.
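The key property of this procedure, that marker selection and weight estimation are both repeated inside every training fold, can be sketched in plain Python. This is not the RaMWAS or glmnet code. As stated simplifications, marker selection uses the absolute case-control mean difference as a stand-in for the MWAS, and ridge regression is solved in closed form with a fixed penalty `lam`. All names are hypothetical, and every training fold is assumed to contain both cases and controls.

```python
import random


def kfold_indices(n, k, seed=0):
    """Randomly partition range(n) into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_top_markers(X, y, rows, n_top):
    """Rank markers by |mean(cases) - mean(controls)| using only `rows`."""
    cases = [r for r in rows if y[r] == 1]
    controls = [r for r in rows if y[r] == 0]

    def diff(j):
        return abs(sum(X[r][j] for r in cases) / len(cases)
                   - sum(X[r][j] for r in controls) / len(controls))

    return sorted(range(len(X[0])), key=diff, reverse=True)[:n_top]


def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x


def ridge_fit(X, y, lam):
    """Closed-form ridge on centered data: (X'X + lam*I) w = X'y."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    A = [[sum(r[a] * r[c] for r in Xc) + (lam if a == c else 0.0)
          for c in range(p)] for a in range(p)]
    rhs = [sum(r[a] * t for r, t in zip(Xc, yc)) for a in range(p)]
    return solve(A, rhs), x_mean, y_mean


def cross_validated_scores(X, y, k=10, n_top=2, lam=1.0, seed=0):
    """Out-of-fold predictions; the test fold never informs selection or weights."""
    n = len(X)
    scores = [None] * n
    for test in kfold_indices(n, k, seed):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        top = select_top_markers(X, y, train, n_top)
        w, x_mean, y_mean = ridge_fit([[X[i][j] for j in top] for i in train],
                                      [y[i] for i in train], lam)
        for i in test:
            scores[i] = y_mean + sum(
                wj * (X[i][j] - mj) for wj, j, mj in zip(w, top, x_mean))
    return scores
```

The returned scores can then be correlated with the true outcome. Because each score comes from a model that never saw that subject, a significant correlation is not an artifact of overfitting.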

### Permutation-based enrichment test of overlap

To perform enrichment tests of the overlap between datasets, we used the “shiftR” R package. shiftR first maps the two datasets to each other based on chromosomal location. In our analyses, no flanking regions were used. Thus, for SNPs we considered a single base position and for CpGs we considered two bases. Next, the P-values are used to cross-classify each mapped marker in the two datasets as being in the top or bottom. Based on the resulting 2 × 2 tables as input, shiftR tests the null hypothesis that the enrichment odds ratio equals 1. To perform these tests, shiftR uses circular permutations36. Specifically, through fast bitwise operations, it shifts the mapping of the two datasets by a single random integer in each permutation. This approach to generating the empirical test statistic distribution under the null hypothesis preserves the correlational structure of the data. We used 1 million permutations for each test. Multiple thresholds can be specified to define “top findings” (for our analyses, we used the top 1% and 5%). To account for this “multiple testing”, the same thresholds are used in the permutations, where the test statistic distribution under the null hypothesis is generated from the most significant (combination of) thresholds.
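The idea behind a circular permutation test can be sketched in a few lines of Python. This is an illustration of the principle only, not the shiftR implementation: it uses list rotation rather than bitwise operations, handles a single threshold, and adds a 0.5 continuity correction to the odds ratio as an assumption to avoid division by zero. The inputs are two equally long 0/1 lists, ordered by genomic position, that flag whether each mapped marker is a top finding in dataset A and in dataset B. Function names are hypothetical.

```python
import random


def odds_ratio(top_a, top_b):
    """Enrichment odds ratio from the 2 x 2 cross-classification.

    A 0.5 correction is added to every cell (an assumption of this sketch).
    """
    n11 = sum(1 for a, b in zip(top_a, top_b) if a and b)
    n10 = sum(1 for a, b in zip(top_a, top_b) if a and not b)
    n01 = sum(1 for a, b in zip(top_a, top_b) if b and not a)
    n00 = len(top_a) - n11 - n10 - n01
    return ((n11 + 0.5) * (n00 + 0.5)) / ((n10 + 0.5) * (n01 + 0.5))


def circular_permutation_test(top_a, top_b, n_perm=1000, seed=0):
    """One-sided empirical P-value for enrichment (odds ratio > 1).

    Each permutation rotates dataset B by one random offset, which breaks
    the pairing between the datasets while keeping the ordering, and hence
    the local correlation structure, within each dataset intact.
    """
    n = len(top_a)
    rng = random.Random(seed)
    observed = odds_ratio(top_a, top_b)
    exceed = 0
    for _ in range(n_perm):
        shift = rng.randrange(1, n)  # non-zero circular offset
        shifted = top_b[shift:] + top_b[:shift]
        if odds_ratio(top_a, shifted) >= observed:
            exceed += 1
    # Add-one correction so the empirical P-value is never exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)
```

Rotating rather than shuffling is the design choice that matters here: neighboring markers tend to have correlated test statistics, and a rotation keeps neighbors together, so the null distribution reflects that dependence.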

### Three GWAS

Three independent (meta-analysis of) GWASs were recently reported for MDD or related phenotypes. Similar to the phenotyping in the NESDA sample, the 23andMe study12 and the study by the Converge Consortium37 determined phenotype status using information about current or prior MDD diagnosis. In contrast, the GWAS meta-analysis performed by the Social Science Genetics Association Consortium (SSGAC)11 studied depressive symptoms, which for the majority of the individuals (>105,000 individuals out of 161,460) were assessed based on self-reported frequency an individual had experienced feelings of unenthusiasm/disinterest and depression/hopelessness during the past 2 weeks. Thus, this assessment was not a clinical diagnosis of depression nor a validated method for assessing depression symptoms. In contrast, when SSGAC studied neuroticism, an MDD-related phenotype, the status for the majority of individuals was assessed using a validated questionnaire that applied different harmonized neuroticism assessment batteries (n = 63,661) and a 12-item version of the Eysenck Personality Inventory Neuroticism38 (n = 107,245). Therefore, for the purpose of comparison with our MWAS for MDD, we used the SSGAC GWAS meta-analysis results of neuroticism11.

For calculating the enrichment test statistic, shiftR classifies markers as being among the top vs. bottom results. However, from the 23andMe study, we could only get access to the P-values from the top 10,000 SNPs. To address this restriction, we used SNPs retained in the multiple Psychiatric Genetic Consortia (http://www.med.unc.edu/pgc) studies after quality control. After removing the 10,000 top 23andMe SNPs, we assumed that these common and QC’ed SNPs were likely tested or were in LD with tested SNPs in the 23andMe study but yielded less significant P-values than those of the top 10,000. The top 10,000 SNPs all had P-values < 10⁻⁵. To define a second threshold for the 23andMe study, we also selected the 745 SNPs with P-values < 10⁻⁸. To account for this “multiple testing”, the same two thresholds were used in the permutations. To maximize the compatibility of the analysis, all GWAS datasets were subjected to the same procedure as used for the 23andMe study.

## Results

### Methylome-wide CpG-SNP analysis

We utilized the methylation data in combination with genotype information from the same individuals to perform a MWAS involving 970,414 common CpG-SNPs. Permutations of the MWAS generated an average lambda of 1.02 with a 95% confidence interval from 1.0087 to 1.0321. Thus, as shown in the Q-Q plot (Fig. 2a), the slightly inflated lambda (lambda = 1.062) observed for the MDD CpG-SNP MWAS is likely caused by a combination of true associations and a test statistic distribution that did not follow the theoretical null distribution. Because obtaining permutation P-values for each site would be impractical (too time-consuming), we instead controlled for the deviation from the theoretical null distribution. Thus, the P-values were corrected for the average permutation-obtained lambda (Fig. 2b).

The Manhattan plot (Fig. 2c) shows 27 suggestively significant loci (P < 1.00 × 10⁻⁵ after lambda correction) across the genome (Table 1). In Fig. S1, we show the regression plots for all the 27 sites. Twenty-five (92.6%) of the sites showed that the methylation levels were dependent on the number of CpG alleles (i.e., there was a significant meQTL effect) and 23 sites (85.2%) showed that this effect was different between cases and controls (i.e., there was a significant CpG-SNP dose effect). Thus, the associations observed for the two sites lacking CpG-SNP dose effects, as well as for the two sites that did not show significant meQTL effects, are likely caused by local effects from nearby CpGs.

Focusing on the 23 CpG-SNPs with both meQTL and CpG-SNP dose effects, we identified five sites (21.7%) with a positive dose effect. These sites showed a consistent pattern where the case-control difference gets bigger with more CpG-creating alleles and where the cases show higher methylation levels than the controls. The remaining 18 sites (78.3%) showed a negative dose effect. Thus, the negative dose effect occurred significantly more often (P = 0.0040) than expected by chance. A negative dose effect translates to (Fig. 1, right column) a case-control difference that gets bigger with more CpG-creating alleles and where the cases show lower methylation levels than the controls. As was shown in Fig. 1, the CpG-SNP MWAS associations (detected with MBD-Seq data) are, in addition to a meQTL effect and a dose effect, also influenced by the local effect from nearby CpGs. This local effect can either enhance or diminish the CpG-SNP dose effect.

The deviation of the observed P-values from the main diagonal, observed in the Q-Q plot (Fig. 2b) after correction for artificial inflation, suggests multiple sites are potentially associated with MDD. To study the significance of the cumulative CpG-SNP MWAS signal for large portions of the top markers, we used a resampling approach that fits elastic nets and employs k-fold cross-validation to avoid overfitting and obtain an unbiased estimate of the cumulative effect across markers. Results showed that the cumulative association was significant (P = 4.01 × 10⁻⁸), with the signal coming from the top 15,000 markers.

### Overlap between CpG-SNP MWAS and GWAS

When comparing our MWAS results with three recent GWASs, a significant enrichment, or a trend toward enrichment, was observed for all three GWASs when using the top 1% of results (the 1% threshold) for the CpG-SNP MWAS and the most stringent threshold for each of the three GWASs (Table 2). The highest enrichment was observed with the 23andMe study (P = 4.9 × 10⁻³, OR = 5.00) followed by SSGAC (P = 3.8 × 10⁻², OR = 1.42) and Converge (P = 8.1 × 10⁻², OR = 1.36). The overlap included 55 CpG-SNP sites (Table S2). The most significant site (P = 4.40 × 10⁻³) in the CpG-SNP MWAS that overlapped with the GWAS data was located in the Roundabout, axon guidance receptor, homolog 2 gene (ROBO2). The overlapping CpG-SNPs included 26 genes present in GO. These genes were overrepresented (P < 0.01) in 12 level-5 GO terms (Table 3). The most significant term (P = 4.57 × 10⁻⁴) was “Regulation of synapse organization” which, among other genes, included ROBO2.

## Discussion

Here we present the first MWAS of common CpG-SNPs (CpGs created/destroyed by SNPs with minor allele frequency > 10%) in MDD cases and controls. The methylation data were generated using a sequencing-based approach and involved 970,414 CpG-SNPs and 1132 individuals. Furthermore, we investigated the overlap of this study with recent GWASs for MDD or related phenotypes. The MWAS suggested that multiple sites are potentially associated with MDD, and resampling showed that the cumulative signal replicated. Furthermore, permutation-based enrichment tests suggested significant overlap between top findings from the MWAS and recent GWASs.

### Methylome-wide CpG-SNP analysis

The majority of the associated CpG-SNPs that expressed a significant meQTL effect and a significant dose effect in the MWAS showed a distinct pattern where methylation increased with the number of CpG alleles present, but where this increase was attenuated in MDD cases compared to controls. Thus, cases often showed less methylation than controls at the differentially methylated loci. Many possible explanations may exist for this pattern. However, consistent with a general function of DNA methylation that protects the integrity of the genome by inactivating DNA elements39,40, this pattern would be in agreement with the possibility that a portion of potentially damaging mutations is not properly silenced in MDD cases. Interestingly, the same pattern, with less methylation in cases than in controls, has previously been observed in CpG-SNP studies for psychosis using both blood and brain tissue31.

### Overlap between CpG-SNP MWAS and GWAS

Many of the genes implicated by both the MWAS and the GWASs are of critical importance for neuronal function. Some of the overlapping genes have previously been associated with psychiatric disorders. For example, ROBO2 (roundabout, axon guidance receptor, homolog 2) is critical for the maintenance of inhibitory synapses in the adult ventral tegmental area, a brain region important for the production of dopamine41, and has been implicated in schizophrenia42,43,44 and bipolar depression45. ASIC2 (acid-sensing, proton-gated, ion channel 2) plays a role in neurotransmission46. DCC (deleted in colorectal carcinoma, netrin 1 receptor) upregulation in prefrontal cortex pyramidal neurons causes vulnerability to stress-induced social avoidance and anhedonia in mice, and mutations in DCC have been associated with brain malformation47. Furthermore, DCC has been suggested to confer susceptibility to depression-like behaviors in mice and humans48 and was recently associated with mood instability, which has a strong genetic correlation to MDD49. In addition, the netrin 1 pathway, which involves DCC, has been identified as a candidate pathway for MDD50. Critically, ROBO2 and DCC interact in opposing fashion and have strong roles in directing axon pathfinding in developing neurons51,52. In summary, several of the genes detected in the MWAS-GWAS overlap serve critical biological functions of likely relevance to MDD etiology.

The overlap between the CpG-SNP MWAS and GWAS cannot be explained by the allele frequency differences between cases and controls that produce GWAS signals. It is true that methylation levels will be higher in the group with the higher frequency of the SNP allele that creates the CpG-SNP. However, these methylation differences are fully accounted for by the effect of the SNP as a “covariate” in the model we used for the CpG-SNP MWAS. Indeed, performing a GWAS with only the SNPs that were included in the CpG-SNP MWAS showed a lambda of 0.995 without any strong association signals (smallest P-value = 5.28 × 10⁻⁶). Thus, the CpG-SNP MWAS and GWAS provide additional and independent lines of evidence for the involvement of these loci in MDD.

## Conclusion

In the first CpG-SNP MWAS for MDD, we identified 27 suggestively significant sites. A significant number of these sites showed a negative CpG-SNP dose effect with less methylation in cases than controls. Furthermore, the MWAS results were over-represented among findings from three recent GWASs, which, for example, added support for the involvement of DCC in MDD. As the analysis approach prevents the methylation results from being driven by allele frequency differences between cases and controls, these results show that MWAS and GWAS provide additional and independent lines of evidence for the involvement of these loci in MDD. In conclusion, CpG-SNP methylation studies of MDD can contribute novel and biologically relevant information that complements previous findings detected by regular MWAS or GWAS alone.

### Availability of data and materials

Following local IRB approval, individual-level methylation data will be made available via dbGaP (submission in preparation).