Convergence of evidence from a methylome-wide CpG-SNP association study and GWAS of major depressive disorder

DNA methylation is an epigenetic modification that provides stability and diversity to the cellular phenotype. It is influenced by both genetic sequence variation and environmental factors, and can therefore potentially account for variation of heritable phenotypes and disorders. Therefore, methylome-wide association studies (MWAS) are promising complements to genome-wide association studies (GWAS) of genetic variants. Of particular interest are methylation sites (CpGs) that are created or destroyed by the alleles of single-nucleotide polymorphisms (SNPs), as these so-called CpG-SNPs may show variation in methylation levels on top of what can be explained by the sequence variation. Using sequencing-based data from 1132 major depressive disorder (MDD) cases and controls, we performed a MWAS of 970,414 common CpG-SNPs. The analysis identified 27 suggestively significant (P < 1.00 × 10−5) CpG-SNPs associations. Furthermore, the MWAS results were over-represented (odds ratios ranging 1.36–5.00; P ranging 4.9 × 10−3–8.1 × 10−2) among findings from three recent GWAS for MDD-related phenotypes. Overlapping loci included, e.g., ROBO2, ASIC2, and DCC. As the CpG-SNP analysis accounts for the number of alleles that creates CpGs, the methylation differences could not be explained by differences in allele frequencies. Thus, the results show that the MWAS and GWASs provide independent lines of evidence for the involvement of these loci in MDD. In conclusion, our methylation study of MDD contributes novel information about loci of relevance that complements previous findings and generates new hypothesis about MDD etiology, such as that the functional effects of genetic association may be partly mediated and/or enhanced by the methylation status in these loci.


Supplementary materials to:
Convergence of evidence from a methylome-wide CpG-SNP association study and GWAS of major depressive disorder

Genotype information
The NESDA participants were genotyped as previously described. 1 In short, the majority (95.2%) of DNA samples from the NESDA study were genotyped on Affymetrix 6.0 Human SNP array, while the remaining samples were genotyped on Perlegen-Affymetrix 5.0 array. In the quality control (QC) process, samples were excluded based on the following criteria: Affymetrix contrast QC < 0.4; missing rate > 10%; excess genome-wide heterozygosity or inbreeding levels (F < -0.075 or > 0.075); genotypes inconsistencies with reported gender; mendelian error rate > 5 standard deviations (SDs) from the mean of all samples; non-European/non-Dutch ancestry as indicated by principal component analysis.
SNPs were excluded for the following reasons: probes mapped badly against NCBI Build 37/UCSC hg19; minor allele frequency (MAF) < 0.005; missing rate > 5%; deviation from Hardy-Weinberg equilibrium (HWE) p< 1e-12; SNPs present in both arrays were cross-imputed using GONL reference panel 2 . After imputation SNPs were converted to best guess genotypes using Plink 1.90 3 and were removed if meeting the following more stringent criteria: a significant association with a single genotyping platform as compared to the other (p < 10 -5 ); an allele frequency difference > 10% with the GONL reference set; HWE p < 10 -5 , Mendelian error rate > 5SDs (N>40); imputation quality R 2 < 0.90.
The resulting data were then imputed using 1000G Phase 3 all ancestries reference panel via the Michigan Imputation Server 4 . Among the imputed SNPs, those retained for the present analyses met the following criteria: allele frequency difference < 5SDs of the mean of all SNPs with the reference set; HWE p > 10 -5 , Mendelian error rate < 5 SDs; MAF >0.01; R 2 >0.5.

Quality control and CpG score calculation of methylation data
As recently explained (Aberg et al submitted), we performed thorough quality control of reads, samples, and CpGs 5 . Of the 1,200 selected NESDA samples, 34 were excluded because the methylation enrichment (N=16) or library construction (N=18) failed. Reads aligning to loci without CpGs (non-CpGs) represent "noise" caused by, for example, alignment errors or imperfect enrichment leading to non-methylated fragments being sequenced. The average non-CpG to CpG coverage ratio was 0.012 (SD=0.025). Using a threshold of 0.05 to remove samples with high "noise" levels (N=10), left an average non-CpG to CpG coverage ratio of 0.010 (SD=0.005) in the remaining samples. For 10 samples, sequence variants called from the methylation data did not match the genotype information obtained from a previous GWAS of these samples 6 . This indicated that a sample swap or sample contamination may have occurred. As it was not possible to determine whether the problem occurred in the GWAS or MWAS data, we conservatively excluded all 10 samples from further analysis.
Finally, we used the R function 'pcout' in the 'mvoutliers' package (with the upper boundary for outlier detection set to 15, the scaling constant set to 0.5, and the boundary for final outliers set to 0.2) to identify multidimensional outliers using the first 15 principal components of the methylation data as input. Fourteen samples were identified as multidimensional outliers and omitted. This left a sample of 1,132 subjects for statistical analysis.
The mean number of reads for the 1,132 remaining samples was 59.4 million (SD=11.2 million) of which, on average, 99.1% aligned. Aligned reads were subjected to further quality control. Although reads often map to multiple genomic locations, in most cases, a single alignment can be selected because it is clearly better than other alignments. In the case of multi-reads, multiple alignments receive equally good alignment scores. When Bowtie2 encounters multi-reads, it uses a pseudo-random number generator to select a single primary alignment. Duplicate-reads are reads that start at the same nucleotide positions. When sequencing a whole genome, duplicate-reads typically arise from artifacts in template preparation or amplification. However, in the context of sequencing an enriched genomic fraction, duplicate-reads are increasingly likely to occur because reads originate from a smaller fraction of the genome. Therefore, only when more than 3 (duplicate) reads start at the same position, we reset the read count to 1 implicitly assuming these reads are tagging a single clonal fragment. This left an average of 48.7 million reads per sample (=81.9% of all reads).
To identify all common CpGs, we combined reference genome sequence (hg19/GRCh37) with SNP information from the European super-population on the 1000 Genomes project (Phase 3). To avoid analyzing sites that are CpGs in only a very small proportion of subjects, we excluded CpGs created/destroyed by SNPs that had a minor allele frequency <1%. This resulted in 27,916,990 CpGs.
Additionally, CpGs in loci prone to alignment errors, e.g., in repetitive regions, were eliminated prior to the analysis. To identify these CpGs, we used RaMWAS to perform an in silico alignment experiment outlined elsewhere that aligns all possible reads to the reference 5 . The vast majority of CpGs (89.3%) were located in regions that showed perfect alignment coverage. Only 1.3% of the CpGs showed evidence of alignment problems (defined as 15% or more reads from this locus not aligning properly) and were removed from further analyses. Finally, akin to filtering SNPs with low minor allele frequency, we eliminated 5,682,206 CpGs with average coverage less than 0.3. These sites may create false positive MWAS findings due to low power or statistical problems associated with analyzing sparse data. After all quality control, 21,869,561 CpGs remained.
Among these, 970,414 were CpGs that can be created or destroyed by common SNPs available for methylome-wide association study (MWAS) of CpG-SNPs.
Methylation scores were calculated by estimating the number of fragments covering the CpG using a non-parametric estimate of the fragment size distribution 7 . These scores provide a relative measure of the amount of methylation for each individual at that specific site.

Figure S1. Regression plots of the 27 suggestively significant CpG-SNP MWAS associations.
The methylation coverage (y-axis) is plotted for cases (red circles) and controls (blue squares) against the CpG count (x-axis). The fitted regression lines are indicated for cases (red) and controls (blue).

Figure S1 -continued. Regression plots of the 27 suggestively significant CpG-SNP MWAS associations.
The methylation coverage (y-axis) is plotted for cases (red circles) and controls (blue squares) against the CpG count (x-axis). The fitted regression lines are indicated for cases (red) and controls (blue).

Figure S1 -continued. Regression plots of the 27 suggestively significant CpG-SNP MWAS associations.
The methylation coverage (y-axis) is plotted for cases (red circles) and controls (blue squares) against the CpG count (x-axis). The fitted regression lines are indicated for cases (red) and controls (blue).