Obligatory and facilitative allelic variation in the DNA methylome within common disease-associated loci

Integrating epigenetic data with genome-wide association study (GWAS) results can reveal disease mechanisms. The genome sequence itself also shapes the epigenome, with CpG density and transcription factor binding sites (TFBSs) strongly encoding the DNA methylome. Therefore, genetic polymorphism impacts on the observed epigenome. Furthermore, large genetic variants alter epigenetic signal dosage. Here, we identify DNA methylation variability between GWAS-SNP risk and non-risk haplotypes. In three subsets comprising 3128 MeDIP-seq peripheral-blood DNA methylomes, we find 7173 consistent and functionally enriched Differentially Methylated Regions. 36.8% can be attributed to common non-SNP genetic variants. CpG-SNPs, as well as facilitative TFBS-motifs, are also enriched. Highlighting their functional potential, CpG-SNPs strongly associate with allele-specific DNase-I hypersensitivity sites. Our results demonstrate strong DNA methylation allelic differences driven by obligatory or facilitative genetic effects, with potential direct or regional disease-related repercussions. These allelic variations require disentangling from pure tissue-specific modifications, may influence array studies, and imply underestimated population variability in current reference epigenomes.

T he hunt for disease-implicated genetic sequences is a major focus of medical research because it may reveal precise molecular insights into disease pathophysiology. Genomewide association studies (GWAS) have successfully identified thousands of common genetic loci associated with human diseases and phenotypes 1 . However, linkage disequilibrium (LD) between variants, incomplete coverage and gaps in their functional annotations make it difficult to establish a firm causal and functional mechanism between the statistically identified variants and the phenotypes they associate with. Integration with diseaserelevant and tissue-specific functional indicators or epigenetic marks within these regions, such as DNase I hypersensitivity sites (DHSs) 2 , histone modifications 3,4 and DNA methylation variation 5,6 , can highlight candidate active variants. This dissection of GWAS signals enables progress from associated SNP to mechanistic understanding 7,8 .
Epigenetic variation in relation to genome sequence falls into three main categories: 'pure' if under no genetic influence, 'facilitated' when genetic polymorphism enable variability and 'obligatory', if sequence variants directly predict the epigenetic state 9 . Thus, defining how epigenomes vary with respect to genetic influence and the mechanistic role this plays over phenotypic expression can greatly increase understanding of genetic regulation in heath and in disease. Genetic influences on the epigenome, such as enhancer variation, are strong enough to be observed in only 19 diverse ancestry individuals 10 .
Here we report an analysis of the relationship between allelic dosage of genetic risk SNPs that have previously been significantly and reliably associated with human phenotypic variation (via the NHGRI-EBI GWAS Catalogue 1 ) and DNA methylation within the LD block harbouring it. These GWAS regions are not only robustly associated with human traits but are also known to be enriched for functionality 7,11 . We used genome-wide methylation-dependent immunoprecipitation second-generation sequencing (MeDIP-seq) data derived from peripheral blood and high-quality SNP array genotype data from 3128 samples of European ancestry. Therefore, we broadly capture DNA methylation differences between risk and non-risk GWAS haplotypes, or haplotype-specific DNA methylation (HSM) variation 12 . The underlying strong extent of genetic influences on the epigenome is increasingly acknowledged 13 . CpG density is fundamental in defining the background DNA methylome, along with cell-specific usage of available binding sequences for transcription factors 14 . Common population variation in CpGs is significant via SNPs within this dinucleotide (CpG-SNPs) as well as larger variants. Recent base-resolution data has further supported that regional epigenetic effects exist within the DNA methylome 15 . We took advantage of the fact that this immunoprecipitation-derived signal is strongly influenced by the number of methylated cytosines in the DNA fragment 16 to capture these regional allelic variation effects, and can be considered more akin to ChIP-seq than base-resolution analyses. This leads to a distinct signal of population variation in both facilitated and obligatory genetic effects on the DNA methylome. Allelic variations in genetic dosage effects on this DNA methylation score are allelic signal differences not epigenetic variability.  However, these may as well contribute to regional or neighbouring allelic epigenetic variation, with effects on traits, or even bias other analytical measures, including bisulphite arrays and sequencing.
In this study we identify robust alleic variation in DNA methylation signal within disease-related loci. This is contributed to, but not fully accounted for, by non-SNP variants. We show that these demarcated regions are, in fact, functionally enriched. Furthermore, CpG-SNP variation is important in this, through not only its fascilitative influence on signal, but also its impact on allele-specific DHS. This survey increases our understanding of common epigenomic variation, the extent of inter-individual genetic variation's influence on the epigenome, and its potential relevance with regard to genetic common disease-related variation.

DNA methylation variation by GWAS SNP risk carrier status.
We investigated 8093 GWAS phenotype-associated results (p < 1 × 10 −7 ) listed in the NHGRI-EBI GWAS catalogue 1 . Due to some SNPs being associated with more than one trait, this amounted to 5474 unique SNP by GWAS LD block analyses. In addition, because of GWAS results co-locating in the same LD block, this reduced to a total of 2709 blocks, which cover 22.1% of the entire length of the genome.
The total peripheral blood DNA methylome samples were split into three separate data sets ( Fig. 1 and 'Methods'). The first was the discovery analysis set (1DISC) which included 895 unrelated individuals. The second, follow-up data set (2FOLL, N = 1343) included 1343 siblings and additional time-point (longitudinal) samples from the individuals of 1DISC. The replication data set (3REPL) comprised 890 additional individuals, unrelated to any of the previous two sets. To identify differentially methylated regions (DMRs) between risk and non-risk GWAS haplotypes, the haplotype-specific DNA methylation (HSM) peak analysis assessed the linear relationship between the allelic count of the GWAS SNP and DNA methylation scores in 500 bp overlapping windows across the LD block. This included adjustments for critical covariates (Methods). Only HSM peaks retaining significance beyond multiple testing correction (p < 1.85 × 10 −8 , linear model or mixed effect model) in all three data sets and with a consistent direction of effect are reported here.
Our analysis is similar in concept but methodologically different to previous DNA methylation quantitative trait loci (mQTL) analyses that identify via array data significant SNP associations with DNA methylation 13,17,18 . We focus only on those genetic changes already robustly associated with disease via GWAS and, additionally, the use of MeDIP-seq data enables the entire related LD block regions to be interrogated, not isolated and predetermined CpGs. To identify obligatory or facilitated genetic effects on DNA methylation that are in strong LD with the GWAS SNP, we apply our analysis within the recombinationdefined LD blocks, were these signals strongly reside (Supplementary Fig. 1). Whilst 'pure' epigenetic changes may exist beyond these limits, our focused approach dramatically increases our study power to find these genetically driven variations.
We identified a consensus set of 16,060 500 bp windows, which were then merged due to overlapping and adjacent locations into the final set of 7173 distinct HSM peaks (Supplementary Data 1). The overlap between significant peaks identified in 1DISC with 2FOLL was 88.1% and of these 82.0% overlapped with 3REPL. The HSM peaks in total span over~5.86 Mb with an average genomic length of~0.82 kb each ( Supplementary Fig. 2).
HSM peaks strongly overlap with genetic variants. We subsequently investigated the overlap of these significant windows with known common genetic variants: copy number variants (CNVs), insertions and deletions (Indels) and short tandem repeats (STRs). 36.8% of the HSM peaks overlap these non-SNP variants or combinations of them. As would be expected there is strong proportional enrichment for non-epigenetic dosage effects driven by these CNVs (green), indels (blue), STRs (orange) and regions overlapping multiple variant types (purple) compared to their fractions both across the genome and within the GWAS LD block regions (Fig. 2, Fisher's exact p < 2.2 × 10 −16 ). The total count for overlap with the major variant classes with their p-value distributions is shown in Supplementary Fig. 3. Thus, whilst largescale variants, such as CNVs, do have an increased impact within these regions, they do not account for the entire signal as 63.2% are apportioned to the 'Other' category.
To further illustrate these risk haplotype-specific DNA methylation effects we display eight examples of the 5474 GWAS SNP by DNA methylation HSM results in Fig. 3. The total set of graphs and results for all analysed GWAS SNPs are available for browsing and download at http://www.epigenome.soton.ac.uk/ hsm/hsm.php.
An extremely strong methylated signal difference due to dosage effects of a CNV is clearly seen in the known SNP-tagged 40 kb and 8 kb large CNV deletions near the obesity-locus NEGR1 19 (green, Fig. 3a). The CYP24A1 multiple sclerosis locus 20 shows an HSM peak overlapping a common indel (blue, Fig. 3c), the NTN4 breast cancer locus 21 with an STR (orange, Fig. 3d) and the GNG11 locus associated with heart rate 22 with a CNV (green, Fig. 3e). A partial STR overlap can also be seen with the MX2 melanoma locus 23 (orange, Fig. 3h). However, the SNP rs3802842 associated with colorectal cancer (Fig. 3b) 24 , as well as the ABCA7 Alzheimer's disease 25 (Fig. 3f) and the TNFAIP3 rheumatoid arthritis 26 (Fig. 3g) loci, all possess strong signals that are not attributed to known CNV, Indel or STR co-localisations. The peak window differences in DNA methylation according to genotype of the GWAS SNP for these regions are shown in Fig. 4.   68 ; STR (Orange) =short tandem repeats from Willems et al. 69 ; eSTR (Fawn)=expressionassociated short tandem repeats from Gymrek et al. 70 ; Indel (Blue) from TwinsUK data set; multiple variant overlaps (Purple) and other category (Red) will comprise common SNPs which include CpG-SNPs   The y-axis denotes the -log 10 (p value) of the differential DNA methylation according to the genotypic status for the index (GWAS associated) SNP and the x-axis is the genomic position (in base-pairs) along the LD block. The HSM peaks are coloured in accordance with the class of polymorphisms they overlap with, i.e. copy number variant (CNV, Green), short tandem repeats (STR, Orange), expression-associated short tandem repeats (e-STR, Fawn), insertion-deletions (Indels, Blue), multiple variant overlaps (Purple) and other category (Red). a NEGR1 obesity locus; b intergenic colorectal cancer; c CYP24A1 multiple sclerosis; d NTN4 breast cancer; e GNG11 heart rate; f ABCA7 Alzheimer's disease; g TNFAIP3 rheumatoid arthritis and h MX2 melanoma. Underneath is the UCSC browser track for location with RefSeq genes, GWAS catalogue SNPs, DNase-I hypersensitivity clusters and combined chromatin segmentation tracks. The segmentation tracks are in standard colours (Red: Promoter; Light Red: Promoter Flanking; Orange: Enhancer; Yellow: Weak Enhancer; Blue: CTCF element; Dark Green: Transcribed Region; Grey: Repressed/Low Activity) Furthermore, using a combined analysis for all 3128 samples together for DNA methylation versus only GWAS allelic count, without other covariates, clearly mimics the initial discovery set result, but with more power (Supplementary Fig. 4). This starkly displays how strong this over-riding genetic effect is on this analysis technique. The complete combined result had consistent overlap with the HSM peak set (7163 of the 7173; 99.9%). These strongly genetically associated DNA methylation changes may suggest potential pathophysiological mechanisms differentiating risk and non-risk GWAS haplotypes, if not already known, within these respective trait-related loci.
CpG and CpG-SNP density is increased in HSM peaks. HSM peaks possess a higher CpG density (12.8 CpG/kb, or 2.55% of sequence is CpG dinucleotides) than both the background genome (1.84%) and the GWAS LD block regions (2.15%) (Fig. 5a, HSM peaks versus GWAS LD block regions, OR = 1.19, p < 2.2 × 10 −16 ). However, they are predominately not within CpG-dense 'CpG island' regions (see later). There is also an increased number of SNPs within HSM peaks (6.60 SNPs/kb) compared to the GWAS SNP LD block regions (3.91 SNPs/kb, Fisher's exact p < 2.2 × 10 −16 ). Due to hypermutability of methylated cytosines, CpG-SNPs are a significant proportion of all SNPs and we calculated that 32.7% of common SNPs (MAF ≥ 1%) within the GWAS LD Block Regions are CpG-SNPs. This gives a background density of 0.26% CpG-SNP sequence in these regions, however, within HSM peaks they are found at more than twice this density (0.57%, OR = 2.25, Fisher's exact p < 2.2 × 10 −16 , Fig. 5b).
A subset of 4482 HSM peaks, do not overlap with known CNV, Indel or STR (included in the miscellaneous 'Other' category for the purposes of our work). These are therefore expected to contain numerous SNPs, including methylation signal influencing CpG-SNPs 27, 28 . In fact, it is more than double, with 83.5% of SNPs being CpG-SNPs within these HSM peaks compared to the background of 40.1% in the 'Other variant' category windows within the GWAS LD block (OR = 7.66, Fisher's exact p < 2.2 × 10 −16 ). 92.6% of 'Other' HSM peaks contain a CpG-SNP with an average of 2.26 CpG-SNPs within these CpG-SNP containing peaks, thus indicating, as expected, that clusters of CpG-SNPs are strong drivers of HSM signal outside other known variants ( Supplementary Fig. 5).
HSM peaks are enriched for chromatin segmentation enhancers. We examined the overlap of the HSM peaks with Chromatin segmentation data from ENCODE in six tissues 4 (Fig. 6). Due to the known functional enrichment of GWAS regions when compared to genome, these HSM peaks show very strong enrichment for all for these defined loci (Fisher's exact all p < 1 × 10 −10 , Fig. 6a, b, Supplementary Data 2).
We then compared these regions specifically against their proportions within the GWAS LD blocks only. This was in order to test that the HSM peaks are not depleted exceptions within the functionally enriched GWAS regions. However, even comparing within the GWAS LD block regions (Fig. 6c, d), they show a small but significant enrichment for all categories, except for transcription start sites (TSS) and Promoter loci. This includes enhancer (Fisher's exact p = 5.58 × 10 −3 ), weak enhancer (Fisher's exact p = 6.01 × 10 −4 ) and CTCF loci (Fisher's exact p = 2.58 × 10 −8 ) as well as the large genomic regions of transcribed and repressed regions (Fisher's exact both p < 2.2 × 10 −16 ). Generalised MeDIPseq coverage differences were compared via the proportion of zero coverage across the differing functional units and indicated no significant influence on our enrichment calculations (Supplementary Table 1).
Overall, 748 HSM peaks (~10.4%) overlap enhancer signal from this combined segmentation in at least one of these six tissues, which rises to 1089 HSM peaks (~15.2%) if weak enhancers are also included.
HSM peaks are not functionally depleted. The HSM peaks were then overlapped with known genetic functional sets and 11 12 22     (Fig. 7a). Then compared to the average GWAS LD block regions themselves, the HSM regions show significant depletion in 6 of the 16 categories, (Fig. 7b), such as CGI shores, exons and CTCF loci. However transcripts, DHSs, all repeat classes, variably methylated regions (VMRs 29 ), and dynamic regions 30 all remain significantly enriched (all Fisher's exact p < 1 × 10 −4 ). Thus, there is strong enrichment across both of the base-resolution data sets used to identify regions of high DNA methylation variability, the VMRs, (54 DNA methylomes across 21 cell/tissue types 29 ) and 'Dynamic' regions (24 developmental and primary cells 30 ). This implies that genetic variation between samples may in fact be a significant contributor to the identified regions of increased DNA methylation variability within both these studies.
CpG-SNPs enriched in allelic DHSs. DNase I hypersensitivity sites identify accessible regions of the genome and therefore act as broad functional indicators 31,32 . As above, we found increased DHSs within HSM peaks. To further investigate the enrichment of CpG-SNPs we also identified in the HSM peaks, we explored SNPs that influence allele-specific DHSs. SNPs altering transcription factor (TF) binding were identified in a study by Moyerbrailean  HSM peaks are not specifically blood tisssue enriched. To interrogate the tissue-specific nature of our HSM peaks we compared their overlap with the DHS ENCODE data sets produced from 125 different tissue types. Our DNA methylome data set was derived from peripheral blood, but due to strong genetic influence, these HSM peaks are not specifically enriched for the blood-derived subset (22 of the 125) of DHSs (Fig. 8). In fact, a majority, 99 of the 125 tissue types, show at least nominal significant increase within the HSM peaks compared to genome (79.2%, χ 2 p < 0.05), across a range of different tissues (Supplementary Data 4). This, therefore, supports the genetic and nontissue-specific systemic nature of the HSM peaks identified.
Additionally, we further explored for any evidence that our results are biased towards blood-related disorders using the 15 broad disease categories for GWAS associations classified by Maurano et al. 32 . We see no evidence for HSM peaks to reside within GWAS LD blocks of blood-specific or related diseases above any of the other categories, either by number of HSM peaks per GWAS LD block or size-corrected HSM peaks/kb (Supplementary Fig. 6).
Facilitative transcription factor binding site enrichment. We identified 47 enriched motifs using MEME-ChIP 36 within the HSM peak DNA sequences. These were compared against the JASPAR Vertebrates database for enriched transcription factor binding sites (TFBSs). Nineteen of these matched significantly with known sequences via the TOMTOM algorithm 37 (Supplementary Data 5). Notably, this included the motifs for NRF1, ZFP161 and MYCN (Supplementary Fig. 7). These were recently identified by Domcke et al. to be methylation-sensitive TFs 38 , with NRF1 also shown to significantly influence nearby gene expression. These HSM peaks genetically introduce or include motifs for TFs were methylation can facilitate variability in binding and downstream action. Therefore, they support the potential for HSM peaks to play, not only an obligatory, but also in some cases a facilitative epigenetic role.
HSM peaks are enriched for Alu, SVA and LTR repeat classes. This sequencing study as opposed to array analysis is able to obtain some preliminary indication of repetitive methylation variation. We further investigated the identified above enrichment for the SINE repeat class, which are predominately Alu repeats. This revealed that HSM peaks were enriched for the younger and more active AluY and AluS elements compared to the older AluJ element. This was compared to both the genome and GWAS LD block regions (OR 1.24 and 1.15, Fisher's exact p = 5.81 × 10 −12 and 7.82 × 10 −6 , respectively). These younger repeats still possess mobilisation ability 39 and are more likely to be significant contributors to the population variation and haplotypic differences through direct and regional positional effects. Additionally, albeit smaller numbers, the even younger hominidspecific and Alu-containing SVA (SINE-VNTR-Alu) transposable element shows a significant enrichment (OR 2.08 and 1.44, Fisher's exact p = 2.24 × 10 −6 and 1.26 × 10 −2 , versus genome and GWAS LD block regions, respectively, Supplementary Fig. 8).
Within the LTR repeat class, HERV-H and LTR12C categories also show substantial overlap with our HSM peaks in comparison to the GWAS LD block regions background (OR 4.38 and 2.71, Fisher's exact p < 2.2 × 10 −16 and p = 7.82 × 10 −11 , respectively, Supplementary Fig. 8). Both these endogenous retroviruses (ERVs) are shown to be important in human development and the LTR12C subfamily has shown substantial co-location with enhancer signal that was tissue nonspecific 40 . Enrichment for allele-specific CTCF in HSM peaks. We further explored allele-specific functional differences using an allelespecific CTCF (AS-CTCF) data set from Ding et al. 41 This study identified common human variants acting as quantitative trait loci (QTLs) that influenced binding in ChIP-seq CTCF data. They found 1837 AS-CTCF binding events across the genome and 26 of these overlap HSM peaks, an extreme enrichment compared by chance in comparison to the genome and even within GWAS LD block regions themselves (Fig. 9, OR = 7.12 and 5.98, both Fisher's exact p < 2.2 × 10 −16 , respectively). This result suggests that these HSM peaks may be contributing to haplotypic variation in CTCF binding, potentially mediating population variation in 3D chromatin topography 42 . The haplotype allele-specific methylation (hap-ASM) results of Do et al. also identified a role for CTCF variability 6 . Furthermore, CTCF is known to commonly bind constitutively across many tissue types including a highly similar binding spectrum across all three germ layers in development 43 , which is consistent with the constitutive nature of the HSM peaks. Additional gene set enrichment analyses and multiple disease block identification are detailed in Supplementary Notes 1 and 2, with results in Supplementary Data 6 and 7, respectively.
Allelic effects on DNA methylation array analysis. Excluding known common SNP effects on Illumina DNA methylation array probes is part of the standard technical quality control for epigenome-wide association study (EWAS) analysis 44 . This is generally performed with arbitrary thresholds for common SNPs located at the interrogated CpG and within 10 bp or up to 50 bp of the probe sequence, although more nuanced approaches are being explored 45 . No allelic information is available from Illumina DNA methylation array results as methylation values are an average between both alleles. To investigate the potential genetic effects that may or may not be captured, we performed a similar analysis for possible allelic effects in GWAS LD blocks with a linear mixed model analysis in 811 Illumina 450k array peripheral blood-derived samples (88.9% overlap with MeDIP samples). The 450k array contain 169,151 probes (34.8%) within the GWAS LD block regions. After exclusion of multi-mapping probes, this reduced to 157,473 CpGs. Of these, 22,296 were nominally significant (p < 0.05) and 4192 were significant to a genome-wide Bonferroni level (p < 1 × 10 −7 , linear mixed model) for GWAS risk haplotype-capturing SNP allelic count. When common SNPs to 10 bp or 50 bp are excluded, these reduce to 21,091 and 16,878 at nominal, and 3890 and 2904 CpGs, at Bonferroni significance, respectively (Fig. 10a). Permutation analysis by random shuffling of observed genotype (allelic count) clearly displays the difference between observed and random results (Fig. 10b). 100× permutation strongly supported these findings as it identified only an average~4524 (range 4290-4696) cytosines at nominal and 0.32 (range 0-2) at Bonferroni significance, respectively (empirical p < 0.01). These findings indicate the importance of mQTL and additional detailed genetic interrogation of array results.

Discussion
We identified 7173 haplotype-specific methylation (HSM) peaks by investigating DNA methylation data in conjunction with robust GWAS SNPs and LD information. This analysis was performed in currently the largest available sequencing-based genome-wide DNA methylome data set. The identified loci are enriched for functional regions, such as enhancers, DHSs and allele-specific CTCF, illustrating their potential to play a pathogenic role.
This HSM analysis powerfully benefits from the combination of three main factors. Firstly, it focuses the analysis on functionally enriched 7, 11 and robustly disease-associated GWAS LD block regions. Secondly, the MeDIP-seq data enables DNA methylation interrogation across the entire GWAS LD regions, not just predetermined CpGs. Array data does not provide a representative survey of population-level DNA methylome variation, its underlying genetic architecture 46 , and lacks the density of coverage required for powerful DMR calling 47 . When DMRs can be identified they are found to be strongly enriched for functional elements 30 . Thirdly, is the considerable reduction in the number of tests, by reducing these components to DNA methylation 500 bp windows and a single haplotype-tagging GWAS SNP within each LD block. Individual CpG by SNP analyses require a punitive multiple testing correction and our approach reduces this burden.
Our LD aware approach differs from many recent excellent studies exploring genetic influences on the DNA methylome due to its direct focus on known GWAS SNPs 6,18,35,48 . Furthermore, whilst identifying 'pure' epigenetic changes by removing genetic effects has been the focus of some studies 6 , facilitative mechanisms are influential 35 . McClay et al. recently reported that the majority (75%) of cis-mQTL were due to collocating CpG-SNPs 35 , but only within GWAS regions were these functionally enriched. The individual genetic gain or loss of a CpG can impact upon dependent TF binding activity 38,49 . It can also led to DNA conformation changes affecting protein-DNA interaction strength, including an order of magnitude increase in adjacent DHS 50 . We identified a strong enrichment of CpG-SNPs within HSM peak regions as well as within allele-specific DHS sites. Previous work from Jaffe et al. 51 has also investigated GWAS loci with respect to DNA methylation from array data and identified a strong enrichment for mQTL within these loci. Furthermore, many of these mQTLs were not cell-type specific and involved critical disrupting CpG-SNPs, in similar fashion to our HSM data set. That analysis also proposed that many GWAS variants and highly correlated proxy SNPs influence nearby DNA methylation and in this way impact upon disease risk. An analysis in lung Fisher's exact test, NS non-significant, *p < 0.05, **p < 0.01, ***p < 0.001 tissue by Shi et al. equally found that cis-mQTL reside at CpG-SNPs distal to both genes or CpG-dense promoter regions 52 , and that mQTLs themselves are also enriched for CTCF loci, DHS and chromatin modifications. In our analysis a strong enrichment particularly for allele-specific CTCF was also seen, potentially indicating the ability of these loci to contribute to population variation in 3D structure 42 .
On average~2.65 HSM peaks were identified per GWAS LD blocks regions (59.0%≥1) and these will enable further hypotheses to be proposed in these disease susceptibility regions. The strong genetic influences on the DNA methylome leads to many more significant results than are seen for other potentially 'pure' epigenetic changes. For example, only 71 Bonferroni significant ageing-specific DMRs were identified in this same data set (including an LD correction for genetic effects) 53 . However, it needs to be clearly noted that methylation is the default state for the majority of CpGs throughout the genome. Those signals driven by variation in CpG number between risk and non-risk haplotypes can not be directly interpreted as functional, as is the case for tissue-specific epigenomic data, such as DHSs 32 . Yet the HSM peaks identified are enriched for functional indicators such as enhancer signal as well as motifs for DNA methylation sensitive TFs. Similarly to the hap-ASM results of Do et al. 6 they are also located outside CpG island promoter regions. Some of these peaks may represent population variation in distal regulatory regions, such as low methylation regions (LMRs) 14 . These LMR loci require a genetic infrastructure of a slightly above baseline CpG level and co-locate with transcription factor binding and   Global diversity in large deletions, duplication and CNVs have had a significant influence on shaping the genome of human populations through selection, mutation and demography 54 . The functional potential of these large-scale variants is significant and these regions, as would be expected due to modulation of available CpGs, influence signal and strongly overlap with HSM peaks. Additionally, STRs contribute to HSM peaks and these variants have shown enhancer activity 55 as well as the ability to influence expression 56 . Furthermore, DNA methylation within transposable elements can influence nearby gene expression 57 . Our data indicated enrichment for functionally implicated LTR repeat elements, HERV-H and LTR12C 40 . As well we see enrichment for younger primate-specific Alus and hominid-specific SVA repeats, which have known germline insertion mobility and population variability 58 . Thus, some of the HSM peaks represent their direct or adjacent positional effects on the regional epigenome 58 .
GWAS results are capturing haplotypes and a single variant may not be the only causal element. Recent analysis has proposed that the fraction of causal variation tagged by common SNPs is higher than previously calculated 59 . Therefore, there may be multiple genetic variants impacting to varying degrees, with some acting via this observed DNA methylation variability. Whilst functional overlap with HSM peaks was identified, this may itself be an underestimate, as the data sets for comparison themselves, such as chromatin segmentation, have only been performed in small numbers 46 . Population epigenomic variation will exist beyond that in current reference epigenomes. Our evidence of genetic effects in both the VMR and 'Dynamic' methylome data further imply that genetic polymorphism drives some of these findings assumed to be purely epigenetic. Additionally, strong HSM effects were previously identified in the FTO GWAS locus 12 . This located a peak within an enhancer region, which influences IRX3 expression 60 , that highlighted a SNP (rs7202116) subsequently found in a large meta-analysis to influence trait variability 61 . Thus, the potential of all 7173 HSM peaks is worthy of exploration and make excellent candidates for further functional analyses.
These extremely strong HSM signals clearly point to the caution and extra exploration required in the interpretation of population or non-inbred MeDIP as well as other DNA methylome analyses. Adjusting for known genetic influences impacts greatly on EWAS array analysis for common disease and phenotypes 47 . It is notable the pausity of significant directional disease-associated findings in robustly controlled experiments that have reduced genetic and cell-type heterogeneity issues by the use of isolated cell types in monozygotic twin discordant models 62 . Therefore, researchers need to be as aware as possible of the potential genetic confounding and/or interaction, both directly or due to regional or neighbouring effects. Our data set adds to the available genetic effects to be integrated in these analyses (Supplementary Data 1). Researchers should also interrogate results for additional evidence of strong genetic effects, such as ʻgap hunting' in DNA methylation data that may indicate the strongest of these genetic influences 63 , as well as populationspecific data availability of CpG-SNPs and larger genetic variants. A focused examination for mQTL SNPs influencing DNA methylation variability through haplotypic effects may be made more powerful by reducing tests through population-specific LD information. A tagging analysis may reveal unknown regional or haplotypic effects on the observed DNA methylation. We iden-tified~10.0% of the 450k CpG probes within these GWAS regions may be influenced (p < 0.05) by their lead SNP captured haplotype background, even after excluding probes with common SNP co-location. Identified effects require exploration for evidence of obligatory, facilitative, or dosage factors, as previously observed in array analysis 64 . Do et al. also identified haplotypic effects on a DNA methylation array analysis of Alzheimer's disease being driven by the nearby genetic susceptibility factor at HLA-DR* 6 . Furthermore, of the 278,873 probes on the new Illumina EPIC (850k) within these GWAS regions, 1815 directly overlap an identified HSM peak, and this further rises to 6671 probes including those within a 1 kb flanking region. Technological advances, including direct assessment of DNA modifications in long reads to enable more robust genetic and epigenetic haplotypic assessment will obviously improve our knowledge of these interactions and their interplay in disease risk.
We identified functionally enriched DNA methylome variation between risk and non-risk GWAS haplotypes. This robust set of HSM peaks propose potential new mechanisms to combine with tissue-specific data to further understand these diseases. Thus, this integrative analysis is a step in the process of defining population variation in the facilitative and obligatory DNA methylome. Complete allelic integration of both genetic and epigenetic variability will be required to fully understand these human disease-associated regions. Future advances with direct epigenetic detection and longer reads, will help define more precisely this variability. These HSM peaks begin this process by giving strongly demarcated loci across these GWAS regions, for further exploration and integration in human disease susceptibility understanding. The initial Discovery analysis set (1DISC) comprised of 895 DNA methylomes of 895 unique individuals. These included only a single sample from each family i.e. non-related individuals and only female samples to exclude sex-specific variation in DNA methylation. The second Follow-Up data set (2FOLL) comprised of 1343 methylomes and included siblings and some longitudinal data set from Set 1. The third Replication data set (3REPL) comprised 890 DNA methylomes of non-related individuals to the first two sets. Our discovery set is well powered to detect moderate sized effects (Cohen's f 2 = 0.15) 65 as we estimate these can be identified with 95% power in~500 samples. Furthermore, we possess the added benefit in this analysis of the two additional data sets for conformation and replication.

Methods
MeDIP-seq. The methods and analysis pipeline made use of Methylation Dependent Immunoprecipitation Illumina second-generation sequencing (MeDIPseq). BGI-Shenzhen (Shenzhen, China) performed DNA sample preparation, followed by Sonication with a Covaris system (Woburn, MA, USA) that fragmented whole peripheral blood DNA. Then the MeDIP reaction was performed and 5 ug of fragmented genomic DNA was used for sequencing library preparation. Illumina Single-End DNA Sample Prep kit was employed and end repair, < A > base addition and adaptor ligation steps were performed. Adaptor-ligated DNA was incubated with an antibody for 5-methylcytosine (5mC) Cat. No.: CO2010021 mcmagme-048 from Diagenode (Liège, Belgium). The protocol for the MagMeDIP kit (mc-magme-048) was followed: combining 0.5 μl antibody + 0.5 μl water; then add 0.6 μl MagBuffer A, 1.4 μl water, 2 μl MagBuffer C, therefore resulting in a final volume of 5 μl for the IP reaction. Immunocapture was performed with magnetic bead conjugation to capture the enriched DNA fraction. Quantitative PCR validated this resultant MeDIP, which was then purified with Zymo DNA Clean & Concentrator-5 (Zymo Research), and amplified with adaptor-mediated PCR. Gel excision for size selection of fragments (200-500 bp) was performed and these were quality assessed by Agilent BioAnalyzer (Agilent Technologies, Santa Clara, CA, USA). The libraries were subjected to highly parallel 50-bp single-end sequencing on the Illumina HiSeq2000 platform. FASTQC (v0.10.0) assessed initial base composition QC successfully. MeDIP-seq data was aligned with BWA (>mapping quality score of Q10), and duplicates were removed. The average high-quality BWA aligned reads was~16.9 million per sample. Quality control with SAMTools and FastQC and for MeDIP-specific analysis MEDIPS(v1.0) 66 was used to produce reads per million (RPM). These data were produced as BED files of genomic windows (500 bp, 250 bp slide). Additional quality checks were also employed with Principle Components Analysis and correlation matrix, hierarchical clustering, dendogram, heatmap, and density plots. All analysis and co-ordinates cited are for genome build Homo sapiens hg19/GRCh37. GWAS linkage disequilibrium blocks. Linkage disequilibrium blocks of the SNPs in the NHGRI-EBI GWAS catalogue 1, 11 were ascertained from the GRCh37 genetic map, downloaded from Center of Statistic Genetics, University of Michigan, Locuszoom 1.3 67 . Recombination rate 10 cM/Mb block boundaries were used. The NHGRI-EBI GWAS catalogue as at December 2014 provided the 8093 GWAS SNPs with p value<1 × 10 −7 deposited to that time point. These are in fact 5522 unique individual SNPs due to co-associations for the same SNP. Within the above-identified LD blocks 5474 of these SNPs were positioned and due to SNPs co-locating in the same block these represented a total of 2709 blocks, covering 22.1% of the genome.
Risk haplotype-specific methylation analysis. DNA methylation within the LD blocks of the GWAS catalogue association SNPs was compared by linear regression with respect to allelic count of the haplotype-tagging SNP, termed HSM analysis 12 . This identifies broad DNA methylation differences between risk and non-risk GWAS haplotypes. With MeDIP there is a direct relationship between the number of methylated cytosines in the DNA fragment and the amount of DNA captured by the antibody 16 . Therefore, genetic gains or losses of CpGs will influence the enrichment of fragments strongly. Consequently, this analysis results in a distinct signal due to population variation in both facilitated and obligatory genetic effects on the DNA methylome. These regions may have direct or regional impacts on further surrounding CpG methylation states with the potential for significant functional effects in these GWAS regions. DNA methylation was scored within 500 bp windows with 250 bp overlap by normalised RPM for each window. In the discovery set (1DISC) a linear model was fitted including chronological age at blood extraction for DNA sample, blood cell subtypes (lymphocytes, monocyte, neutrophil and eosinophil), smoking status and batch. In the Follow-Up (2FOLL), a linear mixed effect model was used for allelic count to DNA methylation with the fixed effects of 1DISC with additional family and zygosity as random effects. The replication (3REPL) set analysis was the same as for 2FOLL, but included sex and excluded blood and smoking covariate information. The described two linear effects models were compared with null models that excluded allelic count by the ANOVA function by likelihood ratio test for calculation of p-values. The lme4 R package was used to perform the linear mixed effect analysis of the relationship between allelic count of the haplotype-tagging SNP and normalised DNA methylation assayed by MeDIP-seq.
To correct for multiple testing, a strict Bonferroni cut-off was calculated by the total number of DNA methylome windows tested in the analysis, 2,708,462. Thus, a p-value significance level of <1.85 × 10 −8 . The mean p-value was calculated for each window for GWAS LD block regions containing greater than one GWAS SNP. ENCODE poor mappability blacklist regions 31 were subsequently removed from any further interpretation (13,726 windows removed). To identify a robust set of HSM peaks we determined those windows that passed the Bonferroni threshold in all three (1DISC, 2FOLL and 3REPL) subsets. The R (3.0.0) environment was used for all analysis, with graphing via the ggplot2 package with results and code available at http://www.epigenome.soton.ac.uk/hsm/hsm.php.
Variants within GWAS LD block regions and HSM peaks. Common genetic variants that overlapped locations within the GWAS LD block regions were defined. This included copy number variants (CNVs), insertions and deletions (Indels) and short tandem repeats (STRs). As above the known Blacklist regions were overlapped (13,726 windows,~0.5%) and then removed before subsequent enrichment analyses below. Common CNV data was ascertained from the Stringent set of the copy number variation map of the human genome in the Database of genomic variants from Zarrei et al. which includes CNV sized 50 bp to 3 Mb 68 . Indel data was obtained from the TwinsUK data set with 1000 Genomes imputation of MAF>0.05. Short tandem repeats or Microsatellite data was obtained from the landscape of human STR variation from Willems et al. of 689,512 STRs 69 . Furthermore, a subset of these STRs with recent evidence of effects on gene expression, 2060 expression STRs (eSTRs), was also investigated for overlap 70 . Large study numbers reduced substantially the potential influence of rare or private CNVs, indels and STRs on this analysis.
CpG-SNP identification. The dbSNP build 142 common SNP data set was downloaded including SNP alleles and surrounding base sequence. This represents 12,449,124 common SNPs found in ≥1% of samples within autosomes. These were then interrogated for those that were CpG-SNPs, i.e. where the allelic variation created or abrogated a CpG dinucleotide. 3,873,489 (~31.1%) of these SNPs were determined to be CpG-SNPs.
Enrichment analysis. We used Epiexplorer for our first examination of the HSM peaks 71 for assessment of enrichment for chromatin state (ChromHMM), histone modifications and TFBSs. We downloaded the additional Combined segmentation data for 6 tissue types (Gm12878; H1hesc; Helas3; Hepg2; Huvec; K562) from UCSC. Additional functional enrichments were also downloaded from UCSC data on CpG islands, ENCODE DHS in 125 cell types 2 , Vertebrate Multiz Alignment and Conservation (100 Species) from 100Vert_El_phastConsElement100way bedfile (~10.1 m regions), and TFBSs from ENCODE v3 (690 data sets from wgEn-codeRegTfbsClusteredV3 31 ). 'Dynamic' DNA methylation regions were taken from Ziller et al. 30 and eRNA validated FANTOM5 enhancers regions from Anderson et al. 72 . BEDTools (v 2.17.0) 73 IntersectBed with the f -0.1 parameter tested overlap between 500 bp non-overlapping windows of these elements and within the GWAS LD block regions.
Genomic regions enrichment of annotations tool (GREAT v3.0.0) was used for genomic-space aware Gene and Functional pathway enrichments 74 . The binomial analysis with default setting for basal, and extension parameters (constitutive 5.0 kb upstream, 1.0 kb downstream and up to 1 Mb max extension) was employed. Curated regulatory domains were also included. Two background sets were used for comparison, firstly the entire genome and secondly the GWAS LD block regions (Supplementary Note 1).
Transcription factor binding site motif enrichment was performed with the JASPAR core 2014 vertebrates database in the MEME suit (MEME-ChIP 36 ) with TOMTOM 37 (v4.10.2) using FASTA sequence files for the HSM peaks. MEME-Chip analysis compared with a set of 1434 DNA motifs, between 5 and 30 in length (average length 13.6), from the database Vertebrates (in vivo and in silico).
Tissue-specific and multi-tissue disease investigation. DNase I hypersensitivity sites data in 125 cell types, including 22 blood cell data sets, from the ENCODE analysis of Thurman et al. 2 were downloaded to compare any potential tissue enrichment of the HSM peaks. The total basepair overlap with DHS sites was compared between the HSM peaks and the entire genome. To investigate whether these peripheral blood-derived HSM peaks were enriched for blood-related disease categories, the broad disease categories as defined in Maurano et al. were used which had classified 5655 SNP-trait associations 32 . These 15 classes are: aging; autoimmune disease; cancer; cardiovascular; diabetes; drug metabolism; haematological parameters; kidney, lung and liver; miscellaneous; neurological and behavioural; parasitic and bacterial disease; quantitative traits; radiographic parameters; serum metabolites and viral disease. This data set was also used to identify multi-tissue associations (Supplementary Note 2).
DNA methylation array analysis. A data set of 811 females (two batches: 388 and 423) were analysed by the Infinium Human Methylation450 BeadChip from bisulphite-converted DNA derived from peripheral blood. 88.9% of these samples also overlap the MeDIP samples. The quality control steps comprised the removal of probes that reside on the X or Y chromosomes (n = 11,650), where the 50 bp sequence aligned to multiple locations in the genome (n = 17,764), or failed detection in ≥1 sample and with a bead count <3 in >5% of the samples. This resulted in a data set of 450,077 probes. Further QC involved inspection for outliers using boxplots for mean and median DNA methylation across all CpG sites, β density plots, and heatmaps. The proportion of blood cell subtypes was deconvoluted for CD8+ T cells, CD4+ T cells, B cells, Natural Killer cells, granulocytes and monocytes 75 . To correct for probe type bias all data was normalised via BMIQ 76 . Probes that resided within the GWAS LD Block Regions (169,151) were assessed for any significant GWAS SNP associated differentially methylated positions. This was performed using a linear mixed effects model fitted on standardised β values per probe (N(0,1)), with genotype as allelic count, age, smoking status, beadchip, position on the beadchip, granulocytes, monocytes, and CD8+ T cells as fixed effects, as well as family and zygosity as random effects. To assess for significance, ANOVA was used to compare this model to a null model without allelic count. Permutation was performed in R by random shuffling of genotype assignment of individuals whilst retaining all other variables constant.
Allele-specific data sets. We accessed the data on allele-specific CTCF from Ding et al. 41 . Allele-specific DNase-I Hypersensitivity SNPs, also termed 'Switch-SNPs', influence TF binding and were downloaded from Moyerbrailean et al. 33 .