Statistical and integrative system-level analysis of DNA methylation data


Epigenetics plays a key role in cellular development and function. Alterations to the epigenome are thought to capture and mediate the effects of genetic and environmental risk factors on complex disease. Currently, DNA methylation is the only epigenetic mark that can be measured reliably and genome-wide in large numbers of samples. This Review discusses some of the key statistical challenges and algorithms associated with drawing inferences from DNA methylation data, including cell-type heterogeneity, feature selection, reverse causation and system-level analyses that require integration with other data types such as gene expression, genotype, transcription factor binding and other epigenetic information.



  213. 213.

    et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 10, e1004383 (2014).

  214. 214.

    et al. Detection and interpretation of shared genetic influences on 42 human traits. Nat. Genet. 48, 709–717 (2016).

  215. 215.

    et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 48, 481–487 (2016).


Author information


  1. Department of Women's Cancer, University College London, 74 Huntley Street, London WC1E 6AU, UK.

    • Andrew E. Teschendorff
  2. UCL Cancer Institute, University College London, 72 Huntley Street, London WC1E 6BT, UK.

    • Andrew E. Teschendorff
  3. Chinese Academy of Sciences (CAS) Key Laboratory of Computational Biology, CAS–Max Planck Gesellschaft (MPG) Partner Institute for Computational Biology, 320 Yue Yang Road, Shanghai 200031, China.

    • Andrew E. Teschendorff
  4. Medical Research Council Integrative Epidemiology Unit (MRC IEU), School of Social & Community Medicine, University of Bristol, Oakfield House, Oakfield Grove, Bristol BS8 2BN, UK.

    • Caroline L. Relton



Both authors contributed to all aspects of manuscript researching, discussion, writing and editing.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Andrew E. Teschendorff.


Bisulfite conversion

A technique in which DNA is treated with bisulfite, which converts unmethylated cytosines into uracils (read as thymines after amplification), whereas methylated cytosines are protected from conversion.

Epigenome-wide association studies

(EWAS). Studies that seek associations between DNA methylation at many sites across the genome and an exposure, trait or disease of interest.

Intra-sample normalization

The procedure of adjusting the raw data profile of a biological sample for technical biases and artefacts. This is often followed by inter-sample normalization, in which adjustments are made to the data for technical and biological factors that otherwise cause unwanted (and often confounding) data variation across samples.


Confounding

When the relationship between an exposure and an outcome is not causal but is due to the effects of a third variable (the confounder) on both the exposure and the outcome. White blood cell heterogeneity can act as a confounder in many epigenetic studies.

Feature selection

The statistical procedure of identifying features which, in some broad sense, correlate with an exposure or phenotype of interest (POI).

Differentially methylated cytosines

(DMCs). Cytosines (usually in a CpG context) that exhibit a statistically significant difference in DNA methylation between two groups of samples, according to some statistical test.

Condition number

In the context of reference-based cell-type deconvolution, the condition number of a reference matrix represents an index of the numerical stability of the inference. Formally, it measures the sensitivity of the regression parameters (also known as cell weights) to small perturbations or errors in the reference matrix.
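The idea of a condition number can be illustrated with a minimal sketch for the special case of a reference matrix with only two cell types, where the eigenvalues of the 2 × 2 Gram matrix have a closed form. The function name and the toy reference values below are hypothetical and are not taken from any published reference panel.

```python
import math

def condition_number_two_cell_types(reference):
    """2-norm condition number of a (CpGs x 2) reference matrix.

    Uses cond(R) = sqrt(lambda_max / lambda_min) of the Gram matrix R^T R,
    whose eigenvalues are available in closed form for the 2 x 2 case.
    """
    a = sum(row[0] * row[0] for row in reference)
    b = sum(row[0] * row[1] for row in reference)
    c = sum(row[1] * row[1] for row in reference)
    mean = (a + c) / 2.0
    half_gap = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    lam_max = mean + half_gap
    lam_min = mean - half_gap
    if lam_min <= 0.0:
        return math.inf  # columns are (numerically) collinear
    return math.sqrt(lam_max / lam_min)

# Hypothetical reference profiles (rows: CpGs, columns: cell types).
distinct_reference = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]
similar_reference = [[0.9, 0.8], [0.1, 0.2], [0.8, 0.7]]

cond_distinct = condition_number_two_cell_types(distinct_reference)
cond_similar = condition_number_two_cell_types(similar_reference)
```

In this toy setting, the reference with nearly identical cell-type profiles yields a much larger condition number, which is the situation in which estimated cell weights become unstable.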

Constrained projection

(CP). Also known as quadratic programming (QP). A widely used technique for performing multivariate linear regression with constraints (such as non-negativity and normalization) imposed on the regression coefficients. In the context of cell-type deconvolution, the coefficients correspond to cell-type proportions in a sample. By definition, these proportions are non-negative, and their sum must be ≤1.
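A constrained projection of this kind can be sketched as follows. This is only an illustration: a simple projected gradient descent is swapped in for a dedicated quadratic programming solver, and the function names, the toy reference matrix and the iteration count are all assumptions.

```python
def project_to_feasible(weights):
    """Euclidean projection onto {w : w_i >= 0, sum(w) <= 1}."""
    clipped = [max(w, 0.0) for w in weights]
    if sum(clipped) <= 1.0:
        return clipped
    # Otherwise the projection lies on the simplex {w_i >= 0, sum(w) = 1}.
    ordered = sorted(weights, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0.0:
            theta = candidate
    return [max(w - theta, 0.0) for w in weights]

def estimate_cell_fractions(reference, profile, n_iter=2000):
    """Minimise 0.5 * ||R w - y||^2 subject to w >= 0 and sum(w) <= 1."""
    n_types = len(reference[0])
    # The squared Frobenius norm bounds the largest eigenvalue of R^T R,
    # so its inverse is a safe (if conservative) step size.
    step = 1.0 / sum(v * v for row in reference for v in row)
    weights = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [sum(r * w for r, w in zip(row, weights)) - y
                    for row, y in zip(reference, profile)]
        gradient = [sum(row[j] * res for row, res in zip(reference, residual))
                    for j in range(n_types)]
        weights = project_to_feasible(
            [w - step * g for w, g in zip(weights, gradient)])
    return weights

# Hypothetical reference (rows: CpGs, columns: cell types) and a noise-free mixture.
reference = [[0.9, 0.1, 0.1], [0.1, 0.9, 0.1], [0.1, 0.1, 0.9],
             [0.8, 0.2, 0.1], [0.2, 0.8, 0.2], [0.1, 0.2, 0.8]]
true_fractions = [0.5, 0.3, 0.2]
profile = [sum(r * f for r, f in zip(row, true_fractions)) for row in reference]
estimated = estimate_cell_fractions(reference, profile)
```

With real data the mixture is noisy and the reference is imperfect, so the estimates would only approximate the underlying proportions.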

Beta distributions

The distributions of beta values. The beta value is a statistical term used to describe the quantification of DNA methylation at a given cytosine, as the ratio of methylated alleles to the total number of alleles (methylated + unmethylated), a number that by definition must lie between 0 (fully unmethylated) and 1 (fully methylated).

Surrogate variable analysis

(SVA). A widely used technique for selecting features associated with a factor of interest, which is not confounded by other factors. SVA uses a model to identify the data variation that is orthogonal to the factor of interest and subsequently uses principal component analysis (PCA) on this orthogonal variation matrix to construct 'surrogate variables', which in theory should capture confounding sources of variation.

Phenotype of interest

(POI). The factor or variable of interest in an epigenome-wide association study (EWAS). This factor is often binary, representing case–control status, but could also represent an ordinal variable (for example, genotype) or be continuous (for example, age).

Blind source separation

(BSS). The problem of inferring the sources of variation that give rise to a data matrix without using any prior information ('blind'). Algorithms that can achieve this are called BSS algorithms, of which independent component analysis (ICA) is one example.

Independent component analysis

(ICA). An unsupervised dimensionality reduction algorithm that decomposes the data matrix into a sum of linear components of variation, which are as statistically independent from each other as possible. Statistical independence is a stronger condition than the linear uncorrelatedness of principal component analysis (PCA) components, allowing improved modelling of sources of variation in complex data.

Principal component analysis

(PCA). An unsupervised dimensionality reduction algorithm that decomposes the data matrix into a sum of linear principal components (PCs) of variation, ranked by decreasing variance and uncorrelated with each other.

Latent components

Components or sources of data variation that are 'hidden' (or latent) and that are inferred from the data using an unsupervised algorithm.


Supervised

Of statistical inferences, using the phenotype of interest from the outset, for instance, when identifying features correlating with a phenotype.

Variably methylated cytosines

(VMCs). Cytosines (usually in a CpG context) that exhibit a significant amount of variance in DNA methylation, as assessed across independent samples and relative to other CpG sites.


Heteroscedastic

Of a statistical distribution or of a random sample thereof, the expected variance, or spread, being dependent on the mean.

Logit transformation

A mathematical transformation that takes values defined on the unit interval (0,1) (for example, beta values (β)) into values defined on the open interval (−∞,+∞), termed M-values. Mathematically, M = log2[β/(1 − β)].
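The transformation and its inverse can be written as a short sketch. The function names are illustrative only, and beta values of exactly 0 or 1 are rejected here because the logit is undefined at the boundaries.

```python
import math

def m_value(beta):
    """Logit (base 2) transform of a beta value: M = log2(beta / (1 - beta))."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return math.log2(beta / (1.0 - beta))

def beta_from_m(m):
    """Inverse transform: beta = 2^M / (1 + 2^M)."""
    return 2.0 ** m / (1.0 + 2.0 ** m)

half_methylated = m_value(0.5)   # 0.0
mostly_methylated = m_value(0.8)  # approximately 2, since 0.8 / 0.2 = 4
```

A beta value of 0.5 maps to M = 0, and values close to 0 or 1 are stretched towards large negative or positive M-values, respectively.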

Methylation quantitative trait loci

(mQTLs). CpG sites whose DNA methylation level is correlated with a single-nucleotide polymorphism (SNP). If the SNP occurs close to the CpG (for instance, within a 10 kb window), it is called a cis-mQTL; otherwise, it is a trans-mQTL.

Differentially variable cytosines

(DVCs). Cytosines (usually in a CpG context) that exhibit a statistically significant difference in the variance of DNA methylation between two groups of samples, according to some statistical test.
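As an illustration of what 'some statistical test' for a difference in variance might look like, the sketch below applies Bartlett's test to a single cytosine measured in two groups. The choice of Bartlett's test, the function name and the toy beta values are assumptions made for this example, and the test is known to be sensitive to departures from normality.

```python
import math
from statistics import variance

def bartlett_two_groups(group_a, group_b):
    """Bartlett's test for equal variance in two groups.

    Returns the test statistic and a P value from the chi-squared
    distribution with 1 degree of freedom.
    """
    df_a, df_b = len(group_a) - 1, len(group_b) - 1
    df_total = df_a + df_b
    var_a, var_b = variance(group_a), variance(group_b)  # sample variances
    pooled = (df_a * var_a + df_b * var_b) / df_total
    numerator = (df_total * math.log(pooled)
                 - df_a * math.log(var_a) - df_b * math.log(var_b))
    correction = 1.0 + (1.0 / 3.0) * (1.0 / df_a + 1.0 / df_b - 1.0 / df_total)
    statistic = max(numerator / correction, 0.0)  # guard against rounding below 0
    # Survival function of chi-squared with 1 d.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical beta values at one CpG in controls and cases.
controls = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50]
cases = [0.20, 0.80, 0.50, 0.35, 0.65, 0.50]
statistic, p_value = bartlett_two_groups(controls, cases)
```

Both toy groups have the same mean, so a test for differential methylation would find nothing here, whereas the variance differs markedly between them.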

Field defects

Genetic or epigenetic alterations that are thought to predate the development of cancer and that are usually seen in the normal tissue found adjacent to cancer.

Type 1 error rate

The probability of erroneously calling the result of a test significant (positive) when the underlying true hypothesis is the null. It corresponds to the fraction of true negatives that are called positive, also known as the false-positive rate.
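The definition can be checked with a small simulation in which every hypothesis is truly null, so that any significant call is a false positive. This is a sketch under stated assumptions: test statistics are drawn from a standard normal distribution, and the function name, number of tests and seed are arbitrary.

```python
import random
from statistics import NormalDist

def simulate_type1_error(n_tests=20000, alpha=0.05, seed=0):
    """Fraction of truly null tests called significant at level alpha."""
    rng = random.Random(seed)
    standard_normal = NormalDist()
    false_positives = 0
    for _ in range(n_tests):
        z = rng.gauss(0.0, 1.0)  # test statistic under the null
        p_value = 2.0 * (1.0 - standard_normal.cdf(abs(z)))  # two-sided
        if p_value < alpha:
            false_positives += 1
    return false_positives / n_tests

empirical_rate = simulate_type1_error()
```

For a well-calibrated test, the empirical rate should be close to the nominal threshold alpha.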

Variably methylated regions

(VMRs). Contiguous genomic regions where DNA methylation is highly variable relative to a normal 'ground state'. A VMR can be defined for one given sample.

Differentially variable regions

(DVRs). Contiguous genomic regions containing a statistically significant number of differentially variable cytosines (DVCs). This is different from a variably methylated region (VMR) in that a DVR is derived by comparing a fairly large number of cases and controls.

Gene set enrichment analysis

(GSEA). A widely used statistical procedure to assess whether a derived gene list of interest is enriched for specific biological terms, usually including gene ontologies, signalling pathways, specific transcriptomic signatures or targets of gene regulators.
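One simple way to assess enrichment of a gene list for a biological term is a one-sided hypergeometric (over-representation) test, sketched below. Note that this is a simpler technique than the rank-based running-sum statistic often associated with the name GSEA; the function name and the toy counts are illustrative assumptions.

```python
from math import comb

def enrichment_p_value(n_background, n_in_term, n_selected, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric null.

    n_background: number of genes considered in total
    n_in_term:    background genes annotated to the term
    n_selected:   size of the derived gene list
    n_overlap:    genes in the list that are annotated to the term
    """
    total = comb(n_background, n_selected)
    upper = min(n_in_term, n_selected)
    tail = sum(comb(n_in_term, k) * comb(n_background - n_in_term, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Toy example: 10 genes, 4 annotated to the term, list of 3 with 2 annotated.
p_value = enrichment_p_value(10, 4, 3, 2)  # (36 + 4) / 120 = 1/3
```

In practice, the background should be restricted to genes that could have been selected, and P values should be adjusted for the number of terms tested.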

System epigenomics

An emerging field whereby cellular phenotypes in normal development and disease are modelled as complex systems, using tools from complexity science (for example, dynamical system theory or statistical physics) to understand them.


Pleiotropy

A phenomenon that occurs when a genetic variant is associated with multiple traits. Vertical pleiotropy occurs where the traits are all on the same pathway (and is generally less of a problem), whereas horizontal pleiotropy exists where a genetic variant is associated with multiple traits via separate pathways.

Expression quantitative trait loci

(eQTLs). Genes whose expression levels are correlated with single-nucleotide polymorphisms (SNPs). If the SNP occurs near the gene, it is called a cis-eQTL; otherwise, it is a trans-eQTL. Definitions of 'near' vary, ranging from a 10 kb to a 1 Mb window centred on the transcription start site.

TF hubs

In the context of a regulatory network where edges represent regulatory interactions between transcription factors (TFs) and target genes, those TFs with the largest number of interactions.

Expression quantitative trait methylation loci

(eQTMs). Genes whose expression levels are correlated with the DNA methylation level of a CpG. If the CpG occurs close to the gene (within a 250 kb window), it is called a cis-eQTM.


Tensor

A multi-dimensional array, with the number of dimensions often called the 'order' or 'rank' of the tensor, for which linear decomposition algorithms are available, analogous to linear matrix factorization algorithms for data matrices. Scalars, vectors and matrices are tensors of order 0, 1 and 2, respectively.

Mendelian randomization

A technique to estimate the effect of an exposure on an outcome using genetic variants as instrumental variables for the exposure. This approach can also be applied to assessing mediation.
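For a single genetic variant, the simplest estimator of this kind is the Wald ratio: the variant-outcome association divided by the variant-exposure association. The sketch below assumes summary statistics from linear models, a valid instrument and a first-order approximation for the standard error; the function name and the numbers are hypothetical.

```python
def wald_ratio(beta_variant_outcome, beta_variant_exposure, se_variant_outcome):
    """Wald ratio estimate of the effect of an exposure on an outcome.

    Returns the estimate and a first-order standard error that ignores
    uncertainty in the variant-exposure association.
    """
    if beta_variant_exposure == 0:
        raise ValueError("the variant must be associated with the exposure")
    estimate = beta_variant_outcome / beta_variant_exposure
    standard_error = abs(se_variant_outcome / beta_variant_exposure)
    return estimate, standard_error

# Hypothetical summary statistics for one variant.
estimate, standard_error = wald_ratio(0.10, 0.50, 0.02)
```

A weak variant-exposure association inflates both the estimate and its uncertainty, which is one reason why instrument strength matters in such analyses.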