Identification of methylation haplotype blocks aids in deconvolution of heterogeneous tissue samples and tumor tissue-of-origin mapping from plasma DNA


Adjacent CpG sites in mammalian genomes can be co-methylated owing to the processivity of methyltransferases or demethylases, yet discordant methylation patterns have also been observed, which are related to stochastic or uncoordinated molecular processes. We focused on a systematic search and investigation of regions in the full human genome that show highly coordinated methylation. We defined 147,888 blocks of tightly coupled CpG sites, called methylation haplotype blocks, after analysis of 61 whole-genome bisulfite sequencing data sets and validation with 101 reduced-representation bisulfite sequencing data sets and 637 methylation array data sets. Using a metric called methylation haplotype load, we performed tissue-specific methylation analysis at the block level. Subsets of informative blocks were further identified for deconvolution of heterogeneous samples. Finally, using methylation haplotypes we demonstrated quantitative estimation of tumor load and tissue-of-origin mapping in the circulating cell-free DNA of 59 patients with lung or colorectal cancer.
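The abstract names methylation haplotype load (MHL) without defining it. The following is a minimal sketch of one way a block-level load of this kind can be computed. It assumes (this is not stated in the text above) that the load is a length-weighted average of the fraction of fully methylated substrings at each substring length, with weight equal to the length. The function name and the string encoding of haplotypes ('1' methylated, '0' unmethylated) are hypothetical choices made for illustration.

```python
def methylation_haplotype_load(haplotypes):
    """Length-weighted fraction of fully methylated substrings.

    haplotypes: list of equal-length strings over {'1', '0'}, one per read,
    covering the CpG sites of a single block (illustrative encoding).
    Assumption: the weight for substrings of length i is i.
    """
    length = len(haplotypes[0])
    weighted_sum = 0.0
    weight_total = 0.0
    for size in range(1, length + 1):
        total = 0
        fully_methylated = 0
        for hap in haplotypes:
            # Slide a window of `size` CpGs along the read.
            for start in range(length - size + 1):
                total += 1
                if hap[start:start + size] == "1" * size:
                    fully_methylated += 1
        weighted_sum += size * fully_methylated / total
        weight_total += size
    return weighted_sum / weight_total


# Two read sets with the same average methylation (50%) but different coupling.
coordinated = ["1111", "0000"]
discordant = ["1010", "0101"]
print(methylation_haplotype_load(coordinated))
print(methylation_haplotype_load(discordant))
```

Under these assumptions, the coordinated read set scores much higher than the discordant one even though both have the same mean methylation level, which is the property that makes a haplotype-level metric useful for tightly coupled blocks.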


Figure 1: Identification and characterization of human methylation haplotype blocks (MHBs).
Figure 2: Comparison of methylation haplotype load with four other metrics used in the literature.
Figure 3: Tissue clustering based on methylation haplotype load.
Figure 4: Quantitative estimation of the proportion of DNA derived from cancer cells in cell-free DNA, using the MHL of informative MHBs.
Figure 5: MHL-based prediction of cancer tissue of origin from plasma DNA.

Accession codes

Primary accessions

Gene Expression Omnibus

Referenced accessions

Gene Expression Omnibus




We thank S. Kaushal for managing and handling patient samples in the UCSD Moores Cancer Center Biorepository Tissue Technology Shared Resource, and S.M. Lippman, R. Liu and B. Ren for insightful discussions. This study was supported by US National Institutes of Health grants R01GM097253 (Kun Zhang), R01CA217642 (Kang Zhang), R01EY025090 (Kang Zhang) and P30CA23100 (S.M.L.), and a VA Merit Award (Kang Zhang).

Author information




Kun Zhang conceived the initial concept and oversaw the study; S.G., D.D. and Kun Zhang performed the bioinformatics analyses; N.P., D.D. and H.-L.F. performed the experiments; Kang Zhang contributed plasma samples from healthy individuals; and Kun Zhang, S.G. and D.D. wrote the manuscript with input from all co-authors.

Corresponding author

Correspondence to Kun Zhang.

Ethics declarations

Competing interests

S.G., D.D. and Kun Zhang are listed as inventors in patent applications related to the methods disclosed in this manuscript, and Kun Zhang is a co-founder and scientific advisor of Singlera Genomics, Inc.

Integrated supplementary information

Supplementary Figure 1 Characteristics of MHBs in the human genome.

(a) Distribution of MHB sizes. (b) Distribution of MHB CpG densities (CpGs/bp). (c) Co-localization with known genomic features, broken down by CpG density. We split all MHBs into quartiles by CpG density: (0, 0.046], (0.046, 0.097], (0.097, 0.155], (0.155, 6]. Square brackets denote inclusive bounds. MHBs in the first quartile (those with the lowest CpG densities) are mostly in CGI shelves or shores and are enriched for LADs, LOCKs and enhancers.

Supplementary Figure 2 Loss of CpG linkage disequilibrium replicated in two additional tumor tissue samples from patients with kidney cancer.

Two kidney cancer WGBS data sets were downloaded from NCBI GEO (GSE63183) and processed with the same computational procedures.

Supplementary Figure 3 Validation of MHBs with TCGA Methylation HM450K beadchip and ENCODE RRBS data.

(a) Squared Pearson correlation coefficient (r²) versus methylation LD r². (b) Pearson correlation coefficients for CpGs in RRBS and HM450K data were significantly higher in regions overlapping MHBs than for CpGs not overlapping MHBs. IN denotes RRBS or HM450K CpGs within MHBs. OUT denotes RRBS or HM450K regions outside MHBs.
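To make the methylation LD r² in panel (a) concrete, the sketch below applies the standard linkage disequilibrium r² formula to a pair of CpG sites, treating each read as a haplotype. The function name and read encoding ('1' methylated, '0' unmethylated, any other character for a site the read does not cover) are illustrative assumptions, not the authors' implementation.

```python
def methylation_ld_r2(reads, i, j):
    """Standard LD r^2 between CpG positions i and j across reads.

    reads: list of strings; '1' = methylated, '0' = unmethylated,
    anything else = not covered (illustrative encoding).
    Returns None when r^2 is undefined (no informative reads, or one
    of the two sites is invariant).
    """
    pairs = [(r[i], r[j]) for r in reads if r[i] in "01" and r[j] in "01"]
    n = len(pairs)
    if n == 0:
        return None
    p_a = sum(a == "1" for a, _ in pairs) / n   # methylation frequency at site i
    p_b = sum(b == "1" for _, b in pairs) / n   # methylation frequency at site j
    p_ab = sum(a == "1" and b == "1" for a, b in pairs) / n  # co-methylation
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / denom


print(methylation_ld_r2(["11", "00", "11", "00"], 0, 1))  # perfectly coupled
print(methylation_ld_r2(["11", "10", "01", "00"], 0, 1))  # independent
```

Returning None for invariant sites is a design choice for this sketch; a site that is always methylated or always unmethylated carries no information about coupling.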

Supplementary Figure 4 Profiles of H3K27ac, H3K4me3 and H3K4me1 over methylation haplotype blocks for 12 human adult tissue types.

The x-axis denotes the distance from the center of each MHB (±1,000 bp) and the y-axis denotes the average read density in RPKM (input-normalized reads per kilobase per million). Roadmap Epigenomics histone data were downloaded from NCBI GEO (

Supplementary Figure 5 PCA of human tissues and cells based on methylation haplotype loads in MHB regions.

WGBS data sets for tissues and cells were downloaded from 5 other studies, and WGBS data for 10 adult tissues were generated in this study.

Supplementary Figure 6 Distinct patterns of functional enrichment for TFBS associated with layer-specific MHBs.

(a) Venn diagrams of transcription factors (TFs) with binding sites associated with layer-specific hypo- or hyper-MHL regions. (b) Functional enrichment analysis of associated TFBS using GREAT (

Supplementary Figure 7 Distribution of incidence of cancer-associated HMH in plasma samples from patients with colorectal cancer or lung cancer.

Y-axis denotes the frequency of caHMH and x-axis denotes the incidence (sample number) of the caHMH in CRC plasma samples (a) or LC plasma samples (b). The majority of caHMH are patient specific while a few have high incidence among the cancer plasma samples.

Supplementary Figure 8 Deconvolution of cancer and normal plasma samples using non-negative decomposition with quadratic programming.

(a) Deconvolution accuracy as a function of tumor fraction using simulated data. (b) Cancer DNA proportions estimated by deconvolution of plasma samples using CCT or LCT as the tumor reference.
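As a rough illustration of constrained deconvolution, the sketch below handles only the simplest case: a plasma profile modeled as a mixture of one tumor reference and one normal reference. This two-component simplification is an assumption made here for brevity; the caption refers to a non-negative decomposition solved with quadratic programming, which would generally involve more reference components and a dedicated solver. All names and values are hypothetical.

```python
def estimate_tumor_fraction(mixture, tumor_ref, normal_ref):
    """Least-squares tumor fraction f with 0 <= f <= 1.

    Model (assumed): mixture[k] ~ f * tumor_ref[k] + (1 - f) * normal_ref[k].
    With a single free parameter the objective is a convex quadratic in f,
    so clipping the unconstrained optimum to [0, 1] gives the constrained one.
    """
    numerator = 0.0
    denominator = 0.0
    for m, t, n in zip(mixture, tumor_ref, normal_ref):
        diff = t - n
        numerator += (m - n) * diff
        denominator += diff * diff
    if denominator == 0:
        raise ValueError("references are identical; fraction is not identifiable")
    return min(1.0, max(0.0, numerator / denominator))


# Simulated block-level values: 25% tumor-derived, 75% normal-derived.
tumor = [0.9, 0.8, 0.1]
normal = [0.1, 0.2, 0.1]
plasma = [0.25 * t + 0.75 * n for t, n in zip(tumor, normal)]
print(estimate_tumor_fraction(plasma, tumor, normal))
```

Blocks where the two references agree (the third value above) contribute nothing to the estimate, which is one reason to restrict deconvolution to informative blocks.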

Supplementary Figure 9 Estimated tumor fraction in plasma correlated with the normalized yield of DNA extraction from plasma.

Supplementary Figure 10 Distribution of tissue-specific MHB counts in human plasma samples.

The color bar represents the number of tissue-specific MHBs (for each respective tissue) above the MHL threshold in each plasma sample.

Supplementary Figure 11 Distributions of counts of highly methylated tsMHBs in human plasma samples.

(a) Distributions of counts of tsMHBs in normal plasma samples for 11 reference tissues. (b) Distributions of counts of tsMHBs in lung cancer plasma samples for lung tissue or for pan-cancer tissue (CT). (c) Distributions of counts of tsMHBs in colorectal cancer plasma samples for colon tissue or for pan-cancer tissue (CT).

Supplementary Figure 12 Joint prediction of cancer status and tissue of origin on plasma DNA.

Distribution of enrichment Z-scores in each set of reference-specific MHBs for colon cancer plasma samples (a) and lung cancer plasma samples (b). Integrating signatures from cancer and tissue of origin (Colon+CT; Lung+CT) improved the prediction accuracy (c,d) on both types of plasma samples, compared with focusing on cancer signatures alone (CT). The ROC curves were created by adjusting the Z-score cutoff for calculating specificities and sensitivities. AUC denotes area under the curve.
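The ROC construction described in this caption, sweeping a score cutoff and recording sensitivity and specificity at each step, can be sketched as follows. The scores and labels are made-up illustrative values, and the function names are hypothetical; the area is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) at each distinct cutoff.

    scores: one Z-score per sample; labels: 1 = cancer, 0 = normal.
    A sample is called positive when its score is >= the cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


z_scores = [3.0, 2.0, 1.0, 0.5]   # illustrative values only
is_cancer = [1, 0, 1, 0]
print(area_under_curve(roc_points(z_scores, is_cancer)))
```

Sensitivity is the true positive rate and specificity is one minus the false positive rate, so each point on the curve corresponds to one Z-score cutoff.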

Supplementary Figure 13 Flowchart of data processing and samples used in this study.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–13 and Supplementary Note (PDF 2077 kb)

Supplementary Tables

Supplementary Tables 1–13 (XLSX 4198 kb)

Cite this article

Guo, S., Diep, D., Plongthongkum, N. et al. Identification of methylation haplotype blocks aids in deconvolution of heterogeneous tissue samples and tumor tissue-of-origin mapping from plasma DNA. Nat Genet 49, 635–642 (2017).
