## Abstract

An outstanding challenge of epigenome-wide association studies (EWASs) performed in complex tissues is the identification of the specific cell type(s) responsible for the observed differential DNA methylation. Here we present a statistical algorithm called CellDMC (https://github.com/sjczheng/EpiDISH), which can identify differentially methylated positions and the specific cell type(s) driving the differential methylation. We validated CellDMC on in silico mixtures of DNA methylation data generated with different technologies, as well as on real mixtures from epigenome-wide association and cancer epigenome studies. CellDMC achieved over 90% sensitivity and specificity in scenarios where current state-of-the-art methods did not identify differential methylation. By applying CellDMC to an EWAS performed in buccal swabs, we identified smoking-associated differentially methylated positions occurring in the epithelial compartment, which we validated in smoking-related lung cancer. CellDMC may be useful in the identification of causal DNA-methylation alterations in disease.

## Data availability

All data analyzed in this manuscript are publicly available from GEO (https://www.ncbi.nlm.nih.gov/geo/), the TCGA data portal (https://gdc.cancer.gov), the International Human Epigenome Consortium (IHEC; http://epigenomesportal.ca/ihec) or ArrayExpress (https://www.ebi.ac.uk/arrayexpress/). Accession codes for data from GEO include GSE31848, GSE59250, GSE71955, GSE71244, GSE43976, GSE50222, GSE56047, GSE42861, GSE69914, GSE40699 and GSE35069. The accession code for data from ArrayExpress is E-MTAB-2145. All DNA methylation reference matrices used to estimate cell-type fractions are available on GitHub (https://github.com/sjczheng/EpiDISH). The DNAm data set in buccal cells is available by request to mrclha.swiftinfo@ucl.ac.uk; the full policy is available at http://www.nshd.mrc.ac.uk/data.aspx. Managed access is in place for this 69-year-old NSHD (National Survey of Health and Development) study to ensure that use of the data is within the bounds of consent given previously by participants, and to safeguard any potential threat to anonymity, as the participants were all born in the same week.

## Additional information

**Publisher’s note:** Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Acknowledgements

This work was supported by National Science Foundation of China grants 31571359, 31771464 and 31401120, a Wellcome Trust grant (99148) and a Royal Society Newton Advanced Fellowship (NAF project 522438, NAF award 164914).

## Author information

### Affiliations

#### CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China

- Shijie C. Zheng
- Andrew E. Teschendorff

#### Altius Institute for Biomedical Sciences, Seattle, WA, USA

- Charles E. Breeze

#### UCL Cancer Institute, Paul O’Gorman Building, University College London, London, UK

- Stephan Beck
- Andrew E. Teschendorff

#### Department of Women’s Cancer, University College London, London, UK

- Andrew E. Teschendorff

### Contributions

The study was conceived by A.E.T. S.C.Z. and A.E.T. contributed jointly to methods development. S.C.Z. performed the statistical analyses. The manuscript was written by S.C.Z. and A.E.T. S.B. and C.E.B. contributed valuable feedback.

### Competing interests

The authors declare no competing interests.

### Corresponding author

Correspondence to Andrew E. Teschendorff.

## Integrated supplementary information

### Supplementary Figure 1 CellDMC flowchart.

CellDMC consists of three main steps. (Step-1) Estimation of cell-type fractions in a given sample using a DNAm reference matrix for *K* cell types and the DNAm profile of the sample *s*. The estimation procedure is based on the Hierarchical EpiDISH (HEpiDISH) algorithm and uses DNAm references for a generic epithelial cell, fibroblasts and the seven main immune cell subtypes. (Step-2) For each CpG, a linear model is fit to the CpG’s DNAm profile, which includes covariates for all cell-type fractions and interaction terms between cell-type fractions and phenotype (only *K* − 1 terms are independent in each summation). The P-values for the interaction terms inform on the nature of the association (if any) between phenotype, cell type and DNAm. For instance, as shown for CpG1, only the P-value of the interaction term for cell type *CT*_{1} is highly significant, indicating that CpG1 is a DMCT only in cell type *CT*_{1}. The sign of the estimated regression coefficient of the interaction term is positive, indicating that CpG1 is hypermethylated in cases vs controls. CpGs are declared significant DMCTs across cell types according to the rule as indicated. (Step-3) Finally, all significant DMCTs can be ranked for each cell type according to the P-value of the interaction term.

### Supplementary Figure 2 Choice of signal-to-noise ratio (SNR) and validation of CellDMC’s ranking procedure.

(**a**) Box plot of the effect size (i.e. the average of absolute DNAm differences between cases and controls within an affected cell type, averaged over 100 DMCs) against the corresponding SNR value. Each box plot contains 100 values representing 100 different Monte Carlo runs. Thus, for SNR ~ 3, the mean difference in DNAm between case and control within an affected cell type is just over 0.4. The cases SNR ~ 1.2 and SNR ~ 0.9 do not differ in terms of the average difference in DNAm, but differ in terms of the variance, with a larger variance in the SNR ~ 0.9 case. Boxes indicate 25th and 75th centiles. Whiskers extend to the largest values no further than 1.5 times the IQR from these centiles, where IQR is the interquartile range. (**b**) Scatterplots of the absolute true methylation difference of DMCTs between cases and controls (y-axis) against CellDMC’s predicted rank position (x-axis), for a range of different simulation scenarios (Uni-1C, Uni-2C, Uni-AllC, Bi-2C and Bi-AllC) with 1,000 true DMCTs.

### Supplementary Figure 3 CellDMC’s sensitivity for detecting DMCTs in a scenario where there is a shift in the mean cell-type fractions between cases and controls.

(**a**) Box plot of cell-type fractions between cases and controls for one illustrative Monte Carlo run. (**b**) Sensitivity measure to detect DMCTs for CellDMC. All results were based on 100 Monte Carlo runs, with 100 ‘control’ in silico mixtures and 100 ‘disease’ in silico mixtures. Boxes indicate 25th and 75th centiles. Whiskers extend to the largest values no further than 1.5 times the interquartile range from these centiles.

### Supplementary Figure 4 Performance of CellDMC and non-interaction-based model (NoInt) when only one immune cell subtype is differentially methylated.

Unlike previous figures, where differential methylation in the immune cell (IC) compartment affected all immune cell subtypes equally, we now consider only one immune cell subtype (here B cells) to be altered. (**a**) Scatterplot of the adjusted β (adjusted for all cell-type fractions) against cell-type fraction for a hypothetical example where differential methylation happens for only one IC subtype (e.g. B cells). (**b, c**) Corresponding sensitivity and specificity measures of CellDMC and the non-interaction-based model (NoInt) for detecting DMCs. (**d**) Sensitivity measure of CellDMC to detect DMCTs. All results were based on 100 Monte Carlo runs, with 100 ‘control’ in silico mixtures and 100 ‘disease’ in silico mixtures. Boxes indicate 25th and 75th centiles. Whiskers extend to the largest values no further than 1.5 times the interquartile range from these centiles.

### Supplementary Figure 5 Validation of purified IC subtype datasets.

In this figure, we checked the purities of B cells, CD4+ T cells, CD8+ T cells, neutrophils and monocytes as profiled in independent studies. Cell-type fractions were estimated using EpiDISH (RPC mode) with our previously validated blood subtype DNAm reference, which consists of 333 cell-type-specific DMCs, to estimate fractions of six major blood subtypes (B cells, CD4+ T cells, CD8+ T cells, NK cells, monocytes and granulocytes/neutrophils). We display the estimated fractions for all purified blood datasets in (**a**), and for each cell subtype over the different source datasets in (**b, c, d, e, f**). Boxes indicate 25th and 75th centiles. Whiskers extend to the largest values no further than 1.5 times the interquartile range from these centiles.

### Supplementary Figure 6 Performance comparison of CellDMC and the no-interaction model when only one IC subtype is altered including blood EWAS context.

(**a-d**) Similar to Supplementary Fig. 4, but now we used a two-step method (HEpiDISH) to estimate the fractions of individual IC subtypes. (**a**) Scatterplot of an example where differential methylation happens only in B cells. HEpiDISH first estimates the fractions of total immune cells, epithelial cells and fibroblasts. It then reapplies EpiDISH using the modified blood DNAm reference to estimate relative fractions of IC blood subtypes, which are then rescaled to obtain approximate absolute fractions for all cell types. The y-axis shows the β adjusted for cell-type fractions, and the x-axis gives the estimated fraction of each cell type. (**b, c**) Sensitivity and specificity measures of DMC calling for CellDMC and the non-interaction (NoInt)-based model. (**d**) Sensitivity measures of DMCT calling for CellDMC. All results were based on 100 Monte Carlo runs, with 100 ‘control’ and 100 ‘disease’ in silico mixtures. Boxes indicate 25th and 75th centiles. Whiskers extend to the largest values no further than 1.5 times the interquartile range from these centiles. (**e-h**) As (**a-d**), but now for the simulation scenario of a blood EWAS, where only blood cell subtypes are mixed together, with B cells the only altered cell type.

### Supplementary Figure 7 Power analysis of CellDMC.

(**a, b**) Sensitivity of CellDMC to detect (**a**) DMCs and (**b**) DMCTs as a function of total sample size (numbers of cases and controls are assumed equal and therefore one-half of this value) for five different SNR values in the scenario of a uni-directional change in one cell type. (**c**) Corresponding specificity of CellDMC to detect DMCs/DMCTs. (**d, e, f**) As in (**a-c**), but now for the scenario of a uni-directional change affecting all cell types in the mixture. Each data point represents the mean over 100 Monte Carlo runs, where in each run 200 mixture samples were simulated: 100 ‘control’ in silico mixtures and 100 ‘disease’ in silico mixtures.

### Supplementary Figure 8 Performance of CellDMC as a function of cell-type complexity and cell-type fraction range.

Assuming all cell types exhibit fractions of the same underlying mean and variance: (**a, b**) Sensitivity to detect DMCs and DMCTs as a function of the total number of cell types in the mixtures. The scenario considered is where only one cell type is differentially methylated. (**c**) As (**a**), but for the specificity. Each data point represents the mean over 100 Monte Carlo runs, where in each run 200 mixture samples were simulated: 100 ‘control’ in silico mixtures and 100 ‘disease’ in silico mixtures. In all cases, cell-type fractions were sampled from a Dirichlet distribution. For *K* = 3, we mixed epithelial cells, fibroblasts and monocytes and used EpiDISH with the centEpiFibIC DNAm reference to estimate cell-type fractions. For larger *K* values, we iteratively added one more IC subtype to the mix in the order of neutrophils (*K* = 4), CD4+ T cells (*K* = 5), B cells (*K* = 6) and CD8+ T cells (*K* = 7). For *K* > 3, we used HEpiDISH. For all *K*, in each Monte Carlo run, there were 150 true DMCTs (75 hyper and 75 hypo) occurring specifically in the monocytes. Considering the case of DMCs occurring only in one cell type: (**d**) Box plots of cell-type fractions of the altered cell type (y-axis) against the cell-type fraction range. (**e**) Box plots showing that the sensitivity to detect DMCTs is larger than 50% as long as the range of variation in the cell-type fraction is 0.2 or higher. Even if the variation range in the cell-type fraction is only 0.15 (i.e. 15%), the sensitivity is still close to 50%. (**f, g**) The corresponding sensitivity and specificity to detect DMCs. All results were based on 100 Monte Carlo runs, with 100 ‘control’ and 100 ‘disease’ in silico mixtures. Boxes indicate 25th and 75th centiles. Whiskers extend to the largest values no further than 1.5 times the interquartile range from these centiles.

### Supplementary Figure 9 Robustness of CellDMC to a missing cell type in DNAm reference matrix.

Assuming all cell types exhibit fractions of the same underlying mean and variance: (**a**) Comparison of the sensitivity to detect DMCTs as a function of SNR when all major cell types have representative DNAm profiles in the reference DNAm matrix (AllCT) versus the case when one major cell type is missing from the DNAm reference matrix (MissingCT). The scenario considered is where only one cell type out of three is differentially methylated (Uni-1C), with the three cell-type fractions per sample drawn from a uniform Dirichlet distribution. (**b**) As (**a**), but for the sensitivity to detect DMCs. (**c**) As (**b**), but for the specificity. (**d-f**) As (**a-c**), but now for the scenario where the same DNAm alterations are happening in two cell types, with one of these missing from the DNAm reference matrix. (**g-j**) Robustness of CellDMC to a missing cell type in the blood DNAm reference matrix. (**g**) Sensitivity of CellDMC to detect DMCTs occurring in CD4+ T cells at various SNRs, for the case where all five depicted blood cell subtypes are present in the DNAm reference matrix (“AllCT”, blue) versus the case where one cell type (CD8+ T cells) was missing/removed from the DNAm reference matrix. (**h-i**) As (**g**), but for the sensitivity to detect DMCs (i.e. not caring which cell type the DMCT is called in) and the specificity. In all cases, samples were in silico mixtures of the five underlying blood cell subtypes. (**j**) Realistic distribution of cell-type fractions in blood from which the simulated fractions were modeled. Each box plot contains results from 100 Monte Carlo runs, where in each run 200 mixture samples were simulated: 100 ‘control’ and 100 ‘disease’ in silico mixture samples. Boxes indicate 25th and 75th centiles. Whiskers extend to the largest values no further than 1.5 times the interquartile range from these centiles.

### Supplementary Figure 10 Benchmarking of CellDMC to non-interaction-based DMC calling (NoInt) in bi-directional scenarios.

(**a-b**) Sensitivity measures of the non-interaction-based model and CellDMC for detecting DMCs and DMCTs for the bi-directional DNAm change scenarios affecting all or only two cell types, and for five signal-to-noise ratios (SNRs), as indicated. Note that for the bi-directional two-cell scenario depicted in (**b**), a non-interaction-based analysis cannot detect DMCs because the opposite directional DNAm changes in two of the underlying cell types (assumed here to be of equal magnitude) ‘cancel out’. (**c-d**) As (**a-b**), but now for the specificity. All results were based on 100 Monte Carlo runs, with 100 ‘control’ in silico mixtures and 100 ‘disease’ in silico mixtures. (**e**) Illustrative example of a DMCT that would be missed by a conventional DM analysis (the non-interaction-based model) because the bi-directional changes in the two cell types cancel out. (**f-g**) Same CpG, but now with the adjusted β plotted against cell-type fraction for the two underlying cell types, showing differences in DNAm between cases and controls, with the directionality switching depending on the cell-type fraction. Boxes indicate 25th and 75th centiles. Whiskers extend to the largest values no further than 1.5 times the interquartile range from these centiles.

### Supplementary Figure 11 Benchmarking of CellDMC to NoInt model, RefFreeEWAS and SVA.

(**a**) From left to right, plots of the sensitivity of CellDMC, of reference-based DMC calling (which includes cell-type fractions only as covariates without interaction terms, denoted “NoInt”), and of the reference-free methods RefFreeEWAS and SVA, against the SNR for five different differentially methylated cell-type scenarios where mean cell-type fractions are not different between cases and controls. Because the competing methods can only detect DMCs, and not DMCTs, we compare the methods in terms of their sensitivity to detect DMCs. Results are for 100 Monte Carlo runs, where in each run 200 in silico mixture samples were simulated (100 ‘control’ and 100 ‘disease’). (**b**) As (**a**), but now for the specificity. (**c-d**) As (**a-b**), but now there is a change in mean cell-type fractions between cases and controls. Boxes indicate 25th and 75th centiles. Whiskers extend to the largest values no further than 1.5 times the interquartile range from these centiles.

### Supplementary Figure 12 Evaluation of CellDMC’s performance on WGBS data and comparison to Illumina 450k.

(**a, c, e**) Sensitivity to detect DMCTs, DMCs and specificity for DMC/DMCT detection for CellDMC in five different simulation scenarios and five different SNRs, as indicated, on in silico mixtures of DNAm samples generated from Illumina 450k data. (**b, d, f**) As (**a, c, e**), but now displaying CellDMC’s performance on in silico mixtures of DNAm profiles generated from WGBS data derived from IHEC. All results were based on 100 Monte Carlo runs, with 100 ‘control’ and 100 ‘disease’ in silico mixtures. Boxes indicate 25th and 75th centiles. Whiskers extend to the largest values no further than 1.5 times the interquartile range from these centiles.

### Supplementary Figure 13 Robustness of CellDMC on the rheumatoid arthritis EWAS in whole blood.

(**a**) Heat map showing the true B-cell-specific rheumatoid arthritis DMCs and the DMCTs predicted by CellDMC (with cell-type fractions inferred with constrained projection, CP). (**b**) Scatterplot of the interaction t-statistics for B cells as inferred using CellDMC with EpiDISH RPC (x-axis) vs. constrained projection (CP; y-axis). Statistic values larger than 10 or smaller than –10 are highly significant and are very robust to the choice of cell-type-fraction estimation algorithm. (**c-d**) As (**a-b**), but now cell-type fractions were estimated using EpiDISH RPC with CD8+ T cells missing from the DNAm reference. Calling of DMCTs was based on a two-tailed *t*-test in CellDMC.

### Supplementary Figure 14 Validation of CellDMC in detecting breast cancer epithelial DMCTs.

(**a**) Bar plots displaying CellDMC’s sensitivity to detect DMCs and Epi-DMCTs, and its specificity. (**b**) Bar plot showing the percentage of breast cancer epithelial DMCTs correctly predicted by CellDMC to occur in the epithelial compartment. There were a total of 19,379 true positive breast cancer epithelial DMCTs, of which 11,807 were correctly predicted to be a DMCT in epithelial cells. We also display the percentages for true hypermethylated and hypomethylated DMCTs separately. (**c**) Bar plot of the percentage of breast cancer epithelial DMCs predicted by CellDMC to be DMCTs in the epithelial, fibroblast, fat and immune cell compartments. (**d**) Bar plot of the percentage of predicted nonDMCTs (nonDMCs) in breast cancer for a set of 15,212 true negatives, confirming that the great majority of these are not predicted to be epithelial DMCTs. We note that these CpGs are also not predicted to be DMCs in fibroblasts, fat or immune cells. All numbers on bar plots denote the number of CpGs. Error bars indicate the 99% confidence interval, which was calculated as p ± 2.576 × √[p(1 – p)/n].

### Supplementary Figure 15 Identification of smoking DMCTs in buccal tissues.

(**a**) Bar plots comparing the percentage of DMCTs that occur in epithelial and IC compartments, measured relative to the set of 62 smoking DMCs in blood, or relative to all called DMCs/DMCTs. (**b**) Bar plots displaying the number of different types of DMCT, as indicated. Epi, epithelial only; IC, immune cell only; Hyper, hypermethylated in smokers; Hypo, hypomethylated in smokers. “All” means a DMC that occurs in both cell types.
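
The interaction model summarized in Step-2 of the CellDMC flowchart (Supplementary Figure 1) can be illustrated with a small simulation. The sketch below is not the authors' implementation, which is distributed in the EpiDISH R package. It is a minimal Python/NumPy illustration under stated assumptions: the function name `fit_interaction_model`, the simulated baseline methylation values, the effect size of 0.3 and the noise level are all hypothetical. Because the cell-type fractions sum to one, the design used here omits the intercept and the phenotype main effect, which is one possible way to respect the caption's remark that only K − 1 terms are independent in each summation. The sketch reports t-statistics only; P-values would follow from a t-distribution with n − 2K degrees of freedom.

```python
import numpy as np


def fit_interaction_model(beta, fractions, phenotype):
    """Ordinary least squares fit for a single CpG (illustrative only).

    Design matrix: K cell-type fraction columns plus K columns of
    fraction x phenotype interactions (no intercept, no phenotype main
    effect, since the fractions sum to one).
    Returns the K interaction coefficients and their t-statistics.
    """
    F = np.asarray(fractions, dtype=float)   # shape (n_samples, K)
    y = np.asarray(phenotype, dtype=float)   # shape (n_samples,), 0 = control, 1 = case
    X = np.hstack([F, F * y[:, None]])       # shape (n_samples, 2K)
    coef, _, _, _ = np.linalg.lstsq(X, beta, rcond=None)
    n, p = X.shape
    resid = beta - X @ coef
    sigma2 = resid @ resid / (n - p)         # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    K = F.shape[1]
    return coef[K:], coef[K:] / se[K:]


# Hypothetical simulation: 100 controls and 100 cases, three cell types.
rng = np.random.default_rng(0)
n_per_group, K = 100, 3
phenotype = np.repeat([0, 1], n_per_group)
fractions = rng.dirichlet(np.ones(K), size=2 * n_per_group)

baseline = np.array([0.2, 0.5, 0.8])  # assumed methylation level per cell type
shift = np.array([0.3, 0.0, 0.0])     # only cell type 0 is altered in cases
beta = (fractions @ baseline
        + phenotype * (fractions @ shift)
        + rng.normal(0.0, 0.02, size=2 * n_per_group))

coef, tstat = fit_interaction_model(beta, fractions, phenotype)
for k in range(K):
    print(f"cell type {k}: interaction coef = {coef[k]:+.3f}, t = {tstat[k]:+.1f}")
```

In this simulated setting the interaction term for the altered cell type should stand out with a large positive t-statistic, mirroring the caption's CpG1 example in which a positive coefficient indicates hypermethylation in cases within a single cell type.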

## Supplementary information

### Supplementary Text and Figures

Supplementary Figs. 1–15 and Supplementary Tables 1 and 2

### Reporting Summary

### Supplementary Data 1

Validation of EpiFibIC DNAm reference

### Supplementary Data 2

Performance of CellDMC for continuous and heterogeneous binary phenotypes

### Supplementary Data 3

Comparison of CellDMC run with EpiDISH to CellDMC run with constrained projection (CP)

### Supplementary Data 4

Robustness of CellDMC to perturbations in the cell-type fraction point estimates (using analytical error estimates)

### Supplementary Data 5

Performance of CellDMC with perturbations in the cell-type fraction point estimates (using empirically derived bootstrapped error estimates)

### Supplementary Data 6

Validation of EpiDISH on WGBS data from IHEC

### Supplementary Data 7

Scatterplots of adjusted β-values for all five HAND2 1stExon CpGs in the TCGA UCEC dataset

### Supplementary Data 8

Estimated cell-type fractions in the smoking buccal dataset
