EMeth: An EM algorithm for cell type decomposition based on DNA methylation data

We introduce a new computational method named EMeth to estimate cell type proportions using DNA methylation data. EMeth is a reference-based method that requires cell type-specific DNA methylation data from relevant cell types. EMeth improves on the existing reference-based methods by detecting the CpGs whose DNA methylation are inconsistent with the deconvolution model and reducing their contributions to cell type decomposition. Another novel feature of EMeth is that it allows a cell type with known proportions but unknown reference and estimates its methylation. This is motivated by the case of studying methylation in tumor cells while bulk tumor samples include tumor cells as well as other cell types such as infiltrating immune cells, and tumor cell proportion can be estimated by copy number data. We demonstrate that EMeth delivers more accurate estimates of cell type proportions than several other methods using simulated data and in silico mixtures. Applications in cancer studies show that the proportions of T regulatory cells estimated by DNA methylation have expected associations with mutation load and survival time, while the estimates from gene expression miss such associations.

www.nature.com/scientificreports/ In general, there are two classes of methods studying cell type decomposition using DNA methylation data: reference-based methods vs. reference-free methods. Reference-free methods do not require the reference of cell type-specific DNA methylation [9][10][11] . Their main goal is to account for the variation of cell type composition in the association analysis of DNA methylation 12 , and the latent factors estimated from these methods are often linear combinations of cell type proportions rather than the cell type proportions themselves. The reference-based methods, which require prior knowledge of cell type-specific DNA methylation, can estimate the proportion of each cell type. Our EMeth method is a reference-based method.
A few reference-based methods have been developed in a regression framework where the response variable is the DNA methylation of multiple CpGs in a tissue sample and each covariate is the DNA methylation of these CpGs in a cell type. Houseman et al. 13 proposed a linear regression method with quadratic programming to impose the constraint that the regression coefficients are none-negative. Teschendorff et al. 14 proposed a method Epidish that replaces linear regression with robust linear regression (R function MASS/rlm) that uses a weighted loss function so that the data points with larger residuals have smaller weights. Motivated by the success of CIBERSORT in gene expression decomposition, Teschendorff et al. 14 and Chakravarthy et al. 15 employed support vector regression to estimate cell type composition using DNA methylation data.
A limitation of reference-based methods is that for some CpGs, the DNA methylation in the reference samples may not accurately capture the cell type-specific DNA methylation in a tissue sample. Another limitation is that reference may not be available for all cell types. For example, when studying tumor immune microenvironment, we often have the reference for immune cell types, but not for tumor cells. Chakravarthy et al. 15 have used the DNA methylation derived from tumor cell lines as the reference for tumor cells, which may not be ideal for many cancer types where there are considerable differences between cell lines and bulk tumor samples.
To overcome these two limitations, we propose a new method for cell type decomposition using DNA methylation data. We name our method EMeth, since it uses an EM (Expectation-Maximization) algorithm for parameter estimation and it is applied on DNA methylation data. To overcome the challenge of inaccurate reference for some CpGs, EMeth models the observed DNA methylation of each CpG by a mixture distribution with one component for regular/consistent CpGs and the other component for aberrant CpGs whose DNA methylation are inconsistent with what is expected from the deconvolution model. EMeth automatically down-weighs the contributions of the aberrant CpGs on cell type deconvolution. To overcome the second challenge of unknown reference, EMeth includes a special cell type without methylation reference, but with known cell type proportions. This is motivated by cancer studies where tumor purity can be estimated from DNA copy number data, or even methylation data itself 16 . EMeth estimates the DNA methylation of this special cell type.
We evaluated EMeth and four competing methods (linear regression, robust linear regression, quadratic programming, and support vector regression) by two sets of studies where DNA methylation were generated by simulation or in silico mixture of observed cell type-specific DNA methylation. Our results demonstrate that EMeth has better performance than all the competing methods. Then we applied EMeth on four cancer types that have relatively higher mutation burdens, hence higher likelihood of immune infiltration 17 : colon adenocarcinoma (COAD), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), and skin cutaneous melanoma (SKCM). We found that cell type proportions estimated by EMeth have good consistency with the estimates from gene expression data for CD8T and B cells, but correlated poorly for T regulatory cells (Tregs). The Tregs proportion estimates by DNA methylation were negatively associated with mutation load and survival time, consistent with the functional roles of Tregs. Such associations were missed by the Treg proportion estimates by gene expression. Therefore, the cell type proportion estimates by DNA methylation provide some helpful additions to those estimates by gene expression.

Results
Method overview and study design. DNA methylation of a CpG site can be measured by a beta-value or an M-value. A beta-value is between 0 and 1 and it quantifies the proportion of DNA molecules in which this CpG is methylated. An M-value is the logit transformation of a beta-value. Appropriate decomposition should be performed using beta-values 13 .
The input data of EMeth include the methylation data (beta-values) from bulk tissue samples and the reference data of cell type-specific DNA methylation in a pre-defined set of CpGs. For each bulk sample, these CpGs can be divided into two groups: the consistent (or aberrant) CpGs on which the DNA methylation in the bulk sample is consistent (or inconsistent) with what is expected from the deconvolution model. Note that the set of aberrant CpGs may vary across bulk tissue samples. EMeth models the DNA methylation data in a tissue sample by a normal mixture of regression model with two components, designed for consistent and aberrant CpGs, respectively. We assume the aberrant component has a larger variance, which can be estimated from the data and such larger variance automatically down-weighs the contribution of the aberrant CpGs for cell type deconvolution. Within each mixture component, EMeth assumes the mean value of DNA methylation is a weighted summation of cell type-specific DNA methylation where the weights are cell type proportions. A standard EM algorithm can be used to estimate the parameters of this mixture of regression model. From our exploratory analysis, we found the estimation may be unstable when two or more cell types have highly correlated cell typespecific DNA methylation. To alleviate this problem, we added a ridge penalty on cell type compositions and incorporated it in the likelihood function for EMeth.
EMeth allows a special cell type of which the cell type proportions in tissue samples are known but the corresponding cell type-specific methylation reference is unknown. This is often the case for tumor tissues where tumor purity is known but DNA methylation in tumor cells is unknown. The methylation level of this special cell type can be estimated by borrowing information across tissue samples. For each CpG, the expected contribution of this special cell type to the observed methylation is proportional to the product of its proportion in a tissue www.nature.com/scientificreports/ sample and its methylation. Therefore, its methylation can be estimated by a regression using the methylation data of this CpG across tissue samples, where the response variable is the observed methylation, and the covariate is the proportion of this cell type. See the Methods Section for more details. Among the outputs of EMeth, most users may be interested in the estimates of cell type proportions and the DNA methylation in the special cell type. Though we also provide the estimate of aberrant probability for each CpG in each bulk sample, which can be informative for CpGs selection by other deconvolution methods that do automatically weigh each CpG.
EMeth is implemented in an open-source R package, freely available at https ://githu b.com/Hanyu z1996 /EMeth . More details of EMeth are presented in the Methods Section, and Section 1 of the Supplementary Materials.
We compared EMeth versus the following four methods.
1. Ordinary linear regression that minimizes residual sum squares of the model fit (least squares or LS).
2. Support vector regression (SVR) with linear kernel. This is the method used by CIBERSORT/CIBER-SORTx 5,15 . There is a cost parameter in svr that balances the model fit and penalty. The default cost parameter does not work well, and we choose it by cross-validation. 3. Robust linear regression (RLS) with Huber loss. This is the method used by Epidish 14 . 4. Linear regression with constraints that the regression coefficients are non-negative, solved by quadratic programming (QP). This is the method used by Houseman et al. 13 .
For all these methods, the regression coefficients are the estimates of cell type proportions. In the first three methods, a coefficient is set to 0 if it is smaller than 0. Then we re-normalize the coefficients such that they add up to one for each sample. All these four methods are based on a linear mean structure of methylation but do not consider variance structure. In contrast, EMeth uses the variance to detect aberrant CpGs. These methods also do not allow a special cell type with known proportions but unknown methylation.
Evaluation of different methods using simulated data. This simulation study is based on cell typespecific DNA methylation data from seven types of immune cells (CD4T, CD8T, B cells, Natural Killer cells, Neutrophils, Monocytes, and T Regulatory Cells). In order to generate this cell type-specific reference dataset, we combined cell type-specific DNA methylation data measured by Illumina 450k arrays from six studies [18][19][20][21][22][23] .
The raw data were processed to remove batch effects due to different data sources, see Section 2 of the Supplementary Materials for details. We first estimated the mean methylation level per CpG and per cell type using this cell type-specific reference dataset. For the special cell type without reference, the mean methylation of each CpG was simulated independently across CpGs from a uniform (0,1) distribution. Then we randomly simulated cell type proportions for each tissue sample and calculated the mean methylation of each CpG in each tissue sample by the weighted summation of cell type-specific methylation, where the weights were cell type proportions. Finally, we simulated methylation in each tissue sample by a normal distribution with these mean values and variances calculated based on a mean-variance relation of the binomial distribution (see Methods Section for more details). For each sample, the proportion of aberrant CpGs was sampled from a uniform distribution in the range of 5% to 15%. We set the number of bulk tissue samples to be 100 for all simulations and selected 946 CpGs to construct the reference data. The details of CpG selection can be found in Section 3 of the Supplementary Materials.
The key parameters in this simulation include (1) a variance parameter σ 2 c , which represents the noise level in the consistent CpGs; and (2) the ratio of the variance of aberrant CpGs vs. consistent CpGs: = σ 2 a /σ 2 c . The value of σ 2 c is based on the estimates from four cancer types that are part of The Cancer Genome Atlas (TCGA) study. We evaluate the performance of different methods by rooted mean square error (RMSE). Specifically, we first computed the difference of the estimated proportions and the true proportions for each cell type, and then computed the square root of the average of the squared difference. For each method, the average was taken across all cell types and all samples. More details of simulation setup are provided in Section 3.1 of the Supplementary Materials.
EMeth always reaches the lowest RMSE in all simulation settings (Table 1). Among the four benchmark algorithms, SVR and RLS have similar performance and both have smaller RMSE than QP, which has slightly smaller RMSE than LR. When = σ 2 a /σ 2 c increases from 2 to 100, the performance of all the four competing methods become worse, which is expected since there is more noise in the data; while in contrast, the performance of EMeth becomes better. This is because larger makes it easier for EMeth to separate the aberrant CpGs from the consistent ones and thus appropriately down-weighs their contributions to the final estimates. For example, for the middle level of σ 2 c = 0.00221 (corresponding to the σ 2 c estimate from lung adenocarcinoma (LUAD) patients) and = 2 , EMeth reduces the RMSE of SVR by 31% (from 0.466 to 0.322). For the same σ 2 c but with larger = 100 , EMeth reduces the RMSE of SVR by 36% (from 0.486 to 0.310). We also conducted additional simulations where the proportion of aberrant CpG probes varies from 5 to 35% and reached the same conclusion that EMeth consistently outperforms all the other methods (Supplementary Table S3).
We also investigated the cell proportion estimates for each cell type separately. EMeth performs the best for all the cell types in most simulation settings ( Fig. 1 and Figures S5-S22 in Section 3 of the Supplementary Materials). The estimation accuracy varies a lot across cell types. For some cell types all the methods provide accurate estimates. For example, the RMSE can be as low as 0.01 for monocyte by all the methods. This low RMSE is not an artifact of low cell type proportions since the expected proportions of all the cell types are the same in this simulation. In contrast, the RMSE can be relatively high for CD4T or CD8T. RMSE for CD8T can be as high as 0.05 for EMeth, and higher than 0.05 for all the other four methods. The difficulty in estimating the proportions www.nature.com/scientificreports/ of CD4T and CD8T is due to the high level of similarity of DNA methylation for these two cell types. In the reference data, the correlation between the average methylation of CD4T and CD8T can reach 0.96. The ridge penalty in our model helps improve the robustness against this collinearity and leads to more accurate estimates. In addition to cell type proportions, EMeth also estimates the DNA methylation in the special cell type without reference. Our simulation results show that the EMeth estimates are reasonably accurate. The correlations between the estimates and the true values are around or above 0.7 in most cases (Table S2 in  Evaluation of different methods using in silico mixtures of cell type-specific DNA methylation data. Next, we evaluated different methods by directly mixing individual-specific and cell type-specific DNA methylation data to generate pseudo tissue samples. This approach captures the variation of cell type-specific DNA methylation across individuals and does not rely on any distribution assumption. To this end, we used the methylation and gene expression data in three human immune cell types (Monocytes, Neutrophils and naive CD4 T cells) from blood samples of 197 individuals, which were generated as part of the BLUEPRINT project 25 . This is a very unique and valuable dataset because it includes cell type-specific omic data in a large number of individuals. We selected the 124 individuals with complete gene expression and DNA methylation data for all three cell types for this evaluation. We compared cell type decomposition based on DNA methylation data by EMeth and four other methods as well as cell type decomposition using gene expression data by CIBERSORTx 26 . We did not consider the special cell type with unknown methylation in this in silico mixing dataset.
For each individual, we simulated a mixture by linearly combining cell type-specific data, followed by adding Gaussian noise in M-value scale for methylation data and in log-scale for expression data. The only parameter in this simulation study is the variance of the Gaussian noise. Since we use the same mixture proportions on both methylation and gene expression data, we are able to compare the cell type proportion estimates from these two types of data. We used the data from 56 individuals to construct the reference data and the remaining 68 individuals to generate mixture samples. We added the same level of noise to both expression and methylation data. See Section 4 of the Supplementary Materials for more details.
Both EMeth and RLS have accurate estimation results (correlation with true cell type proportions as high as 0.95 and RMSE around 10 −3 for each cell type) and they consistently outperform other methods (Fig. 2). The estimates by EMeth and RLS based on methylation data are more accurate than the estimates by CIBERSORTx based on expression data. The superior performances of EMeth and RLS are more apparent by examining the cell type proportion estimates for each cell type across all the individuals (Fig. 3).
Results on TCGA data. We studied immune cell type composition for four cancer types (colon cancer, lung adenocarcinoma, lung squamous cell carcinoma, and skin cutaneous melanoma) using the gene expression data (bulk RNA-seq data) and DNA methylation data (Illumina 450k array) from The Cancer Genome Atlas (TCGA). We applied EMeth using the reference data of the seven immune cell types used in our simulation study. To estimate cell type proportions using gene expression data, we applied CIBERSORTx 26 using its default www.nature.com/scientificreports/ reference of 22 immune cell types (referred to as LM22), followed by cell size correction. Cell size correction is needed because CIBERSORTx estimates the proportion of gene expression from each cell type, and thus if one cell type has more transcripts per cell, its proportion will be over-estimated 28 . We further collapsed the cell type proportion estimates to the seven cell types used for DNA methylation-based deconvolution to facilitate direct comparisons. Across all the cancer types, the cell proportion estimates using DNA methylation and gene expression have good consistency for B cells and monocytes, with correlation > 0.7 for most cancer types or deconvolution methods ( Fig. 4 and Figures S23-S25 in Supplementary Materials). The proportions of CD8T have good consistency by correlation, though the estimates using gene expression is often higher, leading to relatively large RMSE when comparing the estimates using the two types of data ( Figures S26-S30). Consistency for CD4T and neutrophil varies according to cancer types or methods. The proportion estimates of natural killer and T regulatory cells (Treg) by gene expression and DNA methylation often have very low correlations ( Figures S31-S34), which could be partly due to the low abundance of these two cell types.
Without independent cell type proportion estimates by other approaches for the same samples (e.g., flow cytometry), it is not possible to assess which type of omic data (gene expression or DNA methylation) provide more accurate estimates. We sought to assess this indirectly by examining the associations between cell type proportions and two variables of interest: mutation burden and survival time.
We used the mutation burden calculated in an earlier work of our group 29 , where we defined the mutation burden of a tumor sample as the total number of non-silent single nucleotide variants (SNVs) plus the number of indels. A tumor cell with higher mutation burden may present more mutated peptides on its cell surface, and thus has higher chance to be recognized by the immune system. This will in turn triggers immune infiltration in tumor, particularly CD8 T cells. Colon cancer is an ideal case to study the relation between mutation burden    www.nature.com/scientificreports/ and immune infiltration because there is a subset of tumor samples that are hypermutated (Fig. 5A). We observe that CD8T cells proportions (either the ones estimated using gene expression/CIBERSORTx or DNA methylation/EMeth) are indeed higher in the hypermutated subset (Fig. 5B). The Treg proportions estimated by gene expression are similar between the two subsets. In contrast, Treg proportions estimated by DNA methylation are lower in the hypermutated subsets (Fig. 5B), consistent with earlier finding that Tregs are depleted in hypermutated colon cancer samples 30 . Angelova et al. 30 used gene expression to infer immune cell abundance. Instead of deconvolution based on references, as was done by CIBERSORTx or EMeth, they inferred the abundance of a cell type using weighted average of a group of genes that are highly expressed in that cell type. Therefore the method of Angelova et al. is less quantitative (since they do not give exact proportions) but more robust because they do not fit a deconvolution model. The fact that their conclusions match our EMeth results (i.e., Tregs are depleted in hypermutated colon cancer samples) provides a strong support for the accuracy of EMeth.
We assessed the associations between survival time and cell type proportion estimates (from either gene expression or DNA methylation) in all cancer types and only detected significant associations in melanoma (SKCM) samples, therefore here we focused on the results from SKCM samples. The association patterns for  www.nature.com/scientificreports/ CD8T are consistent between cell type proportions estimated by gene expression and DNA methylation: higher abundance of CD8T is associated with better survival outcomes (see Fig. 5C for log-rank test results and Table S4 in the Supplementary Materials for Cox regression results). This is expected since more immune infiltration leads to better survival outcomes 2,3 . Since T regulatory cells function in suppressing tumor-specific immune responses 31 , negative association between Treg abundance and survival time is expected. This is the case for the DNA-methylation-based Treg proportions, but not true for the gene expression-based Treg proportions ( Fig. 5C and Table S4 in Supplementary Materials). The discrepancy of Treg proportion estimates by gene expression and DNA methylation could be due to inaccuracy of cell type-specific reference. While DNA methylation of bulk tumor samples and reference were both measured by Illumina 450k arrays, the gene expression data were measured by two different platforms: RNA-seq for bulk tumor samples and microarray for reference. Such platform difference could lead to bias in cell proportion estimates. To address this potential issue, we prepared another gene expression reference of 11 cell types using single cell RNA-seq (scRNA-seq) data of 16,291 immune cells from 48 tumor samples of melanoma patients 33 , and we refer to it as SF11 reference matrix. For each cell type, gene expression in this scRNA-seq data has positive, but sometime moderate correlation with the gene expression in the LM22 matrix ( Figure S34), reflecting difference due to two platforms and two types of samples (blood samples for LM22 and tumor samples for SF11). We selected 594 genes for the SF11 signature matrix based on the results of differential expression analysis. The expression of a few signature genes were illustrated in Figures S35-S40. See Section 5.2 of the Supplementary Materials for more details on how this SF11 signature was prepared, and the complete pipeline is available at https ://githu b.com/Sun-lab/IT-predi ctor/tree/maste r/R/SF201 8_analy sis.
We estimated cell proportions in the TCGA SKCM samples by CIBERSORTx, using the SF11 signature matrix, and compared the estimates with the estimates based on gene expression LM22 signature and DNA methylation www.nature.com/scientificreports/ EMeth signature ( Figure S41). The cell types that have relatively high correlation between LM22 and EMeth estimates, such as B cells or CD8T cells, also have high correlation between LM22 and SF11 estimates. In contrast, Treg proportion estimates using SF11 and LM22 have weaker correlation. In fact, the Treg proportion estimates using LM22 has stronger correlation with SF11 estimates in 3 other cell types (correlation 0.74, 0.55, and 0.54 for B cells, CD4T, and monocyte, respectively) than with Treg itself (correlation 0.5). The Treg proportion estimates using either SF11 or LM22 have negative correlations with EMeth estimates (correlation -0.18 and -0.2 for SF11 and LM22 estimates, respectively, Figure S41). It is worth noting that for some cell types, the proportion estimates from gene expression and DNA methylation have higher correlations than the estimates from two gene expression signatures. For example, for B cells, the correlations between EMeth estimates and SF11/LM22 estimates are 0.84 and 0.90, respectively, while the correlation between SF11 and LM22 estimates is 0.79.

Method
EMeth overview. Suppose we have I bulk samples, each one of which consists of a mixture of Q + 1 cell types. Assume the cell type-specific methylation are known for the first Q cell types, and the mixture proportion is denoted by ρ qi for the q-th cell type in the i-th sample. The (Q + 1)-th cell type is a special one that does not have cell type-specific methylation information but has known cell type proportions in bulk samples, denoted by η i for sample i. In tumor samples, this cell type can be tumor cell and its proportion is tumor purity that can be estimated by copy number data. We also allow aberrant CpGs whose cell type-specific methylation is inconsistent between bulk tissue sample and reference of cell type-specific methylation.
Denote the methylation (in the scale of β-value) at the k-th CpG and the i-th bulk sample by z ki . We model z ki by a mixture of two normal distributions.
Here π ai and π ci are the probabilities that a CpG being aberrant or consistent in sample i, respectively, and π ai + π ci = 1 . N (µ ki , σ 2 a w ki ) , the aberrant component, is a normal distribution with mean value µ ki and variance σ 2 a w ki . The consistent component N (µ ki , σ 2 c w ki ) is another normal distribution with the same mean value but smaller variance: σ c < σ a . The decomposition into multiple cell types is achieved by a regression model µ ki = η i ν 0k + Q q=1 ρ qi ν qk , where η i + Q q=1 ρ qi = 1 . Furthermore, we assume that η i is known for each sample and ν qk is known for each CpG and each cell type. From this model, we can see that the cell type-specific DNA methylation in the special cell type ( ν 0k ) is estimable because it can be considered as the regression coefficient for η i in a CpG-specific regression. As the consequence, a reasonable sample size and sufficient variation of cell type proportions across tissue samples is needed to estimate ν 0k .
By default, we set w ki ≡ 1 , other choices of w ki may further improve the performance of EMeth. As an example, in the Supplementary Materials we compare original version of EMeth with a new version EMeth-Binom that sets the weights based on the binomial distribution: w ki = q ρ qi ν qk (1 − ν qk ) + η i ν 0k (1 − ν 0k ) . EMeth-Binom can provide more accurate estimates of the model parameters but did not improve the estimates of cell type proportions, and thus we focus on the results of standard EMeth with all the weights being 1. In our simulation studies, we simulated DNA methylation using the mean-variance relation specified by EMeth-Binom.
Since we are interested in estimating ρ qi , the model can be reorganized as where ε ki ∼ π ai N (0, σ 2 a w ki ) + π ci N (0, σ 2 c w ki ) and Q q=1 ρ qi = 1 − η i , ρ qi ≥ 0 . For this mixture of regression model, it is natural to propose an EM-type algorithm to maximize the likelihood. The reference data often have strong collinearity between two closely related cell types. We provide a simple remedy to this issue by adding a ridge penalty term: ρ 2 qi . The missing data in EM algorithm is denoted by γ ik , which denotes whether the j-th CpG is aberrant or consistent for the i-th bulk tumor sample. To conclude, in our model the complete penalized likelihood is given by where e ki = z ki − η i ν 0k − Q q=1 ρ qi ν qk . The details of the computation of EMeth are presented in Section 1 of the Supplementary Materials. We also consider a variant by replacing the normal distribution with Laplace distribution (EMeth-Lap). Note that the Laplace distribution is also in the location-scale family, the main difference is to replace the ℓ 2 -loss corresponding to normal distribution to ℓ 1 -loss corresponding to the Laplace distribution, i.e., replace e 2 ki by |e ki | . We still www.nature.com/scientificreports/ include the ridge penalty term in EMeth-Lap. Its performance is similar to standard version of EMeth and thus we focus on the results of the standard EMeth in this paper.

Discussion
Estimates of cell type proportions, particular for immune cell infiltration in tumor samples, are highly valuable for down-stream association analyses. Many efforts have been devoted to this topic and most of the existing works focus on deconvolution of cell type proportions using gene expression data. DNA methylation is an attractive alternative because it is often a more stable than gene expression. Our method, EMeth, is designed for deconvolution using DNA methylation and it has two novelties. One is to allow a subset of aberrant CpGs and thus increase the robustness of the method. The other one is to allow one cell type without reference and estimate the DNA methylation of this cell type. We have demonstrated the advantages of EMeth in simulations studies. In our case study using TCGA data, the proportion of T regulatory cells (Tregs) estimated from gene expression and DNA methylation have very low correlation, and discrepant association results. We found the association results using Treg proportion estimates from DNA methylation is more consistent with the functional roles of Tregs. However, for the cell types with low proportions across most of the samples, e.g., Tregs, the estimation noise may have large impact on the association signals and further studies are warranted.
In this study, we focus on Illumina 450k methylation array because it is the platform where we can find good amount of cell type-specific and bulk tissue DNA methylation data. Since we only select around 900 CpGs for the final deconvolution, our approach should be applicable to the platforms with fewer CpGs such as the Illumina 27k methylation array that still have enough CpGs to make selection. Our method is also applicable to the newer arrays such as the MethylationEPIC arrays with 850k CpGs or whole genome bisulfite sequencing. It is ideal to generate cell type-specific reference using the same platform though we expect the reference generated from 450k array can be used for deconvolution of methylation data from other array platforms, after a new round of CpG selection to match the array platforms. Whether such array based reference is appropriate for deconvolution using bisulfite sequencing data warrants further studies.
Although our EMeth method is motivated by cancer studies (e.g., the special cell type with known proportion is designed to mimic tumor cells), it can be applied to other tissues, such as blood, to study the cell type composition associated with other disorders such as autoimmune diseases. Our simulation study is designed to mimic the level of noise observed in cancer studies (Table 1). Since the noise in tumor tissues is often larger than other tissues, the simulation study could be considered as a more challenging case. In contrast, our in silico study used a very unique and valuable dataset with cell type-specific DNA methylation measured in ∼ 200 individuals. The unique advantage of this dataset is that we can generate mixture of multiple cell types while preserving the biological variation across individuals. At the same time, we can use a subset of the individuals to estimate cell type-specific reference, and evaluate deconvolution performance using the remaining individuals, and thus preserve the biological variation between reference data and bulk tissue samples. Because the reference and bulk are generated from the same study and thus we can avoid batch effects, this in silico study represents an easier case. The fact that EMeth performs well in both cases suggest it has robust performance in different settings.
A limitation for DNA methylation-based deconvolution, compared with gene expression-based ones is the lack of cell type-specific DNA methylation data in different cell types. For example, in our deconvolution analysis, we only considered seven cell types. Some of the cell types could be further refined and it is desirable to include more cell types. We believe as more works demonstrate the value of cell type-specific DNA methylation data, and with the technique advancement, more of such reference data will be generated in the near future. Although EMeth allows a special cell type without reference, we do assume the proportions of this cell type are known. EMeth cannot handle an arbitrary missing cell type without either cell type-specific reference or cell type proportions.
When the cell type-specific reference is inappropriate, EMeth may find the vast majority of of the CpG probes are aberrant. With a very limited number of CpG probes usable, there is risk of over-fitting. EMeth will issue a warning in such situation and suggest improvement of the cell type-specific references.
Single cell DNA methylation data is potential resource to generate cell type-specific reference. However, currently such data are often very sparse and expensive to generate. A promising direction is to leverage cell type-specific scRNA-Seq data to generate corresponding cell-type-specific DNA methylation reference matrix 34 . In the near future, the technique development may be more mature to measure both gene expression and DNA methylation in the same single cell 35 , and thus making it feasible to combine gene expression and DNA methylation to infer cell type-specific DNA methylation reference matrix.
Finally, we conclude with the computational efficiency of EMeth. The EM is known to be slow in convergence. Though in our application, we use EM for a classical mixture model where convergence is relatively fast. For example, in our analysis of TCGA melanoma data from 393 bulk tumor samples, it took about 10 minutes to run the EM algorithm 20 times (4 penalty parameters and 5-fold cross validation), using a MacBook Pro with 2.3GHz CPU and single thread. The pipelines for analyzing simulated data and TCGA data can be found at https ://githu b.com/Sun-lab/dMeth .