Abstract
In epigenome-wide association studies, the measured signals for each sample are a mixture of methylation profiles from different cell types. Current approaches to association detection report only whether a cytosine-phosphate-guanine (CpG) site is associated with the phenotype at the aggregate level, and they can suffer from low statistical power. Here, we propose a statistical method, HIgh REsolution (HIRE), which not only improves the power of association detection at the aggregate level compared with existing methods but also enables the detection of risk-CpG sites for individual cell types.
Introduction
Epigenome-wide association studies (EWAS) aim to identify cytosine-phosphate-guanine (CpG) sites associated with phenotypes of interest, such as disease status^{1,2,3}, smoking history^{4,5}, body mass index^{6}, and age^{7,8}. However, because the samples in EWAS are measured at the bulk level rather than at the single-cell level, the obtained methylome for each sample shows the signals aggregated from distinct cell types^{3,9,10}, which leads to two main challenges in the analysis of EWAS data. On the one hand, the cell type compositions differ among samples and can be associated with phenotypes^{3,10}. Both binary phenotypes, such as the diseased or normal status^{3}, and continuous phenotypes, such as age^{10}, have been found to affect the cell type compositions. As a result, ignoring the cellular heterogeneity in EWAS can lead to many spurious associations^{10,11,12,13}. On the other hand, the phenotype may change the methylation level of a CpG site in some but not all of the cell types. Identifying the exact cell types that carry the risk-CpG sites can deepen our understanding of disease mechanisms. However, such identification is challenging because only the aggregated-level signals can be observed.
To the best of our knowledge, no existing statistical method for EWAS can detect cell-type-specific associations despite active research to account for cell-type heterogeneity. The existing approaches can be categorized into two schools^{14}: reference-based and reference-free methods. The reference-based methods^{9,15} require the reference methylation profiles for each cell type to be known a priori, and they regress the aggregated methylation levels observed from each sample on the same set of references to learn the sample’s cellular compositions. However, because samples have different attributes, such as age and gender, the methylation levels of a given cell type can vary among samples. It is thus problematic to assume that all of the samples have the same set of reference profiles^{10,14}. Furthermore, high-quality references are difficult to obtain for most EWAS due to the existence of unknown cell types, the high cost of cell sorting, and confounding effects^{14}. Consequently, a large amount of recent EWAS literature has been devoted to identifying risk-CpG sites without the need for reference methylation profiles.
The reference-free methods can generally be further divided into two classes according to whether they estimate the cell-type mixing proportions directly. The direct-decomposition-based procedures consist of two stages. In the first stage, they simultaneously estimate the cellular compositions of each sample and the cell-type-specific reference methylomes via quadratic programming^{16}; in the second stage, they treat the estimated cell-type proportions as covariates with additive effects in linear models to conduct association tests. However, when estimating cellular compositions during the first stage, the direct-decomposition-based methods also ignore the samples’ phenotype information, and thus suffer from the same biased cellular composition estimates as the reference-based approaches^{9}. Moreover, similar to tumor purity^{17}, we argue that the estimated cellular composition has a multiplicative rather than an additive effect on the observed methylation level (Methods). The second class of methods, which is exemplified by SVA^{18}, RefFreeEWAS^{19}, and ReFACTor^{13}, does not carry out cell-type decompositions. These methods resort to singular value decomposition, which includes principal component analysis, to construct surrogates for the underlying cell-type composition. EWASher, a linear mixed model, also belongs to this class because it is equivalent to the use of principal components as fixed-effect covariates^{11}. However, the use of principal components as covariates in the regression suffers from the same issue of additive effects as the direct-decomposition-based methods. Therefore, the existing reference-free methods have low power in detecting risk-CpG sites^{12}.
Although the existing methods aim to address the cellular heterogeneity problem in EWAS and report whether a CpG site is associated with phenotypes at the aggregate level, none of them can identify the risk-CpG sites for each individual cell type, thus missing the opportunity to obtain finer-grained results in EWAS.
Here, we propose a method, HIRE, to identify the association in EWAS at a HIgh REsolution: detecting whether a CpG site has any associations with the phenotypes in each cell type (Methods). The keys to HIRE’s success are twofold. First, HIRE links the underlying cell-type-specific methylation profiles for each sample to the sample’s phenotypes, thus avoiding the bias in estimating the cellular composition that affects the reference-based and direct-decomposition-based methods. Second, HIRE correctly characterizes the cellular compositions as multiplicative effects, whereas the existing methods inappropriately treat the cell proportions as additive effects (Methods). HIRE is applicable to EWAS with binary phenotypes, continuous phenotypes, or both. By helping researchers understand in which cell types the CpG sites are affected by a disease, HIRE can ultimately facilitate the development of epigenetic therapies by targeting the specifically affected cell types.
Results
Method overview
HIRE is a hierarchical model that closely follows the data generation process. Its elaborate modeling depicts how phenotypes affect the methylation levels of each sample. Here, we briefly introduce the method. The technical details are provided in the Methods section and the Supplementary Methods.
Let us first review the cornerstone of most EWAS approaches. These methods model the observed methylation levels of the m CpG sites for sample i, O_{i} = (O_{1i}, O_{2i}, …, O_{mi})^{T}, as the weighted average of the methylation profiles of K cell types, u_{i} = (u_{i1}, u_{i2}, …, u_{iK}). The weights are the cellular compositions p_{i} = (p_{1i}, p_{2i}, …, p_{Ki})^{T} of sample i (see the top panel of Fig. 1a). However, regardless of whether the reference is known a priori or not, the existing methods assume that the cell-type-specific methylation profiles u_{i} remain the same for all samples: u_{i} = M, for i = 1, …, n. Unfortunately, because the methylation levels can actually change with covariates such as age and disease status, ignoring the covariates’ effects and enforcing static reference methylomes can bias the estimation of p_{i} and thus affect all downstream analyses^{14}. More importantly, the assumption that cell-type-specific methylation profiles are the same for each sample prevents the detection of cell-type-specific risk-CpG sites.
For association detection at the aggregate level, after estimation of p_{i} using the deconvolution-based approach or its surrogates from principal-component-based methods, the existing methods examine a linear model in which the phenotypes \({\mathbf{x}}_i = (x_{i1}, \ldots ,x_{i\ell }, \ldots ,x_{iq})^T\) and the cellular proportions p_{i} exert additive effects on the methylation level O_{i}:
\[ {\mathbf{O}}_i = {\mathbf{M}}{\mathbf{p}}_i + {\mathbf{T}}{\mathbf{x}}_i + {\mathbf{e}}_i, \tag{1} \]
where \({\mathbf{T}} = (T_{j\ell })\) is the m by q matrix of covariate coefficients and \({\mathbf{e}}_i\) is a noise term.
A CpG site j is then associated with phenotype \(\ell \) if we reject the null hypothesis that the covariate coefficient \(T_{j\ell }\) equals zero.
In contrast, HIRE further models the effect of each phenotype on each cell type, as shown in the bottom panel of Fig. 1a. In cell type k, sample i’s cell-type-specific methylation profile, u_{ik}, is the sum of the corresponding baseline cell-type-specific methylation levels, μ_{k}, and the phenotype effects \({\mathbf{B}}_{k\ell }x_{i\ell }\) on sample i from all of the \(\ell = 1, \ldots ,q\) phenotypes: \({\mathbf{u}}_{ik} = {\boldsymbol{\mu }}_k + \mathop {\sum}_{\ell = 1}^q {{\mathbf{B}}_{k\ell }} x_{i\ell }\), where \(x_{i\ell }\) is phenotype \(\ell \) of sample i and \({\mathbf{B}}_{k\ell } = (\beta _{1k\ell }, \ldots ,\beta _{mk\ell })^T\), the kth column of \({\mathbf{B}}_\ell \), reflects the association of phenotype \(\ell \) with each of the m CpG sites in cell type k. Thus, by collecting the baseline cell-type-specific methylation profiles into μ = (μ_{1}, …, μ_{K}) and denoting the m by K phenotype coefficient matrix \((\beta _{jk\ell }:1 \le j \le m,1 \le k \le K)\) by \({\mathbf{B}}_\ell \), we now have:
\[ {\mathbf{O}}_i = {\boldsymbol{\mu }}{\mathbf{p}}_i + \mathop {\sum}_{\ell = 1}^q {{\mathbf{B}}_\ell } x_{i\ell }{\mathbf{p}}_i + {\mathbf{e}}_i, \tag{2} \]
where \({\mathbf{e}}_i\) is a noise term.
A comparison of \({\mathbf{x}}_i\) in Eq. (1) and \(x_{i\ell }{\mathbf{p}}_i, \ell = 1, \ldots ,q\) in Eq. (2) reveals that, via the two-layer hierarchical model, HIRE correctly captures the multiplicative effects of the cellular compositions on the phenotype effects (see also Methods and the Supplementary Methods). As a result, HIRE achieves greater statistical power for association detection at the aggregate level and enables the fine-scale resolution that was previously infeasible. We mathematically prove that the HIRE model is identifiable under mild conditions that are easily met in reality (see Theorem 1 and its proof in Methods).
Figure 1b summarizes the inputs and outputs of HIRE. Given the methylation measurements at the aggregate level of n samples, HIRE can estimate all parameters of interest: p_{i} (i = 1, …, n), μ, and \({\mathbf{B}}_\ell \) (\(\ell = 1, \ldots ,q\)). HIRE then determines whether any association exists between CpG site j and phenotype \(\ell \) in each individual cell type by testing the hypotheses \(H_0:\beta _{jk\ell } = 0\) versus \(H_1:\beta _{jk\ell } \ \ne\ 0\). When the null hypothesis \(H_0:\beta _{jk\ell } = 0\) is rejected, HIRE calls CpG site j a risk-CpG site for phenotype \(\ell \) in cell type k. The detection of cell-type-specific risk-CpG sites cannot be performed with any of the existing state-of-the-art methods.
Moreover, HIRE allows users to prespecify the number of cell types K. When K is unknown, HIRE selects the number of cell types according to the penalized Bayesian information criterion (pBIC)^{20} (Supplementary Methods).
Simulation
As the definition of the gold standard for real data is debatable^{21,22}, we designed extensive simulation studies to evaluate the performance of HIRE and compared it with commonly used methods: unadjusted analysis, SVA, RefFreeEWAS, EWASher, and ReFACTor (Methods). We generated datasets in which the observed methylation was a mixture of several cell types and each sample was accompanied by a diseased or normal status and a continuous age attribute. We deliberately designed some cell types to have similar baseline methylation profiles to mimic cell types from the same cell lineage. We set the sample size n to 180, 300, and 600 and let the underlying cell type number K be 3, 5, and 7. For each pair of (n, K), we investigated two scenarios in which (1) all phenotype effects \(\beta _{jk\ell }\) are zero (the true null case) to compare the ability of each method to control false positives; and (2) a small portion of the \(\beta _{jk\ell }\) are nonzero (the true alternative case) to study each method’s power to detect risk-CpG sites. Under the true alternative, both the binary and the continuous phenotypes were assumed to have cell-type-specific risk-CpG sites and to affect the cell-type proportions among the samples^{10}. We further simulated phenotype effects with various directions and magnitudes.
Under the true null, HIRE, EWASher, and ReFACTor control the false positive rates (FPRs) very well: none exceeds 0.05% (Table 1 and Supplementary Figs. 1–9). In comparison, RefFreeEWAS often has FPRs greater than 0.1% and thus does not perform as well as HIRE, and the unadjusted analysis and SVA further suffer from dramatic inflation of false positives. For the true alternative settings, among the methods whose FPRs are well-controlled (below 0.05%), HIRE achieves the highest true positive rate (TPR) in every simulation setting (see also Fig. 2a and Supplementary Figs. 10–17). As expected, HIRE’s power increases with the sample size. For example, when the data include five cell types, HIRE can identify 89.6% of the risk-CpG sites with 300 samples, and HIRE can detect almost all risk-CpG sites when the sample size reaches 600, which is a typical sample size for EWAS. Although EWASher and ReFACTor have low FPRs, they miss a large proportion of risk-CpG sites. EWASher’s maximum TPR is only 35.33%, and ReFACTor’s maximum TPR is slightly over 60%. However, in those cases, HIRE’s power is greater than 95%. Consistent with the true null scenario, in the true alternative, RefFreeEWAS has inflated FPRs compared to HIRE, and the unadjusted analysis and SVA always produce large numbers of false positives. Therefore, HIRE substantially improves the power of association detection at the aggregate level compared with existing methods.
In multiple hypothesis testing, the p-values from the truly null features should follow a uniform distribution on (0, 1), whereas those for the truly alternative features are concentrated near zero^{23}. Both the histograms (Fig. 2d–i) and the Q-Q plots (Fig. 2j–o) show that the p-value distribution of HIRE best fits the underlying truth, in which only a small proportion of sites carry signals, followed by RefFreeEWAS and ReFACTor. EWASher tends to overcorrect signals, with its p-value density showing a dip near zero (Fig. 2h), thus failing to detect the true associations. In contrast, the unadjusted analysis and SVA generate very small p-values clustered near zero, resulting in inflated type I errors.
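The expected shape of the p-value distribution can be illustrated with a small simulation. The following is a minimal sketch, not part of the HIRE software: it assumes z-statistics rather than the t-statistics used in the paper, and the function names, the number of features, and the effect size `shift` are all made up for illustration.

```python
import math
import numpy as np

def two_sided_p(z):
    """Two-sided p-values for standard-normal test statistics."""
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in np.ravel(z)])

def simulate_pvalues(n_null=9000, n_alt=1000, shift=4.0, seed=0):
    """Return (p-values of null features, p-values of alternative features).

    Null statistics are N(0, 1); alternative statistics are N(shift, 1).
    All settings here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    z_null = rng.standard_normal(n_null)
    z_alt = rng.standard_normal(n_alt) + shift
    return two_sided_p(z_null), two_sided_p(z_alt)

p_null, p_alt = simulate_pvalues()
# Null p-values should fill the ten bins of (0, 1) roughly evenly,
# while alternative p-values pile up near zero.
null_counts, _ = np.histogram(p_null, bins=10, range=(0.0, 1.0))
frac_null_small = float(np.mean(p_null < 0.05))
frac_alt_small = float(np.mean(p_alt < 0.05))
```

A histogram of the pooled p-values from such a simulation is flat with a spike near zero, which is the reference pattern against which the density and Q-Q plots are read.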
In addition to the traditional association detection at the aggregate level, HIRE can identify the association of each CpG site with the phenotypes in each cell type. Table 2 shows the FPR and TPR of HIRE for each cell type in various simulation settings. Such fine analysis is not possible with the other methods. Consistent with association detection at the aggregate level, HIRE always controls the FPR well. When K = 3 and n = 180, HIRE accurately detects the risk-CpG sites associated with disease status, with a TPR of greater than 83% and an FPR of 0.01% or less in all three cell types. Similarly, most of the CpG sites affected by age are also correctly identified in each cell type. HIRE’s learned cell-type-specific association patterns closely match the underlying true associations (see Fig. 2b, c and Supplementary Figs. 18–26). Once again, HIRE’s power decreases with the number of cell types and increases with the sample size. When the samples consist of seven cell types and the proportion of the least abundant cell type is as low as 4.2%, given a typical current EWAS with around 600 samples, HIRE can detect most cell-type-specific risk-CpG sites reasonably well. Moreover, HIRE’s estimates for the baseline methylation profiles, cellular compositions, and phenotype effects have little bias (Supplementary Figs. 27–62); therefore, HIRE can provide accurate estimates and is powerful in detecting cell-type-specific risk-CpG sites.
In the HIRE model, we assume that different CpG sites are independent, and we investigate the performance of HIRE when this assumption is violated and dependence exists among nearby CpG sites. Specifically, we assume that every 50 consecutive CpG sites belong to a block. For CpG sites within the same block, their random noises \({\boldsymbol{\epsilon }}\) follow a multivariate normal distribution with mean zero and 50 × 50 covariance matrix Σ, and Σ’s corresponding correlation matrix has its (i, j) entry equal to ρ^{|i−j|}. We set ρ to 0.8, 0.6, and 0.4. A comparison of Supplementary Tables 1–3 with Supplementary Table 4 shows that even when strong correlations exist among nearby CpG sites, HIRE still performs well in controlling the FPR and detecting the risk-CpG sites under this model misspecification.
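The block-correlated noise described above can be generated as follows. This is a minimal sketch under stated assumptions: the noise scale `sigma` is not given in the text and is chosen arbitrarily, the exponent of ρ is taken to be the absolute index difference, and the function names are hypothetical.

```python
import numpy as np

def ar1_correlation(size, rho):
    """Correlation matrix whose (i, j) entry is rho ** |i - j|."""
    idx = np.arange(size)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def block_correlated_noise(m, n, rho, sigma=0.01, block=50, seed=0):
    """Noise for m CpG sites and n samples, correlated within blocks of sites.

    Sites in different blocks are independent; `sigma` is an assumed
    common noise standard deviation.
    """
    if m % block != 0:
        raise ValueError("m must be a multiple of the block size")
    rng = np.random.default_rng(seed)
    cov = sigma ** 2 * ar1_correlation(block, rho)
    chol = np.linalg.cholesky(cov)                  # cov = chol @ chol.T
    z = rng.standard_normal((m // block, block, n))
    return (chol @ z).reshape(m, n)                 # rows: sites, columns: samples

noise = block_correlated_noise(m=100, n=5000, rho=0.8)
within = np.corrcoef(noise[10], noise[11])[0, 1]    # adjacent sites, same block
across = np.corrcoef(noise[49], noise[50])[0, 1]    # adjacent sites, different blocks
```

Adjacent sites in the same block should show an empirical correlation close to ρ, while sites on either side of a block boundary should be approximately uncorrelated.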
To further evaluate HIRE’s performance on experimentally mixed samples, we analyzed a semi-simulated dataset that includes six samples mixed from six purified cell types in predetermined proportions^{24}. Once again, HIRE successfully recovers the six underlying reference cell types and estimates the cellular compositions well (see Methods).
Real data analysis
HIRE also provides greater insight into real data than previous studies. The rheumatoid arthritis (RA) dataset^{3} contains methylation profiles collected from the whole blood of 354 patients with RA and 335 normal participants. In addition to the RA status, other attributes such as gender, smoking history, age, and batch information are available. We first corrected the batch effects and then applied HIRE to the dataset (Methods). Figure 3a displays the p-values regarding the association with the RA status for each CpG site in each cell type, in which HIRE selected six cell types (Supplementary Fig. 63a), consistent with the number of cell types in the previous study^{13}. Despite potential batch effects and biological variability, three of the six cell types can be matched to known blood cell references: cell type 1 was matched to CD4+ T cells, cell types 2 and 4 were matched to neutrophils, and the remaining three cell types cannot be aligned to the references (Methods and Supplementary Fig. 64). HIRE detected 63 risk-CpG sites in cell type 3, the largest number of associations across all cell types, but no risk-CpG sites in cell type 1 (Supplementary Table 5). Therefore, the disease status affected some but not necessarily all cell types. Note that the significant CpG site cg06373940 called by HIRE is located on gene ERCC3. The level of ERCC3’s corresponding protein has been reported to increase in RA synovium^{25}. Moreover, we found that five CpG sites had a significant association with smoking history (Supplementary Fig. 65 and Supplementary Table 6). One of them is cg05575921, which was recently linked to smoking in two other independent studies of blood samples^{26,27}. However, these findings were missed by the association detection at the aggregate level in previous analyses of the same dataset^{11,13}. The p-value density plots and Q-Q plots for the commonly used methods are also displayed in Fig. 3c–n; they present patterns similar to those observed in the simulation study except for an obvious overcorrection by ReFACTor.
The high resolution provided by HIRE makes it a powerful tool for EWAS. Rahmani et al. used ReFACTor^{13} to analyze the GALA II blood methylation dataset^{28}, which consists of 573 samples collected from a pediatric Latino population. Each sample includes the gender information and belongs to one of the following four populations: Mexican, Mixed Latino, Puerto Rican, and Other Latino. We applied HIRE to the dataset to investigate whether any cell-type-specific CpG sites were associated with gender and ethnicity. We created three dummy variables to represent the four ethnic groups. By taking the indicators of ethnicity as phenotypes in the model, HIRE automatically and simultaneously accounts for the population differences in cell composition and cell-type-specific methylation levels. HIRE selected six cell types, the same number as reported in the previous study^{13} (Supplementary Fig. 63b). According to cell-type alignment, cell types 1 and 5 can be annotated as CD4+ T cells; cell types 2, 3, and 4 as neutrophils; and cell type 6 as CD56+ natural killer cells (CD56+ NK) using the references (Supplementary Fig. 66). HIRE found that 1936 CpG sites were associated with ethnicity across all cell types (Supplementary Fig. 67) and identified 14, 52, 155, 15, 18, and 14 risk-CpG sites for gender in cell types 1–6, respectively (Fig. 3b). Gene set enrichment analysis showed that the genes that harbored risk-CpG sites for gender were significantly enriched in seven canonical pathways (Supplementary Table 7), of which the PID_CMYB_PATHWAY was ranked the highest. The transcription factor c-MYB in the PID_CMYB_PATHWAY enhances the progression of breast cancer^{29}; therefore, the different occurrence rates of breast cancers in men and women may be linked to the differences at the epigenome level.
In comparison, only one pathway was found to be enriched with the genes that host the risk-CpG sites claimed by ReFACTor at the aggregate level (Supplementary Table 8). All of these observations highlight the importance of the finer-scale resolution of HIRE.
Discussion
In reality, the phenotype may affect a risk-CpG site in some but not all of the cell types. HIRE can detect the cell-type-specific association pattern with each phenotype for EWAS. The identification of cell-type-specific risk-CpG sites will help epigenetic therapies to target the affected cell types in a more effective manner.
Statistically, instead of assuming fixed reference methylomes for all samples as the existing methods do^{9,13,16}, HIRE allows each sample’s cell-type-specific methylation profiles to depend on its phenotypes. Consequently, HIRE correctly models the multiplicative effects of the cellular compositions on the observed methylation levels, whereas the existing approaches all misspecify the cellular compositions as additive effects (Methods). HIRE therefore enables the detection of cell-type-specific risk-CpG sites that cannot be feasibly detected with existing state-of-the-art methods. As a byproduct, HIRE also improves the statistical power of association detection at the aggregate level relative to existing state-of-the-art methods. Computationally, the time complexity of one iteration of HIRE is O(nmKq + nK^{3}), where q is the number of phenotypes, which keeps each iteration fast when K is moderate. The statistical and computational advantages equip HIRE to be scaled up for large-cohort EWAS.
So far, in the EWAS community, no gold standard exists for the comparison of various methods. Ideally, we would like to have epigenetic spike-in experiments in which purified cell types are isolated, CpG sites are epigenetically edited on a per-cell-type basis, and cell types are finally mixed in predetermined proportions. In such experiments, both the CpG sites that are differentially methylated in each cell type and the cell mixing proportions for each sample would be known. However, biotechnologies for epigenetic editing, such as CRISPR-Cas, are still not mature at this stage, with many off-target modifications^{30}. Therefore, most computational EWAS studies rely on numerical simulation studies rather than experimental studies when evaluating the performance of their algorithms^{12,13}. Here, we follow the example of previous comparative studies and design our simulation studies to serve as the computational counterpart of experimental spike-in studies. With the rapid advances in epigenetic editing, we hope the community can devote greater effort in the near future to the creation of a gold-standard dataset, such as those generated in the early years for gene expression microarray studies^{31}.
The beta-values that represent methylation levels always lie between zero and one. As previous approaches to EWAS often assume a normal distribution for the beta-values and perform well in real applications^{9,13}, in HIRE, we also assume that the beta-values follow a normal distribution. Consequently, the fitted methylation level may lie outside the range of [0, 1]. Nevertheless, we do in fact constrain the baseline methylation profiles μ_{jk} to the closed interval [0, 1] and force the cellular compositions p_{ki} to be nonnegative and to add up to one: \(\mathop {\sum}_{k = 1}^K {p_{ki}} = 1\). As a result, because the phenotypes have no effect on most CpG sites, most observations O_{ji} have their means \(\mathop {\sum}_{k = 1}^K {\mu _{jk}} p_{ki}\) in [0, 1]. In fact, for both the RA dataset and the GALA II dataset, more than 99.99% of the fitted methylation values \(\hat O_{ji}\) based on HIRE estimates lie between zero and one. Therefore, the normal assumption fits the data reasonably well and does not have a large effect on the performance of HIRE.
One major issue for all of the cell-type deconvolution methods is that deconvolution cannot be achieved if the cellular compositions do not vary among samples. For example, assuming that the samples are mixtures of two cell types and p_{i} = p for all of the samples, then the observed methylation profile O_{i} equals u_{i1}p_{1} + u_{i2}p_{2} = (u_{i1} + p_{2}C)p_{1} + (u_{i2} − p_{1}C)p_{2} := \(\widetilde {\mathbf{u}}_{i1}p_1 + \widetilde {\mathbf{u}}_{i2}p_2\) for any constant C. As a result, u_{i1} and u_{i2} are not estimable. In our paper, we show mathematically that HIRE is identifiable under mild conditions in Theorem 1 and that condition (b) of Theorem 1 formulates the requirement for the variability of the cellular compositions (Methods). HIRE can accurately estimate cellular compositions of tissues with great cellular heterogeneity, such as blood. Although the mild conditions in Theorem 1 are easily met for real DNA methylation data, identification of both sufficient and necessary conditions for model identifiability is a theoretically interesting and challenging statistical problem that we will investigate in a future study.
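The two-cell-type argument above can be verified numerically. The sketch below is only an illustration with made-up profiles, proportions, and constant C; the function names are hypothetical.

```python
import numpy as np

def mix(u1, u2, p):
    """Aggregate profile of two cell types with proportions p = (p1, p2)."""
    return u1 * p[0] + u2 * p[1]

def shifted_profiles(u1, u2, p, C):
    """Alternative profiles that give the same mixture under proportions p."""
    return u1 + p[1] * C, u2 - p[0] * C

rng = np.random.default_rng(0)
u1, u2 = rng.random(5), rng.random(5)      # made-up profiles at five CpG sites
p = (0.6, 0.4)                             # composition shared by all samples
v1, v2 = shifted_profiles(u1, u2, p, C=0.3)

same_when_fixed = np.allclose(mix(u1, u2, p), mix(v1, v2, p))
# A sample with a different composition tells the two solutions apart.
p_other = (0.3, 0.7)
same_when_varied = np.allclose(mix(u1, u2, p_other), mix(v1, v2, p_other))
```

With a single shared composition the two sets of profiles are indistinguishable, whereas a second sample with a different composition separates them, which is the intuition behind requiring the compositions to vary across samples.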
HIRE requires a moderate sample size to obtain precise estimates because HIRE needs to learn (1 + 2K + qK)m + (K − 1)n parameters from a total of mn observed values. Our simulation studies show that HIRE performs very well at the aggregate level with 180 samples (Table 1). If the sample size drops below 150, say to 120, HIRE can still control the FPR well but begins to lose power (Supplementary Table 9). For small sample sizes, we have also developed a special case of HIRE by reparameterizing all \(\sigma _{jk}^2\) as one single parameter σ^{2}, and we found that such a variance-stabilized approach can achieve even better inflation control (see Supplementary Figs. 71–76) and power comparable to HIRE (see Supplementary Table 10). As in the two datasets analyzed in the real application, the sample size of a typical current EWAS exceeds 500, thus guaranteeing a high TPR for HIRE. Given the decreasing cost of EWAS, we recommend that researchers collect at least 200 samples for association detection at the aggregate level and 600 samples for identification of cell-type-specific risk-CpG sites. A larger sample size can further boost the power.
With the popularity of EWAS, we believe that HIRE will be widely applied, and we hope that HIRE can motivate more researchers to mine finer-scale results from EWAS.
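To make the parameter count concrete, the following sketch evaluates the formula for one of the simulated settings. The function name is hypothetical, and the choice q = 2 reflects the two phenotypes (disease status and age) used in the simulations.

```python
def hire_parameter_count(m, n, K, q):
    """Number of free parameters: (1 + 2K + qK) * m + (K - 1) * n.

    Per CpG site: K baseline levels, K cell-type variances, one noise
    variance, and q * K phenotype effects; per sample: K - 1 free proportions.
    """
    return (1 + 2 * K + q * K) * m + (K - 1) * n

m, n, K, q = 10_000, 180, 3, 2
n_params = hire_parameter_count(m, n, K, q)   # 130360
n_obs = m * n                                 # 1800000
obs_per_param = n_obs / n_params
```

In this setting there are about 14 observed values per parameter; the ratio falls as n decreases or as K and q grow, which is consistent with the loss of power reported for small samples.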
Methods
Multiplicative effects of cellular composition on methylation
In this section, we illustrate that the effects of the cell-type composition are actually multiplicative. Let us assume that the beta-values that represent the methylation levels are observed across m CpG sites for n samples. As the measured sample comprises cells of various types, the observed beta-value is a weighted average of the mean methylation levels of distinct cell types, and the weights correspond to the proportions of each cell type. Let O_{ji} denote the measurement at CpG site j for sample i. If we assume that there exist K cell types in all samples and that the mean methylation level for CpG site j in cell type k is μ_{jk}, then
\[ O_{ji} = \mathop {\sum}_{k = 1}^K {\mu _{jk}} p_{ki} + \epsilon _{ji}, \]
where p_{ki} is the proportion of cell type k in sample i with a natural constraint \(\mathop {\sum}_{k = 1}^K {p_{ki}} = 1\), and \(\epsilon _{ji}\) is a random error.
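Let us consider a case-control EWAS. Without loss of generality, we assume that CpG site j is differentially methylated between cases and controls in cell type 1 with a mean shift δ_{j1} and that it is not differentially methylated in the remaining cell types. As a result, for case samples,
\[ O_{ji} = (\mu _{j1} + \delta _{j1})p_{1i} + \mathop {\sum}_{k = 2}^K {\mu _{jk}} p_{ki} + \epsilon _{ji}. \]
If we then use Z_{i} to indicate the case-control status of sample i (Z_{i} = 1 for cases and Z_{i} = 0 for controls), the observed methylation level becomes
\[ O_{ji} = \mathop {\sum}_{k = 1}^K {\mu _{jk}} p_{ki} + \delta _{j1}p_{1i}Z_i + \epsilon _{ji}. \tag{3} \]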
Therefore, the proportions of cell type 1—p_{1i}, i = 1, …, n—have multiplicative effects rather than additive effects on the mean difference between the case and control samples.
The existing methods, which either estimate the cell type proportions explicitly or approximate them implicitly with surrogate variables, add the estimated proportions and the case-control indicator Z_{i} as covariates to the regression as follows:
\[ O_{ji} = \mathop {\sum}_{k = 1}^K {b_{jk}} p_{ki} + \tau _jZ_i + \epsilon _{ji}, \tag{4} \]
where the b_{jk} are the regression coefficients. As a result, CpG site j is called differentially methylated on the basis of hypothesis testing for τ_{j} = 0. In general, τ_{j} in Eq. (4) is not equal to δ_{j1} in Eq. (3). Please see the Supplementary Notes for a numerical example. Moreover, testing for τ_{j} = 0 loses the information regarding the cell type in which CpG site j may be at risk. To account for the multiplicative effects, we propose the HIRE model, which preserves the individual cell-type-level information and is introduced in the next section.
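The gap between τ_{j} and δ_{j1} can also be checked with a small simulation. The sketch below is not the numerical example from the Supplementary Notes; it assumes three cell types, Dirichlet-distributed proportions, a case-control indicator that is independent of the proportions, and made-up values for the baseline levels, the shift, and the noise scale. All function names are hypothetical.

```python
import numpy as np

def simulate_site(n=2000, delta=0.2, seed=0):
    """One CpG site whose methylation shifts by `delta` in cell type 1 for cases."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet([4.0, 3.0, 3.0], size=n)      # n x K proportions, rows sum to 1
    Z = rng.integers(0, 2, size=n).astype(float)    # 1 = case, 0 = control
    mu = np.array([0.3, 0.5, 0.7])                  # assumed baseline levels
    O = P @ mu + delta * P[:, 0] * Z + rng.normal(0.0, 0.01, size=n)
    return O, P, Z

def fit_additive(O, P, Z):
    """Least-squares fit of O on [P, Z]; returns the coefficient of Z (tau)."""
    X = np.column_stack([P, Z])
    coef, *_ = np.linalg.lstsq(X, O, rcond=None)
    return coef[-1]

def fit_multiplicative(O, P, Z):
    """Least-squares fit of O on [P, P * Z]; returns one effect per cell type."""
    X = np.column_stack([P, P * Z[:, None]])
    coef, *_ = np.linalg.lstsq(X, O, rcond=None)
    return coef[P.shape[1]:]

O, P, Z = simulate_site()
tau_hat = fit_additive(O, P, Z)            # near delta * mean(p_1), not delta
delta_hat = fit_multiplicative(O, P, Z)    # near (delta, 0, 0)
```

Under these assumptions, the additive coefficient estimates roughly δ_{j1} times the average proportion of cell type 1 (about 0.4 × 0.2 = 0.08 here), whereas the regression on the products p_{ki}Z_{i} recovers the shift in cell type 1 and near-zero effects in the other cell types.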
The HIRE model
HIRE uses a hierarchical model to closely follow the data generation process for the EWAS data. To begin, we assume that the baseline methylation level for CpG site j in cell type k is μ_{jk}. For sample i with phenotypes x_{i} = (x_{i1}, …, x_{iq}), the mean methylation value for CpG site j in cell type k is assumed to be \(\mu _{jk} + \mathop {\sum}_{\ell = 1}^q {\beta _{jk\ell }} x_{i\ell }\). In other words, the phenotypes have linear effects, where \(\beta _{jk\ell }\) characterizes the influence of phenotype \(\ell \) on CpG site j in cell type k. Let u_{ijk} represent the signal from CpG site j in cell type k for sample i with x_{i}. We assume that u_{ijk} follows a normal distribution with mean \(\mu _{jk} + \mathop {\sum}_{\ell = 1}^q {\beta _{jk\ell }} x_{i\ell }\) and standard deviation σ_{jk},
\[ u_{ijk} \sim N\left( {\mu _{jk} + \mathop {\sum}_{\ell = 1}^q {\beta _{jk\ell }} x_{i\ell },\ \sigma _{jk}^2} \right). \tag{5} \]
After the u_{ijk} are generated for all of the K cell types, the observed methylation value O_{ji} is sampled as follows:
\[ O_{ji} \sim N\left( {\mathop {\sum}_{k = 1}^K {u_{ijk}} p_{ki},\ \sigma _{\epsilon j}^2} \right). \tag{6} \]
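Collectively, O = {O_{ji} : 1 ≤ j ≤ m, 1 ≤ i ≤ n} denotes the observed data; u = {(u_{ij1}, …, u_{ijK})^{T} : 1 ≤ i ≤ n, 1 ≤ j ≤ m} are the missing data; and μ_{j} = (μ_{j1}, …, μ_{jK})^{T}, \({\mathbf{B}}^{(j)} = (\beta _{jk\ell })_{K \times q}\), \(\sigma _{\epsilon j}^2\), the diagonal matrix \(\Sigma _j = diag(\sigma _{j1}^2, \ldots ,\sigma _{jK}^2)\) for j = 1, …, m, and p_{i} = (p_{1i}, …, p_{Ki})^{T} for i = 1, …, n are the parameters. With \({\mathbf{\Theta }} = \{ {\mathbf{p}}_i,{\boldsymbol{\mu }}_j,{\mathbf{B}}^{(j)},\Sigma _j,\sigma _{\epsilon j}^2:1 \le j \le m,1 \le i \le n\} \), the complete data log-likelihood function, l_{c}, can be expressed as follows:
\[ l_c({\mathbf{\Theta }}) = - \frac{1}{2}\mathop {\sum}_{i = 1}^n \mathop {\sum}_{j = 1}^m \left[ {\log (2\pi \sigma _{\epsilon j}^2) + \frac{{\left( {O_{ji} - \mathop {\sum}_{k = 1}^K {u_{ijk}} p_{ki}} \right)^2}}{{\sigma _{\epsilon j}^2}}} \right] - \frac{1}{2}\mathop {\sum}_{i = 1}^n \mathop {\sum}_{j = 1}^m \mathop {\sum}_{k = 1}^K \left[ {\log (2\pi \sigma _{jk}^2) + \frac{{\left( {u_{ijk} - \mu _{jk} - \mathop {\sum}_{\ell = 1}^q {\beta _{jk\ell }} x_{i\ell }} \right)^2}}{{\sigma _{jk}^2}}} \right]. \]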
Accordingly, we develop a generalized expectation-maximization algorithm^{32} to estimate the parameters. In the expectation-maximization algorithm, a good initialization can lead to faster convergence than random starts. We adopt the cellular composition estimates from the methylation matrix decomposition algorithm^{16}, with slight modifications, as the initial values. The initial values for the baseline methylation profiles μ_{jk} are accordingly estimated by simple linear regressions. As the number of risk-CpG sites is often small, all of the phenotype effects \(\beta _{jk\ell }\) are set to zero at the beginning. For the standard deviations, the initial values are randomly sampled from inverse gamma distributions with small means. We choose the number of cell types K by using a variant of the penalized Bayesian information criterion (pBIC)^{20} (see details in Supplementary Methods).
For each phenotype \(\ell \), we can conduct the hypothesis test \(H_0:\beta _{jk\ell } = 0\) versus \(H_1:\beta _{jk\ell } \ \ne \ 0\) for any cell type k and any CpG site j. Combining Eqs. (5) and (6), we obtain the following equations:
We can then take (O_{j1}, …, O_{jn}) as the response vector and concatenate 1_{n}, (p_{k1}, …, p_{kn}) (k = 2, …, K) and \((x_{1\ell }p_{k1}, \ldots ,x_{n\ell }p_{kn})\) \((\ell = 1, \ldots ,q;k = 1, \ldots ,K)\) into an n × (q + 1)K design matrix for the linear regression. We plug in the estimated cellular compositions \(\hat p_{ki}\) and test \(\beta _{jk\ell } = 0\) using two-sided t-tests in the linear model. We claim that CpG site j has an association with phenotype \(\ell \) at the aggregate level if phenotype \(\ell \) affects CpG site j in at least one of the K cell types. Note that in the regression we incorporate the estimated cellular compositions into the linear model as multiplicative effects rather than additive effects.
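The layout of this regression can be made concrete with a small sketch. The function names `hire_design_row` and `hire_design_matrix` are hypothetical and are not part of the HIREewas package; the sketch only assembles the columns described above (one intercept, K − 1 composition columns, and qK phenotype-by-composition columns, for (q + 1)K columns in total) and leaves the t-tests to any standard linear-model routine.

```python
def hire_design_row(p_i, x_i):
    """Design-matrix row for one sample (illustrative helper, not HIREewas API).

    p_i: list of K estimated cell-type proportions for the sample.
    x_i: list of q phenotype values for the sample.
    """
    row = [1.0]                 # intercept column
    row.extend(p_i[1:])         # p_{ki} for k = 2, ..., K
    for x in x_i:               # x_{il} * p_{ki} for l = 1..q and k = 1..K
        row.extend(x * p for p in p_i)
    return row


def hire_design_matrix(P, X):
    """Stack the rows for all samples; P and X are lists with one entry per sample."""
    return [hire_design_row(p_i, x_i) for p_i, x_i in zip(P, X)]


# Toy input: n = 2 samples, K = 3 cell types, q = 2 phenotypes (status, age).
P = [[0.2, 0.3, 0.5], [0.6, 0.1, 0.3]]
X = [[1.0, 35.0], [0.0, 42.0]]
D = hire_design_matrix(P, X)
```

Because each phenotype enters only through its products with the proportions, the coefficient on the column x_{iℓ}p_{ki} is the cell-type-specific effect β_{jkℓ}, which is what makes the per-cell-type test possible.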
More technical details of the method and the algorithm are available in the Supplementary Methods.
Data simulation
We compared the performance of HIRE with five previous methods—unadjusted analysis, SVA, RefFreeEWAS, EWASHer, and ReFACTor—in 18 simulation settings. We set the sample size n to 180, 300, and 600 and let the underlying cell type number K be 3, 5, and 7. For each pair of (n, K), we investigated the true null case and the true alternative case. As a result, we have in total 3 (the number of sample sizes) × 3 (the number of cell types) × 2 (the true null case and the true alternative case) = 18 simulation settings. For each setting, we considered 10,000 CpG sites and simultaneously accounted for the following factors.
Cell lineage. We first constructed the baseline methylation matrix μ = (μ_{jk})_{m×K}, in which each column corresponds to the baseline methylation levels of a cell type. To mimic the phenomenon in which cell types from the same lineage have similar methylation profiles, we assumed that K_{sim} of the total K cell types were similar. Specifically, without loss of generality, we assumed that the first K_{sim} cell types came from the same cell lineage and that the remaining K − K_{sim} cell types were unrelated to one another. We set K_{sim} to 2, 2, and 3 for K = 3, 5, and 7, respectively. We generated μ_{jk} for cell types k = 1, K_{sim} + 1, …, K from the beta distribution beta(3, 6) on each CpG site j independently. For each of the remaining cell types k′ = 2, …, K_{sim}, we randomly selected 20% of the CpG sites and drew their μ_{jk′}s independently from beta(3, 6); for the other 80% of the CpG sites, we set μ_{jk′} to μ_{j1} plus a very small random perturbation, thus inducing similarity among cell types 1 to K_{sim}.
Discrete and continuous phenotypes. We further generated a discrete and a continuous phenotype x_{i} = (x_{i1}, x_{i2})^{T} for each individual i (i = 1, …, n). We let the first n/3 individuals be the control samples with x_{i1} = 0 for i = 1, …, n/3 and the remaining 2n/3 individuals serve as cases with x_{i1} = 1 for i = n/3 + 1, …, n. The continuous phenotypes x_{2} = (x_{12}, …, x_{i2}, …, x_{n2})^{T} were independently drawn from Unif(20, 50) to act as age.
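A minimal sketch of the cell-lineage construction, under stated assumptions: the text does not specify the size of the "very small" perturbation, so the Gaussian jitter with standard deviation 0.01 and the clamping to [0, 1] below are illustrative choices, and `simulate_baseline` is a hypothetical name.

```python
import random


def simulate_baseline(m, K, K_sim, independent_frac=0.2, jitter_sd=0.01, seed=0):
    """Return an m x K list of baseline methylation levels mu[j][k] (0-based)."""
    rng = random.Random(seed)
    mu = [[0.0] * K for _ in range(m)]
    # Cell type 1 and the unrelated cell types K_sim+1..K: independent beta(3, 6).
    for j in range(m):
        for k in [0] + list(range(K_sim, K)):
            mu[j][k] = rng.betavariate(3, 6)
    # Cell types 2..K_sim: 20% of sites drawn fresh, 80% copied from cell type 1.
    for k in range(1, K_sim):
        fresh = set(rng.sample(range(m), round(independent_frac * m)))
        for j in range(m):
            if j in fresh:
                mu[j][k] = rng.betavariate(3, 6)
            else:
                # Assumed perturbation: small Gaussian jitter, clamped to [0, 1].
                value = mu[j][0] + rng.gauss(0.0, jitter_sd)
                mu[j][k] = min(max(value, 0.0), 1.0)
    return mu


mu = simulate_baseline(m=500, K=5, K_sim=2)
```

With K = 5 and K_sim = 2, columns 1 and 2 agree closely on about 80% of the sites, while columns 3 to 5 are drawn independently.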
Phenotype effects with different magnitudes and directions. We then simulated the phenotype effect \(\beta _{jk\ell }\) of each phenotype \(\ell \) on CpG site j in cell type k. For the true null cases, all of the \(\beta _{jk\ell }\)s are zero. For a true alternative setting, we set nonzero phenotype effects as follows.
For phenotype 1 (the case/control status), we let it affect the first 10 CpG sites in all of the cell types: β_{jk1} ≠ 0 for j = 1, …, 10 and k = 1, …, K. We then assumed that the next 10 CpG sites were influenced by the disease status in the first K_{sim} cell types, which come from the same lineage, but not in the other cell types: β_{jk1} ≠ 0 (k = 1, …, K_{sim}) and β_{jk1} = 0 (k = K_{sim} + 1, …, K) for any j = 11, …, 20. Furthermore, for cell type k ∈ {K_{sim} + 1, …, K}, we let the disease status affect CpG sites j = 20 + 10(k − K_{sim} − 1) + 1, …, 20 + 10(k − K_{sim}) only in cell type k. We generated the cell-type-specific effects of age in a similar fashion for CpG sites 21 to 40 + 10(K − K_{sim}).
For each nonzero β_{jk1}, we let β_{jk1} = r_{jk} · ω_{jk}, where ω_{jk} ~ Unif(0.07, 0.15) and r_{jk} takes values of 1 and −1 with equal probabilities. Thus, β_{jk1}s can have both positive and negative effects. In the same spirit, we generated nonzero β_{jk2}s with \(r_{jk}^\prime \)s and \(\omega _{jk}^\prime \)s where \(\omega _{jk}^\prime \sim Unif(0.007,0.015)\).
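The index arithmetic for the disease-status effects is easy to misread, so the following sketch enumerates which (CpG site, cell type) pairs carry a nonzero β_{jk1} and draws the effects with random sign. It covers phenotype 1 only; the helper names are hypothetical, and indices are 1-based to match the text.

```python
import random


def disease_effect_pattern(K, K_sim):
    """Map CpG index j (1-based) to the cell types (1-based) with nonzero beta_{jk1}."""
    pattern = {}
    for j in range(1, 11):                  # sites 1..10: all cell types
        pattern[j] = list(range(1, K + 1))
    for j in range(11, 21):                 # sites 11..20: same-lineage cell types
        pattern[j] = list(range(1, K_sim + 1))
    for k in range(K_sim + 1, K + 1):       # ten sites specific to each other type
        start = 20 + 10 * (k - K_sim - 1) + 1
        stop = 20 + 10 * (k - K_sim)
        for j in range(start, stop + 1):
            pattern[j] = [k]
    return pattern


def draw_effect(rng, low, high):
    """beta = r * omega with omega ~ Unif(low, high) and r = +1 or -1 with equal probability."""
    sign = 1 if rng.random() < 0.5 else -1
    return sign * rng.uniform(low, high)


rng = random.Random(1)
pattern = disease_effect_pattern(K=5, K_sim=2)
beta1 = {(j, k): draw_effect(rng, 0.07, 0.15)
         for j, cell_types in pattern.items() for k in cell_types}
```

For K = 5 and K_sim = 2, sites 21 to 30 are affected only in cell type 3, sites 31 to 40 only in cell type 4, and sites 41 to 50 only in cell type 5.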
Association between phenotypes and cellular compositions. Notice that the phenotypes may be associated with the cellular composition. Therefore, when K = 3, we drew p_{i} = (p_{1i}, …, p_{Ki}) from a Dirichlet distribution Dir(4, 4, 2 + 0.1x_{i2}) if sample i is a control and p_{i} ~ Dir(4, 4, 5 + 0.1x_{i2}) if it is a case; when K = 5, we let p_{i} ~ Dir(3, 3, 3, 3, 2 + 0.1x_{i2}) for a control sample and p_{i} ~ Dir(3, 3, 3, 3, 5 + 0.1x_{i2}) for a case sample; and when K = 7, we sampled p_{i} ~ Dir(1, 3, 3, 3, 2, 2, 2 + 0.1x_{i2}) for controls and p_{i} ~ Dir(1, 3, 3, 3, 2, 2, 5 + 0.1x_{i2}) for cases.
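The phenotype-dependent Dirichlet draws can be sketched with the standard library by normalizing independent gamma variates, which is a standard way to sample from a Dirichlet distribution. The function names and the `BASE_ALPHA` table are illustrative; the concentration parameters are those listed above.

```python
import random

# Concentration parameters for the first K - 1 cell types, as listed in the text.
BASE_ALPHA = {3: [4, 4], 5: [3, 3, 3, 3], 7: [1, 3, 3, 3, 2, 2]}


def dirichlet(rng, alpha):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_composition(rng, K, is_case, age):
    """Cellular composition p_i; the last parameter depends on status and age."""
    last = (5 if is_case else 2) + 0.1 * age
    return dirichlet(rng, BASE_ALPHA[K] + [last])


rng = random.Random(0)
p_control = sample_composition(rng, K=3, is_case=False, age=35.0)
p_case = sample_composition(rng, K=3, is_case=True, age=35.0)
```

Only the last concentration parameter changes with the phenotypes, so on average cases and older individuals have a larger proportion of the last cell type.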
Finally, we generated the observed value O_{ji} for CpG site j of sample i as follows: sample u_{ijk} from N(μ_{jk} + β_{jk1}x_{i1} + β_{jk2}x_{i2}, 0.01^{2}) for k = 1, …, K; and sample O_{ji} from \(N(\mathop {\sum}_{k = 1}^K {u_{ijk}} p_{ki},0.01^2)\). In case O_{ji} lies outside the interval (0, 1), we truncate it to zero if O_{ji} is lower than zero and to one if O_{ji} is greater than one.
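The two sampling layers and the truncation step can be sketched for a single CpG site and sample as follows. The function name and argument layout are hypothetical; the standard deviations of 0.01 are the values stated above.

```python
import random


def simulate_observation(rng, mu_j, beta_j, x_i, p_i, sd_u=0.01, sd_eps=0.01):
    """Simulate one observed value O_ji.

    mu_j:   K baseline levels for CpG site j.
    beta_j: K lists of q phenotype effects for CpG site j.
    x_i:    q phenotype values for sample i.
    p_i:    K cell-type proportions for sample i.
    """
    u = []
    for k in range(len(mu_j)):
        mean_k = mu_j[k] + sum(b * x for b, x in zip(beta_j[k], x_i))
        u.append(rng.gauss(mean_k, sd_u))            # cell-type-specific signal
    mixed = sum(u_k * p_k for u_k, p_k in zip(u, p_i))
    o = rng.gauss(mixed, sd_eps)                     # aggregated measurement
    return min(max(o, 0.0), 1.0)                     # truncate to [0, 1]


rng = random.Random(0)
o_ji = simulate_observation(
    rng,
    mu_j=[0.3, 0.5, 0.4],
    beta_j=[[0.1, 0.0], [0.0, 0.0], [0.0, 0.01]],
    x_i=[1.0, 35.0],
    p_i=[0.2, 0.3, 0.5],
)
```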
Semisimulated dataset including samples with known cell mix proportions
The GEO dataset GSE110554^{24} contains purified celltypespecific methylation profiles for six cell types: neutrophils, monocytes, B cells, CD4+ T, CD8+ T, and NK. Moreover, GSE110554 includes mixed samples whose methylation signals were aggregated from the six cell types with predetermined cell mix proportions. Therefore, because of the known cell type and cellular proportion information, GSE110554 is an ideal dataset with which to test HIRE’s performance.
In GSE110554, the number of mixed samples is much smaller than the typical size of an EWAS and, as discussed in the manuscript, HIRE usually requires hundreds of samples to obtain accurate and stable results. Therefore, to increase the sample size, we first generated a simulated methylation dataset with 600 samples using the purified methylation profiles. We focused on 10k CpG sites, including the 450 IDOL CpG sites, which were previously identified as the optimal library of CpG sites for estimation of leukocyte subtype proportions^{24}, and another 9550 CpG sites whose methylation values across the purified cell types fell within the range of [0.2, 0.8] and had large standard deviations^{11}. We then combined the 600 samples and six mixed samples (generated by method A)^{24} available in GSE110554 to compose a semisimulated dataset.
After applying HIRE to the semisimulated data, we annotated the estimated cell types based on the methylation profiles from GSE110554. Supplementary Figure 69 shows the heatmap for the Pearson correlation matrix between inferred cell types and the underlying truth. The correlation signals on the diagonal are the strongest in each row. HIRE successfully recovers the six underlying cell types. We also compared the estimated cellular compositions with the underlying true proportions for the six mixed samples. Each panel in Supplementary Fig. 70 displays a scatter plot between the cellular proportion estimates and the true mix proportions for a given cell type; they all indicate that HIRE obtains good estimates for cellular compositions.
Cell type matching protocol
Assume that we have the reference methylation profiles for the H annotated cell types. We first denote the methylation profile for cell type h as ϕ_{h} = (ϕ_{1h}, …, ϕ_{mh}). We aim to annotate μ_{k} using the references. Following the previous study^{33}, first, we calculate the cosine similarity, the Pearson correlation, and the Spearman correlation between μ_{k} and ϕ_{h} for each cell type h ∈ {1, …, H}. Notice that the three similarity measures lie between −1 and 1, and a high positive value indicates great similarity between two vectors. Second, for each similarity measure \(\ell \) (\(\ell = 1,2,3\)), we identify the cell type \(h_\ell \) that has the maximal degree of similarity with μ_{k}. If at least two out of the three similarity measures identify the same reference cell type \(\tilde h\) and their corresponding similarity values are greater than 0.5, then we annotate μ_{k} with the reference cell type \(\tilde h\). Otherwise, μ_{k} is believed to belong to a new cell type that is not included in the references. We repeat the above process for each methylation profile μ_{k} estimated from HIRE.
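The matching rule can be sketched in plain Python as below. This is one reading of the protocol, not the authors' implementation: a similarity measure casts a vote for its best-matching reference only if that similarity exceeds 0.5, and a profile is annotated when some reference collects at least two votes. All function names are illustrative.

```python
import math


def cosine(a, b):
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0


def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])


def ranks(v):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    out, i = [0.0] * len(v), 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for t in range(i, j + 1):
            out[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(a, b):
    return pearson(ranks(a), ranks(b))


def annotate(mu_k, references, threshold=0.5):
    """Return the matched reference name, or None for a putative new cell type."""
    votes = {}
    for measure in (cosine, pearson, spearman):
        best = max(references, key=lambda h: measure(mu_k, references[h]))
        if measure(mu_k, references[best]) > threshold:
            votes[best] = votes.get(best, 0) + 1
    return next((name for name, n in votes.items() if n >= 2), None)


references = {"B": [0.1, 0.9, 0.2, 0.8, 0.5], "T": [0.9, 0.1, 0.8, 0.2, 0.4]}
label = annotate([0.15, 0.85, 0.25, 0.7, 0.5], references)
```

Requiring agreement between two of the three measures guards against annotations driven by a single measure; cosine similarity in particular tends to be high between any two nonnegative vectors.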
Blood cell references
The two real data sets analyzed in our applications were obtained from whole blood. Therefore, we prepared the references from a whole blood methylation study^{34} with GEO accession code GSE35069. The study included seven isolated blood cell subpopulations—CD4+ T cells, CD8+ T cells, CD14+ monocytes, CD19+ B cells, CD56+ NK cells, neutrophils, and eosinophils—for six individuals. Accordingly, we define the reference profile ϕ_{h} for cell type h as the average methylation profile of these individuals, i.e., \({\boldsymbol{\phi }}_h: = \frac{1}{6}\mathop {\sum}_{i = 1}^6 {{\boldsymbol{\phi }}_{hi}} \).
Data preprocessing
The RA dataset is publicly available in GEO with accession number GSE42861. The dataset measures the methylation levels of whole blood. The methylation data have been normalized by Illumina’s control probe scaling procedure (see Liu et al.^{3} “Illumina 450K microarray data preprocessing” section for details). The dataset includes 689 samples, and the RA status, age, gender, smoking history, and batch information are available for each sample. We removed two samples, GSM1051535 and GSM1051691, because their smoking information is missing. CpG sites with a high methylation mean (>0.8) or a low methylation mean (<0.2) were discarded^{11,13}. We adjusted the data for batch effects using COMBAT^{35}. The correction process was justified because we did not observe a high degree of collinearity between the RA status and the batches (Supplementary Fig. 68). The 10,000 most variable CpG sites were kept. For the RA status, we denoted RA patients with 1 and the normal control subjects with 0; we represented men with 1 and women with 0; for the smoking history, we used (0, 0, 0) to refer to “never,” (1, 0, 0) to “ex,” (0, 1, 0) to “current,” and (0, 0, 1) to “occasional” smokers.
We downloaded the GALA II dataset from Gene Expression Omnibus (GEO) with accession number GSE77716. The dataset contains the wholeblood DNA methylation betavalues from 573 samples. The betavalues have been normalized by SWAN^{36} and corrected for batch effects by COMBAT^{35}. There are two types of covariates: gender and ethnicity. Ethnicity includes Mexican, Mixed Latino, Puerto Rican, and Other Latino. Out of the 573 samples, one sample “GSM2057284” has no gender information, so we removed it. As suggested by previous studies^{11,13}, CpG sites with a mean methylation value of less than 0.2 or higher than 0.8 were filtered out. We selected the 10,000 most variable of the remaining CpG sites. For gender, we denoted men with 1 and women with 0. For the ethnicity variables, we used three dummy variables to represent the four ethnicity categories. In particular, (0, 0, 0), (1, 0, 0), (0, 1, 0), and (0, 0, 1) corresponded to Mexican, Mixed Latino, Puerto Rican, and Other Latino, respectively.
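The site filtering and the dummy coding used for both datasets can be sketched as follows. The text does not state how "most variable" was measured, so ranking by variance is an assumption here, and both function names are hypothetical.

```python
import statistics


def select_cpg_sites(beta, low=0.2, high=0.8, n_keep=10000):
    """Keep sites whose mean beta-value lies in [low, high], then the most variable ones.

    beta: dict mapping a CpG identifier to its beta-values across samples.
    Variability is measured by variance (an assumption, not stated in the text).
    """
    kept = {s: v for s, v in beta.items() if low <= statistics.mean(v) <= high}
    ranked = sorted(kept, key=lambda s: statistics.pvariance(kept[s]), reverse=True)
    return ranked[:n_keep]


def encode_category(value, levels):
    """Dummy-code a category; the first level is the all-zero baseline."""
    return tuple(1 if value == level else 0 for level in levels[1:])


ETHNICITY = ["Mexican", "Mixed Latino", "Puerto Rican", "Other Latino"]
SMOKING = ["never", "ex", "current", "occasional"]

toy = {
    "cg1": [0.90, 0.95, 0.85],   # mean above 0.8: discarded
    "cg2": [0.40, 0.60, 0.50],
    "cg3": [0.30, 0.70, 0.50],   # most variable of the retained sites
    "cg4": [0.05, 0.10, 0.15],   # mean below 0.2: discarded
}
selected = select_cpg_sites(toy, n_keep=2)
```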
For ReFACTor and EWASHer, according to their rules, we first filtered out CpG sites that were consistently hypomethylated or consistently hypermethylated and then regressed out the known covariates. We finally used the residuals to perform their analysis. Note that in their software these steps are processed automatically. For RefFreeEWAS, SVA, and the unadjusted analysis, the phenotypes and the covariates were regarded as the fixed effects in the regression model. In detail, for ReFACTor, in both GALA II and RA datasets, the cell type number “K” was specified to be six, which was the same as in their paper^{13}. For RefFreeEWAS, we fixed the dimensionality of latent space “d” at six in the real data. For SVA, we also fixed the number of surrogate variables to six.
Gene enrichment analysis was carried out on the Broad Institute website http://software.broadinstitute.org/gsea/msigdb/annotate.jsp. The canonical pathways were selected as the basis gene sets, and only pathways with a false discovery rate of less than 0.05 were reported.
Identifiability of HIRE
Although the nonnegative matrix factorization (NNMF) O = μP has been widely applied in cell type deconvolution^{16}, where O is the observed methylation matrix, μ is the unknown methylation profile, and P is the unknown cellular compositions, model identifiability is rarely discussed. During the review period of our paper, Rahmani et al.^{37} provided a setting under which the NNMF model is not identifiable.
Why then does NNMF always provide satisfactory cell type deconvolution results in real practice, and why can HIRE estimate all those parameters well? Here, we show mathematically that the HIRE model is identifiable under mild conditions that are easily met in reality.
Let us first introduce some notation and definitions. In the HIRE model, the whole parameter set is denoted by \({\mathbf{\Theta }}: = \{ {\mathbf{P}}_i,{\boldsymbol{\mu }}_j,{\mathbf{B}}_\ell ^{(j)},\sigma _{jk}^2,\sigma _{\epsilon j}^2:1 \le j \le m,1 \le i \le n,1 \le k \le K,1 \le \ell \le q\} \), where P_{i} is the cellular composition vector of sample i, μ_{j} is the baseline methylation vector of CpG site j, \({\mathbf{B}}_\ell ^{(j)}\) is the phenotype \(\ell \) effect vector on CpG site j, \(\sigma _{jk}^2\) is the cell-type-k noise variance on CpG site j, and \(\sigma _{\epsilon j}^2\) is the overall noise variance on CpG site j.
The observed data in our study are the methylation matrix O = {O_{ji}:1 ≤ i ≤ n, 1 ≤ j ≤ m} and the covariate matrix \({\mathbf{X}} = ({\mathbf{x}}_1, \ldots ,{\mathbf{x}}_\ell , \ldots ,{\mathbf{x}}_q)\), where \({\mathbf{x}}_\ell \) is the column vector that indicates phenotype \(\ell \) for the n samples. The observed likelihood function is \(L({\mathbf{\Theta }}\mid {\mathbf{O}}) = \mathop {\prod}_{i = 1}^n {\mathop {\prod}_{j = 1}^m N } (O_{ji}:{\mathbf{P}}_i^T {\boldsymbol{\mu }}_{j} + \mathop {\sum}_{\ell = 1}^q {x_{i\ell }} {\mathbf{P}}_i^T {\mathbf{B}}_\ell ^{(j)},\mathop {\sum}_{k = 1}^K {\sigma _{jk}^2} \, P_{ki}^2 + \sigma _{\epsilon j}^2)\) (see Eq. (S7) in the Supplementary Methods), where N(O : η, τ^{2}) indicates the normal density with mean η and variance τ^{2} at value O.
We further define 1_{K} = (1, 1, …, 1)^{T} as a Kdimension column vector with all entries being one, an n by K matrix J_{1} as \({\mathbf{1}}_n{\mathbf{1}}_K^T\), and an n by K matrix \({\mathbf{J}}_{x_\ell }\) as \({\mathbf{x}}_\ell {\mathbf{1}}_K^T\) for each \(1 \le \ell \le q\). We use ⊙ to represent the entrywise matrix product for two matrices M and N with the same dimension, i.e., \(({\mathbf{M}} \odot {\mathbf{N}})_{ij}: = {\mathbf{M}}_{ij}{\mathbf{N}}_{ij}\).
Theorem 1. If (a) for each cell type k, there exists a CpG site r_{k} such that \({\mathbf{B}}_\ell ^{(r_k)} = 0\) for any phenotype \(\ell \) and \(\mu _{r_kk} = 1\) while \(\mu _{r_kk{\prime}} = 0\) for k′ ≠ k, and (b) the cellular compositions P satisfy \(rank(({\mathbf{J}}_1 \odot {\mathbf{P}}^T,{\mathbf{J}}_{x_1} \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_\ell } \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_q} \odot {\mathbf{P}}^T)) = (q + 1)K\) and \(rank(({\mathbf{1}}_n,{\mathbf{P}}^T) \odot ({\mathbf{1}}_n,{\mathbf{P}}^T)) = K + 1\), then the HIRE model is identifiable. In other words, \(L({\mathbf{\Theta }}\mid {\mathbf{O}}) = L(\widetilde {\mathbf{\Theta }}\mid {\mathbf{O}})\) for any O implies \({\mathbf{\Theta }} = \widetilde {\mathbf{\Theta }}\).
Proof: First, by integrating out all elements of O except O_{ji}, \(L({\mathbf{\Theta }}\mid {\mathbf{O}}) = L(\widetilde {\mathbf{\Theta }}\mid {\mathbf{O}})\) implies \(N(O_{ji}:{\mathbf{P}}_i^T {\boldsymbol{\mu }}_{j} + \mathop {\sum}_{\ell = 1}^q {x_{i\ell }} {\mathbf{P}}_i^T {\mathbf{B}}_\ell ^{(j)},\mathop {\sum}_{k = 1}^K {\sigma _{jk}^2} P_{ki}^2 + \sigma _{\epsilon j}^2)\) = \(N(O_{ji}:\widetilde {\mathbf{P}}_i^T \widetilde {\boldsymbol{\mu }}_{j} + \mathop {\sum}_{\ell = 1}^q {x_{i\ell }} \widetilde {\mathbf{P}}_i^T\widetilde {\mathbf{B}}_\ell ^{(j)},\mathop {\sum}_{k = 1}^K {\tilde \sigma _{jk}^2} \tilde P_{ki}^2 + \tilde \sigma _{\epsilon j}^2)\). Because the univariate normal distribution is identifiable, we have
\[{\mathbf{P}}_i^T{\boldsymbol{\mu }}_j + \mathop {\sum}_{\ell = 1}^q x_{i\ell }{\mathbf{P}}_i^T{\mathbf{B}}_\ell ^{(j)} = \widetilde {\mathbf{P}}_i^T\widetilde {\boldsymbol{\mu }}_j + \mathop {\sum}_{\ell = 1}^q x_{i\ell }\widetilde {\mathbf{P}}_i^T\widetilde {\mathbf{B}}_\ell ^{(j)} \tag{8}\]
and
\[\mathop {\sum}_{k = 1}^K \sigma _{jk}^2P_{ki}^2 + \sigma _{\epsilon j}^2 = \mathop {\sum}_{k = 1}^K \tilde \sigma _{jk}^2\tilde P_{ki}^2 + \tilde \sigma _{\epsilon j}^2. \tag{9}\]
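Taking j = r_{k} in Eq. (8), we have \(LHS = {\mathbf{P}}_i^T{\boldsymbol{\mu }}_{r_k} + \mathop {\sum}_{\ell = 1}^q {x_{i\ell }} {\mathbf{P}}_i^T{\mathbf{B}}_\ell ^{(r_k)} = {\mathbf{P}}_i^T{\boldsymbol{\mu }}_{r_k} = 0 + P_{ki}\cdot 1 + 0 = P_{ki}\) and similarly \(RHS = \tilde P_{ki}\), so \(P_{ki} = \tilde P_{ki}\), which holds for any i and k. Hence, we obtain \({\mathbf{P}} = \widetilde {\mathbf{P}}\). Next, we rewrite Eq. (8) into a matrix form. Using \({\mathbf{P}} = \widetilde {\mathbf{P}}\), for each sample i,
\[({\mathbf{P}}_i^T,x_{i1}{\mathbf{P}}_i^T, \ldots ,x_{iq}{\mathbf{P}}_i^T)\left( {\begin{array}{*{20}{c}} {{\boldsymbol{\mu }}_j} \\ {{\mathbf{B}}_1^{(j)}} \\ \vdots \\ {{\mathbf{B}}_q^{(j)}} \end{array}} \right) = ({\mathbf{P}}_i^T,x_{i1}{\mathbf{P}}_i^T, \ldots ,x_{iq}{\mathbf{P}}_i^T)\left( {\begin{array}{*{20}{c}} {\widetilde {\boldsymbol{\mu }}_j} \\ {\widetilde {\mathbf{B}}_1^{(j)}} \\ \vdots \\ {\widetilde {\mathbf{B}}_q^{(j)}} \end{array}} \right).\]
By combining these n equations, it follows that
\[({\mathbf{J}}_1 \odot {\mathbf{P}}^T,{\mathbf{J}}_{x_1} \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_q} \odot {\mathbf{P}}^T)\left( {\begin{array}{*{20}{c}} {{\boldsymbol{\mu }}_j} \\ {{\mathbf{B}}_1^{(j)}} \\ \vdots \\ {{\mathbf{B}}_q^{(j)}} \end{array}} \right) = ({\mathbf{J}}_1 \odot {\mathbf{P}}^T,{\mathbf{J}}_{x_1} \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_q} \odot {\mathbf{P}}^T)\left( {\begin{array}{*{20}{c}} {\widetilde {\boldsymbol{\mu }}_j} \\ {\widetilde {\mathbf{B}}_1^{(j)}} \\ \vdots \\ {\widetilde {\mathbf{B}}_q^{(j)}} \end{array}} \right). \tag{10}\]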
Because the rank of \(A: = ({\mathbf{J}}_1 \odot {\mathbf{P}}^T,{\mathbf{J}}_{x_1} \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_\ell } \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_q} \odot {\mathbf{P}}^T)\) is (q + 1)K (full column rank), A has a left inverse A^{−1}. Multiplying Eq. (10) by A^{−1} from the left on both sides, we obtain \({\boldsymbol{\mu }}_j = \widetilde {\boldsymbol{\mu }}_j\) and \({\mathbf{B}}_\ell ^{(j)} = \widetilde {\mathbf{B}}_\ell ^{(j)}\) for \(1 \le \ell \le q\). Therefore, we have \({\boldsymbol{\mu }} = \widetilde {\boldsymbol{\mu }}\), \({\mathbf{B}} = \widetilde {\mathbf{B}}\).
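In addition, because Eq. (9) holds for any i and \({\mathbf{P}} = \widetilde {\mathbf{P}}\), we can also rewrite it into a matrix form:
\[\left( {\begin{array}{*{20}{c}} 1 & {P_{11}^2} & \cdots & {P_{K1}^2} \\ \vdots & \vdots & {} & \vdots \\ 1 & {P_{1n}^2} & \cdots & {P_{Kn}^2} \end{array}} \right)\left( {\begin{array}{*{20}{c}} {\sigma _{\epsilon j}^2} \\ {\sigma _{j1}^2} \\ \vdots \\ {\sigma _{jK}^2} \end{array}} \right) = \left( {\begin{array}{*{20}{c}} 1 & {P_{11}^2} & \cdots & {P_{K1}^2} \\ \vdots & \vdots & {} & \vdots \\ 1 & {P_{1n}^2} & \cdots & {P_{Kn}^2} \end{array}} \right)\left( {\begin{array}{*{20}{c}} {\tilde \sigma _{\epsilon j}^2} \\ {\tilde \sigma _{j1}^2} \\ \vdots \\ {\tilde \sigma _{jK}^2} \end{array}} \right).\]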
The left matrix is equal to \(({\mathbf{1}}_n,{\mathbf{P}}^T) \odot ({\mathbf{1}}_n,{\mathbf{P}}^T)\) which has a full column rank; therefore, it has a left inverse. Consequently, \(\sigma _{\epsilon j}^2 = \tilde \sigma _{\epsilon j}^2\) and \(\sigma _{jk}^2 = \tilde \sigma _{jk}^2\). As a result, \({\mathbf{\Theta }} = \widetilde {\mathbf{\Theta }}\), and we have proven the identifiability of HIRE. \(\square \)
Conditions (a) and (b) are easily met for DNA methylation data. Condition (a) requires that for each cell type k, there exists a CpG site that is not associated with any phenotype and is methylated in cell type k but not in any other cell type. Given the 450K CpG sites assayed by the microarray, we can expect such CpG sites to exist. Moreover, condition (a) can also be relaxed to the condition that for each cell type k, there exists a CpG site r_{k} such that \({\mathbf{B}}_\ell ^{(r_k)} = 0\) for any phenotype \(\ell \) and \(\mu _{r_kk} = 1\) while \(\mu _{r_kk\prime } = 0\) for k′ ≠ k or there exists a CpG site r_{k} such that \({\mathbf{B}}_\ell ^{(r_k)} = 0\) for any phenotype \(\ell \) and \(\mu _{r_kk} = 0\) while \(\mu _{r_kk{\prime}} = 1\) for k′ ≠ k. The proof follows in a similar manner.
For condition (b), intuitively, the rank requirement of \(({\mathbf{1}}_n,{\mathbf{P}}^T) \odot ({\mathbf{1}}_n,{\mathbf{P}}^T)\) asks the cellular compositions to vary across subjects, which guards against the case in which all the subjects have the same cellular compositions and hence no cell type deconvolution is possible; the rank requirement on \(({\mathbf{J}}_1 \odot {\mathbf{P}}^T,{\mathbf{J}}_{x_1} \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_\ell } \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_q} \odot {\mathbf{P}}^T)\) is the same requirement as those in a standard linear regression, which requires that no collinearity exists among the covariates. Because the sample size n is much larger than the underlying cell type number K and the phenotype number q, the two rank requirements can commonly be satisfied in reality.
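Condition (b) can be checked numerically for a given set of compositions and phenotypes. The sketch below is illustrative: it uses a small Gaussian-elimination rank routine with a fixed tolerance (a simplification; a numerical library's rank function would be the usual choice), and all function names are hypothetical. Here `P` holds one row per sample, so it plays the role of \({\mathbf{P}}^T\) in the theorem.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank by Gaussian elimination with partial pivoting (illustrative tolerance)."""
    m = [list(map(float, r)) for r in rows]
    n_rows, n_cols, rank = len(m), len(m[0]), 0
    for col in range(n_cols):
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
        if rank == n_rows:
            break
    return rank


def regression_block(P, X):
    """Row i is (P_i^T, x_i1 P_i^T, ..., x_iq P_i^T)."""
    return [list(p) + [x * pk for x in xs for pk in p] for p, xs in zip(P, X)]


def variance_block(P):
    """Row i is (1, P_1i^2, ..., P_Ki^2), i.e. (1_n, P^T) entrywise squared."""
    return [[1.0] + [pk * pk for pk in p] for p in P]


def satisfies_condition_b(P, X):
    K, q = len(P[0]), len(X[0])
    return (matrix_rank(regression_block(P, X)) == (q + 1) * K
            and matrix_rank(variance_block(P)) == K + 1)


X = [[0], [0], [0], [1], [1], [1]]                      # q = 1 binary phenotype
P_varied = [[0.2, 0.8], [0.5, 0.5], [0.7, 0.3],
            [0.4, 0.6], [0.9, 0.1], [0.3, 0.7]]         # K = 2
P_constant = [[0.5, 0.5]] * 6                           # identical compositions
ok_varied = satisfies_condition_b(P_varied, X)
ok_constant = satisfies_condition_b(P_constant, X)
```

In this toy example the varied compositions satisfy both rank requirements, whereas identical compositions across subjects make every row of the squared matrix equal and the check fails.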
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The RA whole blood methylation dataset is available in the Gene Expression Omnibus (GEO) with the accession number GSE42861. The GALA II whole blood methylation dataset can be downloaded from GEO with the accession number GSE77716. The accession number for the blood cell references is GSE35069. The purified methylation data and mixed samples used to generate the semisimulated dataset are taken from GSE110554.
Code availability
The software and detailed documentation are available from the HIREewas package page on Bioconductor [http://www.bioconductor.org/packages/release/bioc/html/HIREewas.html].
References
1. Rakyan, V. K., Down, T. A., Balding, D. J. & Beck, S. Epigenome-wide association studies for common human diseases. Nat. Rev. Genet. 12, 529–541 (2011).
2. Verma, M. Epigenome-wide association studies (EWAS) in cancer. Curr. Genom. 13, 308–313 (2012).
3. Liu, Y. et al. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat. Biotechnol. 31, 142–147 (2013).
4. Gao, X., Jia, M., Zhang, Y., Breitling, L. P. & Brenner, H. DNA methylation changes of whole blood cells in response to active smoking exposure in adults: a systematic review of DNA methylation studies. Clin. Epigenetics 7, 113 (2015).
5. Joehanes, R. et al. Epigenetic signatures of cigarette smoking. Circ. Cardiovasc. Genet. 9, 436–447 (2016).
6. Wahl, S. et al. Epigenome-wide association study of body mass index, and the adverse outcomes of adiposity. Nature 541, 81–86 (2017).
7. Teschendorff, A. E. et al. Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Res. 20, 440–446 (2010).
8. Horvath, S. DNA methylation age of human tissues and cell types. Genome Biol. 14, 3156 (2013).
9. Houseman, E. A. et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics 13, 86 (2012).
10. Jaffe, A. E. & Irizarry, R. A. Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol. 15, R31 (2014).
11. Zou, J., Lippert, C., Heckerman, D., Aryee, M. & Listgarten, J. Epigenome-wide association studies without the need for cell-type composition. Nat. Methods 11, 309–311 (2014).
12. McGregor, K. et al. An evaluation of methods correcting for cell-type heterogeneity in DNA methylation studies. Genome Biol. 17, 84 (2016).
13. Rahmani, E. et al. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nat. Methods 13, 443–445 (2016).
14. Teschendorff, A. E. & Relton, C. L. Statistical and integrative system-level analysis of DNA methylation data. Nat. Rev. Genet. 19, 129–147 (2017).
15. Accomando, W. P., Wiencke, J. K., Houseman, E. A., Nelson, H. H. & Kelsey, K. T. Quantitative reconstruction of leukocyte subsets using DNA methylation. Genome Biol. 15, R50 (2014).
16. Houseman, E. A. et al. Reference-free deconvolution of DNA methylation data and mediation by cell composition effects. BMC Bioinformatics 17, 259 (2016).
17. Zheng, X., Zhang, N., Wu, H.-J. & Wu, H. Estimating and accounting for tumor purity in the analysis of DNA methylation data from cancer studies. Genome Biol. 18, 17 (2017).
18. Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161 (2007).
19. Houseman, E. A., Molitor, J. & Marsit, C. J. Reference-free cell mixture adjustments in analysis of DNA methylation data. Bioinformatics 30, 1431–1439 (2014).
20. Pan, W. & Shen, X. Penalized model-based clustering with application to variable selection. J. Mach. Learn. Res. 8, 1145–1164 (2007).
21. Zheng, S. C. et al. Correcting for cell-type heterogeneity in epigenome-wide association studies: revisiting previous analyses. Nat. Methods 14, 216–217 (2017).
22. Rahmani, E. et al. Correcting for cell-type heterogeneity in DNA methylation: a comprehensive evaluation. Nat. Methods 14, 218–219 (2017).
23. Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA 100, 9440–9445 (2003).
24. Salas, L. A. et al. An optimized library for reference-based deconvolution of whole-blood biospecimens assayed using the Illumina HumanMethylationEPIC BeadArray. Genome Biol. 19, 64 (2018).
25. Neumann, E. et al. Identification of differentially expressed genes in rheumatoid arthritis by a combination of complementary DNA array and RNA arbitrarily primed-polymerase chain reaction. Arthritis Rheumatol. 46, 52–63 (2002).
26. Fasanelli, F. et al. Hypomethylation of smoking-related genes is associated with future lung cancer in four prospective cohorts. Nat. Commun. 6, 10192 (2015).
27. Ambatipudi, S. et al. Tobacco smoking-associated genome-wide DNA methylation changes in the EPIC study. Epigenomics 8, 599–618 (2016).
28. Pino-Yanes, M. et al. Genetic ancestry influences asthma susceptibility and lung function among Latinos. J. Allergy Clin. Immunol. 135, 228–235 (2015).
29. Li, Y. et al. c-Myb enhances breast cancer invasion and metastasis through the Wnt/β-catenin/Axin2 pathway. Cancer Res. 76, 3364–3375 (2016).
30. Zhang, X.-H., Tee, L. Y., Wang, X.-G., Huang, Q.-S. & Yang, S.-H. Off-target effects in CRISPR/Cas9-mediated genome engineering. Mol. Ther. Nucleic Acids 4, e264 (2015).
31. Bolstad, B. M., Irizarry, R. A., Åstrand, M. & Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193 (2003).
32. Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 39, 1–38 (1977).
33. Kiselev, V. Yu, Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359 (2018).
34. Reinius, L. E. et al. Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PLoS ONE 7, e41361 (2012).
35. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
36. Maksimovic, J., Gordon, L. & Oshlack, A. SWAN: Subset-quantile within array normalization for Illumina Infinium HumanMethylation450 BeadChips. Genome Biol. 13, R44 (2012).
37. Rahmani, E. et al. BayesCCE: a Bayesian framework for estimating cell-type composition from DNA methylation without the need for methylation reference. Genome Biol. 19, 141 (2018).
Acknowledgements
X.L. was supported in part by Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University of China (19XNLG08), and the fund for building world-class universities (disciplines) of Renmin University of China. X.L. is grateful for the Hong Kong Ph.D. Fellowship (PF1311656) from the Hong Kong Research Grants Council when X.L. was a Ph.D. student at the Chinese University of Hong Kong. C.Y. was supported in part by the National Science Funding of China [61501389]; the Hong Kong Research Grants Council [22302815, 12316116, 12301417 and 16307818]; The Hong Kong University of Science and Technology [startup grant R9405 and IGN17SC02]. Y.W. was supported in part by the Early Career Scheme 24301416 and General Research Fund 14306417 from the Research Grants Council of the Hong Kong Special Administrative Region and Direct Grants from the Research Committee of the Chinese University of Hong Kong. We acknowledge Mingxuan Cai for his contribution to part of the HIREewas code. We are grateful to the High-Performance Computing Platform of Renmin University of China and the Department of Statistics at the Chinese University of Hong Kong for providing computing resources.
Author information
Contributions
Y.W. and C.Y. conceived the study. X.L. and Y.W. developed the method. Y.W. and X.L. proved the model identifiability. X.L. implemented the algorithm and prepared the software package. X.L. and C.Y. analyzed the data. X.L., Y.W., and C.Y. wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information: Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Luo, X., Yang, C. & Wei, Y. Detection of cell-type-specific risk-CpG sites in epigenome-wide association studies. Nat Commun 10, 3113 (2019). https://doi.org/10.1038/s41467-019-10864-z