Introduction

Epigenome-wide association studies (EWAS) aim to identify cytosine-phosphate-guanine (CpG) sites associated with phenotypes of interest, such as disease status1,2,3, smoking history4,5, body mass index6, and age7,8. However, because the samples in EWAS are measured at the bulk level rather than at the single-cell level, the obtained methylome for each sample shows the signals aggregated from distinct cell types3,9,10, which leads to two main challenges in the analysis of EWAS data. On the one hand, the cell type compositions differ among samples and can be associated with phenotypes3,10. Both binary phenotypes, such as the diseased or normal status3, and continuous phenotypes, such as age10, have been found to affect the cell type compositions. As a result, ignoring the cellular heterogeneity in EWAS can lead to many spurious associations10,11,12,13. On the other hand, the phenotype may change the methylation level of a CpG site in some but not all of the cell types. Identification of the exact cell types that carry the risk-CpG sites can deepen our understanding of disease mechanisms. However, such identification is challenging because only the aggregated-level signals can be observed.

To the best of our knowledge, no existing statistical method for EWAS can detect cell-type-specific associations despite active research to account for cell-type heterogeneity. The existing approaches can be categorized into two schools14: reference-based and reference-free methods. The reference-based methods9,15 require the reference methylation profiles for each cell type to be known a priori, and they regress the aggregated methylation levels observed from each sample on the same set of references to learn the sample’s cellular compositions. However, because samples have different attributes, such as age and gender, the methylation levels of a given cell type can vary among samples. It is thus problematic to assume that all of the samples have the same set of reference profiles10,14. Furthermore, high-quality references are difficult to obtain for most EWAS due to the existence of unknown cell types, the high cost of cell sorting, and confounding effects14. Consequently, a large amount of recent EWAS literature was devoted to identification of risk-CpG sites without the need for the reference methylation profiles.

The reference-free methods can generally be further divided into two classes according to whether they estimate the cell-type mixing proportions directly. The direct-decomposition-based procedures consist of two stages. In the first stage, they simultaneously estimate the cellular compositions of each sample and the cell-type-specific reference methylomes via quadratic programming16; and in the second stage, they treat the estimated cell-type proportions as covariates with additive effects in the linear models to conduct association tests. However, when estimating cellular compositions during the first stage, the direct-decomposition-based methods also do not consider samples’ phenotype information, thus suffering from the same problem of biasing the cellular composition estimates as the reference-based approaches9. Moreover, similar to tumor purity17, we argue that the estimated cellular composition has a multiplicative rather than an additive effect on the observed methylation level (Methods). The second class of methods, which is exemplified by SVA18, RefFreeEWAS19, and ReFACTor13, does not carry out cell-type decompositions. They resort to singular value decomposition, which includes the principal component analysis, to construct surrogates for the underlying cell-type composition. EWASher, a linear mixed model, also belongs to this class because it is equivalent to the use of principal components as fixed-effect covariates11. However, the use of principal components as the covariates in the regression undergoes the same issue of additive effects as the direct-decomposition-based methods. Therefore, the existing reference-free methods have low power in detecting risk-CpG sites12.

Although the existing methods aim to address the cellular heterogeneity problem in EWAS and claim whether a CpG site is associated with phenotypes at the aggregate level, none of them can identify the risk-CpG sites for each individual cell type, thus missing the opportunity to obtain finer-grained results in EWAS.

Here, we propose a method, HIRE, to identify the association in EWAS at a HIgh REsolution: detecting whether a CpG site has any associations with the phenotypes in each cell type (Methods). The keys to HIRE’s success are twofold. First, HIRE links the underlying cell-type-specific methylation profiles for each sample to the sample’s phenotypes, thus avoiding the bias in estimating the cellular composition by the reference-based and direct-decomposition-based methods. Second, HIRE correctly characterizes the cellular compositions as the multiplicative effects, whereas the existing methods inappropriately treat the cell proportions as additive effects (Methods). HIRE is applicable to EWAS with binary phenotypes, continuous phenotypes, or both. By helping researchers understand in which cell types the CpG sites are affected by a disease, HIRE can ultimately facilitate the development of epigenetic therapies by targeting the specifically affected cell types.

Results

Method overview

HIRE is a hierarchical model that closely follows the data generation process. Its elaborate modeling depicts how phenotypes affect the methylation levels of each sample. Here, we briefly introduce the method. The technical details are provided in the Methods section and the Supplementary Methods.

Let us first review the cornerstone in most EWAS approaches. These methods model the observed methylation levels of the m CpG sites for sample i, Oi = (O1i, O2i, …, Omi)T, as the weighted average of the methylation profiles of K cell types, ui = (ui1, ui2, …, uiK). The weights are the cellular compositions pi = (p1i, p2i, …, pKi)T of sample i (see the top panel of Fig. 1a). However, regardless of whether the reference is known a priori or not, the existing methods assume that the cell-type-specific methylation profiles uis remain the same for all samples: ui = M, for i = 1, …, n. Unfortunately, because the methylation levels can actually change with covariates such as age and disease status, ignoring the covariates’ effects and enforcing static reference methylomes can bias the estimation of pi and thus affect all downstream analyses14. More importantly, the assumption that cell-type-specific methylation profiles are the same for each sample prevents the detection of cell-type-specific risk-CpG sites.

Fig. 1

A simple cartoon illustration of the HIRE model with three cell types (K = 3) and two phenotypes (disease status and age; q = 2). a Data generation procedure for the observed methylation vector Oi for sample i (i = 1, …, n). In the top panel, Oi is the convolution of cell-type-specific methylation profiles ui with cellular compositions pi. Both ui and pi depend on the attributes of sample i. The bottom panel describes how sample i’s phenotypes affect ui via two phenotype-effect matrices B1 and B2. In B1 and B2, the white square represents zero, which indicates that the phenotype exerts no influence on the corresponding methylation level in ui. b Inputs and outputs of HIRE. We input the observed methylation matrix O, the phenotype data matrix X, and a predetermined cell type number K into HIRE, and HIRE outputs the estimates for the cellular compositions \(\widehat {\mathbf{p}}\), the baseline methylation profiles \(\widehat {\boldsymbol{\mu }}\), the phenotype effects \(\widehat {\mathbf{B}}_\ell \), and the penalized BIC value. In addition, HIRE tests whether there is any association between CpG site j and phenotype \(\ell \) in cell type k (\(H_0:\beta _{jk\ell } = 0\) vs \(H_1:\beta _{jk\ell } \ \ne\ 0\)) and provides the p-values.

For association detection at the aggregate level, after estimation of pi using the deconvolution-based approach or its surrogates from principal component-based methods, the existing methods examine a linear model in which the phenotypes \({\mathbf{x}}_i = (x_{i1}, \ldots ,x_{i\ell }, \ldots ,x_{iq})^T\) and the cellular proportions pi exert additive effects on the methylation level Oi:

(1)

A CpG-site j is then associated with phenotype \(\ell \) if we reject the null hypothesis that the covariate coefficient \(T_{j\ell }\) equals zero.

In contrast, HIRE further models the effect of each phenotype on each cell type as shown in the bottom panel of Fig. 1a. In cell type k, sample i’s cell-type-specific methylation profile, uik, is the sum of the corresponding baseline cell-type-specific methylation levels, μk, and the phenotype effects \({\mathbf{B}}_{k\ell }x_{i\ell }\) on sample i from all of the \(\ell = 1, \ldots ,q\) phenotypes: \({\mathbf{u}}_{ik} = {\boldsymbol{\mu }}_k + \mathop {\sum}_{\ell = 1}^q {{\mathbf{B}}_{k\ell }} x_{i\ell }\), where \(x_{i\ell }\) is phenotype \(\ell \) of sample i and \({\mathbf{B}}_{k\ell } = (\beta _{1k\ell }, \ldots ,\beta _{mk\ell })^T\), the kth column of \({\mathbf{B}}_\ell \), reflects the association of phenotype \(\ell \) with each of the m CpG sites in cell type k. Thus, by collecting the baseline cell-type-specific methylation profiles into μ = (μ1, …, μK) and denoting the m by K phenotype coefficient matrix \((\beta _{jk\ell }:1 \le j \le m,1 \le k \le K)\) by \({\mathbf{B}}_\ell \), we now have:

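$$E({\mathbf{O}}_i) = {\boldsymbol{\mu }}{\mathbf{p}}_i + \mathop {\sum}\limits_{\ell = 1}^q {{\mathbf{B}}_\ell } {\mathbf{p}}_ix_{i\ell }.$$
(2)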

A comparison of \({\mathrm{x}}_i\) in Eq. (1) and \({x_{i \ell }}{\boldsymbol{p}}_i ,{\ell} = 1,…,q\) in Eq. (2) reveals that via the two-layer hierarchical model HIRE correctly captures the multiplicative effects of the cellular compositions on the phenotype effects (see also Methods and the Supplementary Methods). As a result, HIRE achieves greater statistical power for association detection at the aggregate level and enables the fine-scale resolutions that were previously infeasible. We mathematically prove that the HIRE model is identifiable under mild conditions that are easily met in reality (see Theorem 1 and its proof in Methods).
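The multiplicative structure can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the mean of Oi is obtained by averaging the cell-type-specific profiles uik with weights pki, and it verifies that this equals the baseline term plus the sum of phenotype effects scaled by the composition. The function names, array layouts, and toy values are all hypothetical.

```python
import numpy as np

def mean_methylation(mu, B, p_i, x_i):
    """Mean of O_i written as mu p_i + sum_l B_l p_i x_il.

    mu  : (m, K) baseline cell-type-specific methylation profiles
    B   : (q, m, K) phenotype-effect matrices B_1, ..., B_q
    p_i : (K,) cellular composition of sample i
    x_i : (q,) phenotypes of sample i
    """
    baseline = mu @ p_i
    effects = np.einsum("lmk,k,l->m", B, p_i, x_i)
    return baseline + effects

def mean_via_profiles(mu, B, p_i, x_i):
    """Same mean, computed by first building u_ik = mu_k + sum_l B_kl x_il."""
    u_i = mu + np.einsum("lmk,l->mk", B, x_i)   # (m, K) profiles of sample i
    return u_i @ p_i                            # weighted average over cell types

# Toy example (hypothetical values): 4 CpG sites, 3 cell types, 2 phenotypes.
rng = np.random.default_rng(0)
m, K, q = 4, 3, 2
mu = rng.uniform(0, 1, size=(m, K))
B = np.zeros((q, m, K))
B[0, 1, 2] = 0.2                 # phenotype 1 affects CpG site 1 in cell type 3 only
p_i = np.array([0.5, 0.3, 0.2])
x_i = np.array([1.0, 0.4])

mean_a = mean_methylation(mu, B, p_i, x_i)
mean_b = mean_via_profiles(mu, B, p_i, x_i)
shift = mean_a - mu @ p_i        # effect of the phenotypes on the bulk signal
```

In this toy case the bulk shift at the affected site is the effect size times the proportion of the affected cell type times the phenotype value, which is why the phenotype alone is not the right covariate.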

Figure 1b summarizes the inputs and outputs of HIRE. Given the methylation measurements at the aggregate level of n samples, HIRE can estimate all parameters of interest: pi (i = 1, …, n), μ, and \({\mathbf{B}}_\ell \) (\(\ell = 1, \ldots ,q\)). HIRE then determines whether any association exists between CpG site j and phenotype \(\ell \) in each individual cell type by testing the hypotheses \(H_0:\beta _{jk\ell } = 0\) versus \(H_1:\beta _{jk\ell } \ \ne\ 0\). When the null hypothesis \(H_0:\beta _{jk\ell } = 0\) is rejected, HIRE calls CpG site j a risk-CpG site for phenotype \(\ell \) in cell type k. The detection of cell-type-specific risk-CpG sites cannot be performed with any of the existing state-of-the-art methods.

Moreover, HIRE allows users to prespecify the number of cell types K. When K is unknown, HIRE selects the number of cell types according to the penalized Bayesian information criterion (pBIC)20 (Supplementary Methods).

Simulation

As the definition of the gold standard for real data is debatable21,22, we designed extensive simulation studies to evaluate the performance of HIRE and compared it with commonly used methods: unadjusted analysis, SVA, RefFreeEWAS, EWASher, and ReFACTor (Methods). We generated datasets in which the observed methylation was a mixture of several cell types and each sample was accompanied by a diseased or normal status and a continuous age attribute. We deliberately designed some cell types to have similar baseline methylation profiles to mimic cell types from the same cell lineage. We set the sample size n to 180, 300, and 600 and let the underlying cell type number K be 3, 5, and 7. For each pair of (n, K), we investigated two scenarios in which (1) all phenotype effects \(\beta _{jk\ell }\)s are zero (the true null case) to compare the ability of each method to control false positives; and (2) a small portion of \(\beta _{jk\ell }\)s are non-zero (the true alternative case) to study each method’s power to detect risk-CpG sites. Under the true alternative, both the binary and the continuous phenotypes were assumed to have cell-type-specific risk-CpG sites and to affect the cell-type proportions among the samples10. We further simulated phenotype effects with various directions and magnitudes.

Under the true null, HIRE, EWASher, and ReFACTor control the false positive rates (FPRs) very well: none are greater than 0.05% (Table 1 and Supplementary Figs. 1–9). In comparison, RefFreeEWAS often has FPRs greater than 0.1% and thus does not perform as well as HIRE, and the unadjusted analysis and SVA further suffer from dramatic inflation of false positives. For the true alternative settings, given that the FPRs are well controlled (below 0.05%), HIRE achieves the highest true positive rates (TPRs) of all methods in every simulation setting (see also Fig. 2a and Supplementary Figs. 10–17). As expected, as the sample size increases, HIRE’s power increases. For example, when the data include five cell types, HIRE can identify 89.6% of the risk-CpG sites with 300 samples, and HIRE can detect almost all risk-CpG sites when the sample size reaches 600, which is a typical sample size for EWAS. Although EWASher and ReFACTor have low FPRs, they miss a large proportion of risk-CpG sites. EWASher’s maximum TPR is only 35.33%, and ReFACTor’s maximum TPR is slightly over 60%. However, in those cases, HIRE’s power is greater than 95%. Consistent with the true null scenario, in the true alternative, RefFreeEWAS has inflated FPRs compared to HIRE, and the unadjusted analysis and SVA consistently produce large numbers of false positives. Therefore, HIRE substantially improves the power of association detection at the aggregate level compared with existing methods.

Table 1 Performance of HIRE and other competing methods in simulation studies
Fig. 2

Association detection performance of HIRE and commonly used methods in the true alternative setting with K = 3 and n = 180. Source data are provided as a Source Data file. In all figures, red corresponds to HIRE; yellow indicates the unadjusted analysis; brown represents SVA; purple refers to RefFreeEWAS; dark blue indicates EWASher; and light blue corresponds to ReFACTor. a ROC curves of HIRE and commonly used methods. HIRE has the largest area under the curve among all of the methods. b True cell-type-specific association pattern with disease status for 10,000 simulated CpG sites; columns correspond to cell types, and the rows represent the CpG sites. Dark cells correspond to risk-CpG sites, and grey cells are CpG sites not associated with the disease status. c Detected cell-type-specific association pattern with disease status by HIRE. Darkness represents \(-{\mathrm{log}}_{10}(p - {\mathrm{value}})\). d–i The p-value density plots for association with disease status in the simulation dataset for d HIRE, e unadjusted analysis, f SVA, g RefFreeEWAS, h EWASher, and i ReFACTor. j–o The Q-Q plots for association with disease status for j HIRE, k unadjusted analysis, l SVA, m RefFreeEWAS, n EWASher, and o ReFACTor

In multiple hypothesis testing, the p-values from the truly null features should follow a uniform distribution on (0, 1), whereas those for the truly alternative features are concentrated near zero23. Both the histograms (Fig. 2d–i) and the Q-Q plots (Fig. 2j–o) show that the p-value distribution of HIRE best fits the underlying truth, in which only a small proportion of the features are signals, followed by RefFreeEWAS and ReFACTor. EWASher easily overcorrects signals, with its p-value density having a dip near zero (Fig. 2h), thus failing to detect the true associations. In contrast, the unadjusted analysis and SVA generate very small p-values clustered near zero, resulting in inflated type I errors.

In addition to the traditional association detection at the aggregate level, HIRE can identify the association for each CpG site with the phenotypes under each cell type. Table 2 shows the FPR and TPR of HIRE for each cell type in various simulation settings. Such fine analysis is not possible with the other methods. Consistent with association detection at the aggregate level, HIRE always controls the FPR well. When K = 3 and n = 180, HIRE accurately detects the risk-CpG sites associated with disease status, with a TPR of greater than 83% and an FPR of 0.01% or less in all three cell types. Similarly, most of the CpG sites affected by age are also correctly identified in each cell type. HIRE’s learned cell-type-specific association patterns closely match the underlying true associations (see Fig. 2b, c and Supplementary Figs. 18–26). Once again, HIRE’s power decreases with the number of cell types and increases with the sample size. When the samples consist of seven cell types and the proportion of the least abundant cell type is as low as 4.2%, given a typical current EWAS with around 600 samples, HIRE can detect most cell-type-specific risk-CpG sites reasonably well. Moreover, HIRE’s estimates for the baseline methylation profiles, cellular compositions, and phenotype effects have little bias (Supplementary Figs. 27–62); therefore, HIRE can provide accurate estimates and is powerful in detecting cell-type-specific risk-CpG sites.

Table 2 Performance of HIRE in detecting cell-type-specific risk-CpG sites

In the HIRE model, we assume that different CpG sites are independent, and we investigate the performance of HIRE when this model assumption is violated and dependence exists among nearby CpG sites. Specifically, we assume that every 50 consecutive CpG sites belong to a block. For CpG sites within the same block, their random noises \({\boldsymbol{\epsilon }}\) follow a multivariate normal distribution with mean zero and 50 × 50 covariance matrix Σ, and Σ’s corresponding correlation matrix has its (i, j) entry equal to \(\rho ^{|i - j|}\). We set ρ to 0.8, 0.6, and 0.4. A comparison of Supplementary Tables 1–3 with Supplementary Table 4 shows that even when strong correlations exist among nearby CpG sites, HIRE still performs well in controlling the FPR and detecting the risk-CpG sites under this model misspecification setting.
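The block-correlated noise described above can be sketched as follows. This is an illustrative sketch rather than the authors' simulation code: the noise standard deviation `sd`, the independence of noise across samples, and the function names are assumptions; only the block size of 50 and the \(\rho ^{|i - j|}\) correlation structure come from the text.

```python
import numpy as np

def ar1_correlation(size, rho):
    """Correlation matrix whose (i, j) entry equals rho ** |i - j|."""
    idx = np.arange(size)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def block_noise(m, n, rho, sd=0.01, block=50, seed=0):
    """Draw an (m, n) noise matrix for m CpG sites and n samples.

    Sites within the same block of consecutive CpGs are correlated;
    different blocks and different samples are independent.
    `sd` is a hypothetical noise scale, not a value from the paper.
    """
    if m % block != 0:
        raise ValueError("m must be a multiple of the block size")
    rng = np.random.default_rng(seed)
    cov = sd ** 2 * ar1_correlation(block, rho)
    chol = np.linalg.cholesky(cov)                    # cov = chol @ chol.T
    z = rng.standard_normal((m // block, block, n))   # independent N(0, 1) draws
    # Colour the draws block by block, then stack the blocks along the CpG axis.
    return np.einsum("ab,gbn->gan", chol, z).reshape(m, n)

eps = block_noise(m=100, n=5000, rho=0.8)
```

Adding such a matrix to the mean methylation values, in place of independent noise, reproduces the kind of misspecification examined in this experiment.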

To further evaluate HIRE’s performance on experimentally mixed samples, we analyzed a semi-simulated dataset that includes six samples mixed from six purified cell types in predetermined proportions24. Once again, HIRE successfully recovers the six underlying reference cell types and estimates the cellular compositions well (see Methods).

Real data analysis

HIRE also provides greater insight into real data than previous studies. The rheumatoid arthritis (RA) dataset3 contains methylation profiles collected from the whole blood of 354 patients with RA and 335 normal participants. In addition to the RA status, other attributes such as gender, smoking history, age, and batch information are available. We first corrected the batch effects and then applied HIRE to the dataset (Methods). Figure 3a displays the p-values regarding the association with the RA status for each CpG site in each cell type, in which HIRE selected six cell types (Supplementary Fig. 63a), consistent with the number of cell types in the previous study13. Despite potential batch effects and biological variability, three of the six cell types can be matched to known blood cell references: cell type 1 was matched to CD4+ T cells, cell types 2 and 4 were matched to neutrophils, and the remaining three cell types cannot be aligned to the references (Methods and Supplementary Fig. 64). HIRE detected 63 risk-CpG sites in cell type 3 (the largest number of associations across all cell types) but no risk-CpG sites in cell type 1 (Supplementary Table 5). Therefore, the disease status affected some but not necessarily all cell types. Note that the significant CpG site cg06373940 called by HIRE is located on gene ERCC3. The level of ERCC3’s corresponding protein has been reported to increase in RA synovium25. Moreover, we found that five CpG sites had a significant association with smoking history (Supplementary Fig. 65 and Supplementary Table 6). One of them is cg05575921, which was recently linked to smoking in two other independent studies of blood samples26,27. However, these findings were missed by the association detection at the aggregate level in previous analyses of the same dataset11,13. The p-value density plots and Q-Q plots for the commonly used methods are also displayed in Fig. 3c–n; they present patterns similar to those observed in the simulation study except for an obvious overcorrection by ReFACTor.

Fig. 3

Application of HIRE and commonly used methods to two real methylation datasets: RA and GALA II. Source data are provided as a Source Data file. a Cell-type-specific association pattern with RA status detected by HIRE in the RA dataset. Darkness represents the −log10(p−value). b Cell-type-specific association pattern with gender detected by HIRE in the GALA II dataset. The darkness represents the −log10(p−value). c–h The p-value density plots for association with RA status in the RA dataset for c HIRE, d unadjusted analysis, e SVA, f RefFreeEWAS, g EWASher, and h ReFACTor. i–n Q-Q plots for association with RA status in the RA dataset for i HIRE, j unadjusted analysis, k SVA, l RefFreeEWAS, m EWASher, and n ReFACTor

The high resolution provided by HIRE makes it a powerful tool for EWAS. Rahmani et al. used ReFACTor13 to analyze the GALA II blood methylation dataset28, which consists of 573 samples collected from a pediatric Latino population. Each sample includes the gender information and belongs to one of the following four populations: Mexican, Mixed Latino, Puerto Rican, and Other Latino. We applied HIRE to the dataset to investigate whether any cell-type-specific CpG sites were associated with gender and ethnicity. We created three dummy variables to represent the four ethnic groups. By taking the indicators of ethnicity as phenotypes in the model, HIRE automatically and simultaneously accounts for the population differences in cell composition and cell-type-specific methylation levels. HIRE correctly selected the number of cell types as six as reported in the previous study13 (Supplementary Fig. 63b). According to cell-type alignment, cell types 1 and 5 can be annotated as CD4+ T cells; cell types 2, 3, and 4 belong to neutrophils; and cell type 6 was annotated as CD56+ natural killer cell (CD56+ NK) using the references (Supplementary Fig. 66). HIRE found that 1936 CpG sites were associated with ethnicity across all cell types (Supplementary Fig. 67) and identified 14, 52, 155, 15, 18, and 14 risk-CpG sites for gender in cell types 1–6, respectively (Fig. 3b). Gene set enrichment analysis showed that the genes that harbored risk-CpG sites for gender were significantly enriched in seven canonical pathways (Supplementary Table 7), of which the PID_CMYB_PATHWAY was ranked the highest. The transcription factor c-MYB in the PID_CMYB_PATHWAY enhances the progression of breast cancer29; therefore, the different occurrence rates of breast cancers in men and women may be linked to the differences at the epigenome level. In comparison, only one pathway was found to be enriched with the genes that host the risk-CpG sites claimed by ReFACTor at the aggregate level (Supplementary Table 8). All of these observations highlight the importance of the finer-scale resolutions of HIRE.

Discussion

In reality, the phenotype may affect a risk-CpG site in some but not all of the cell types. HIRE can detect the cell-type-specific association pattern with each phenotype for EWAS. The identification of cell-type-specific risk-CpG sites will help epigenetic therapies to target the affected cell types in a more effective manner.

Statistically, instead of assuming fixed reference methylomes for all samples as the existing methods do9,13,16, HIRE allows each sample’s cell-type-specific methylation profiles to depend on its phenotypes. HIRE therefore correctly models the multiplicative effects of the cellular compositions on the observed methylation levels, whereas the existing approaches all misspecify the cellular compositions as additive effects (Methods). As a result, HIRE enables the detection of cell-type-specific risk-CpG sites that cannot be feasibly detected with existing state-of-the-art methods. As a byproduct, HIRE also improves the statistical power of association detection at the aggregate level relative to existing state-of-the-art methods. Computationally, the time complexity of one iteration by HIRE is \(O(nmKp + nK^3)\), so each iteration is fast when K is moderate. The statistical and computational advantages equip HIRE to be scaled up for large-cohort EWAS.

So far, in the EWAS community, no gold-standard exists for the comparison of various methods. Ideally, we would like to have epigenetic spike-in experiments in which purified cell types are isolated, CpG-sites are epigenetically edited on a per-cell-type basis, and cell types are finally mixed in predetermined proportions. Given such experiments, the underlying knowledge of which CpGs are differentially methylated in each cell type and the cell mixing proportions for each sample are known. However, biotechnologies for epigenetic editing, such as CRISPR-Cas, are still not mature at this stage, with many off-target modifications30. Therefore, most computational EWAS studies refer to numerical simulation studies rather than to experimental studies when evaluating the performance of their algorithms12,13. Here, we follow the example of previous comparative studies and design our simulation studies to serve as the computational counterpart of experimental spike-in studies. With the rapid advances in epigenetic editing, we hope the community can devote greater effort in the near future to the creation of a gold-standard dataset, such as those generated in the early years for gene expression microarray studies31.

The beta-values that represent methylation levels always lie between zero and one. As previous approaches to EWAS often assume normal distribution for the beta-values and show good performances in real applications9,13, in HIRE, we also assume that the beta-values follow a normal distribution. Consequently, the fitted methylation level may lie outside the range of [0, 1]. Nevertheless, we do in fact constrain the baseline methylation profiles μjks to the closed interval [0, 1] and force the cellular compositions pkis to be non-negative and to add up to one: \(\mathop {\sum}_{k = 1}^K {p_{ki}} = 1\). As a result, because the phenotypes have no effect on most CpG sites, most observations, Ojis, have their means \(\mathop {\sum}_{k = 1}^K {\mu _{jk}} p_{ki}\)s in [0, 1]. In fact, for both the RA dataset and the GALA II dataset, more than 99.99% of the fitted methylation values \(\hat O_{ji}\)s based on HIRE estimates lie between zero and one. Therefore, the normal assumption fits the data reasonably well and does not have a large effect on the performance of HIRE.

One major issue for all of the cell-type deconvolution methods is that deconvolution cannot be achieved if the cellular compositions do not vary among samples. For example, assuming that the samples are mixtures of two cell types and pi = p for all of the samples, then the observed methylation profile Oi equals ui1p1 + ui2p2 = (ui1 + p2C)p1 + (ui2 − p1C)p2 := \(\widetilde {\mathbf{u}}_{i1}p_1 + \widetilde {\mathbf{u}}_{i2}p_2\) for any constant C. As a result, ui1 and ui2 are not estimable. In our paper, we show mathematically that HIRE is identifiable under mild conditions in Theorem 1 and that condition (b) of Theorem 1 formulates the requirement for the variability of the cellular compositions (Methods). HIRE can accurately estimate cellular compositions of tissues with great cellular heterogeneity, such as blood. Although the mild conditions in Theorem 1 are easily met for real DNA methylation data, identification of both sufficient and necessary conditions for model identifiability is a theoretically interesting and challenging statistical problem that we will investigate in a future study.
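The two-cell-type argument above can be verified with a few lines of arithmetic. The sketch below uses hypothetical profile values and a hypothetical constant C; it shows that the shifted profiles reproduce the same bulk signal when the composition is fixed, and that a second sample with a different composition breaks the tie.

```python
import numpy as np

p = np.array([0.7, 0.3])          # composition shared by every sample
u1 = np.array([0.2, 0.9, 0.5])    # cell type 1 profile at three CpG sites
u2 = np.array([0.6, 0.1, 0.5])    # cell type 2 profile at the same sites
C = 0.25                          # any constant works

# Alternative profiles built as in the text: u1 + p2*C and u2 - p1*C.
u1_alt = u1 + p[1] * C
u2_alt = u2 - p[0] * C

mix = u1 * p[0] + u2 * p[1]
mix_alt = u1_alt * p[0] + u2_alt * p[1]     # identical to `mix`

# With a different composition the two sets of profiles no longer agree.
p_b = np.array([0.4, 0.6])
mix_b = u1 * p_b[0] + u2 * p_b[1]
mix_b_alt = u1_alt * p_b[0] + u2_alt * p_b[1]
```

This is why variability in the cellular compositions across samples is needed for the profiles to be estimable.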

HIRE requires a moderate sample size to obtain precise estimates because HIRE needs to learn (1 + 2K + qK)m + (K − 1)n parameters with a total of mn observed values. Our simulation studies show that HIRE performs very well at the aggregate level with 180 samples (Table 1). If the sample size drops below 150, say to 120, HIRE can still control the FPR well but begins to lose power (Supplementary Table 9). For small sample sizes, we have also developed a special case of HIRE by reparameterizing all \(\sigma _{jk}^2\)s as one single parameter σ2, and we found that such a variance-stabilized approach can achieve even better inflation control (see Supplementary Figs. 71–76) and power comparable to HIRE (see Supplementary Table 10). Like the two datasets analyzed in the real application, a typical sample size for a current EWAS exceeds 500, thus guaranteeing a high TPR for HIRE. Given the decreasing cost of EWAS, we recommend that researchers collect at least 200 samples for their studies for association detection at the aggregate level and 600 samples for identification of cell-type-specific risk-CpG sites. A larger sample size can further boost the power.
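The parameter count quoted above is easy to tabulate for a planned study. The helper below is a small sketch with hypothetical function names; it only evaluates the formula (1 + 2K + qK)m + (K − 1)n and compares it with the mn observed values.

```python
def hire_parameter_count(m, n, K, q):
    """Number of free parameters: (1 + 2K + qK) per CpG site plus (K - 1) per sample."""
    per_cpg = 1 + 2 * K + q * K     # one noise variance, K baselines, K variances, qK effects
    return per_cpg * m + (K - 1) * n

def observations_per_parameter(m, n, K, q):
    """Ratio of the mn observed values to the number of parameters."""
    return (m * n) / hire_parameter_count(m, n, K, q)

# Example: 10,000 CpG sites, 180 samples, K = 3 cell types, q = 2 phenotypes.
count = hire_parameter_count(m=10_000, n=180, K=3, q=2)
ratio = observations_per_parameter(m=10_000, n=180, K=3, q=2)
```

Because the per-CpG term dominates, the ratio is driven mainly by n relative to 1 + 2K + qK, which is consistent with the need for larger samples as K grows.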

With the popularity of EWAS, we believe that HIRE will be widely applied, and we hope that HIRE can motivate more researchers to mine out finer-scale results from EWAS.

Methods

Multiplicative effects of cellular composition on methylation

In this section, we illustrate that the effects of the cell-type composition are actually multiplicative. Let us assume that the beta-values that represent the methylation levels are observed across m CpG sites for n samples. As the measured sample comprises cells of various types, the observed beta-value is a weighted average of the mean methylation levels of distinct cell types, and the weights correspond to the proportions of each cell type. Let Oji denote the measurement at CpG site j for sample i. If we assume that there exist K cell types in all samples and that the mean methylation level for CpG site j in cell type k is μjk, then

$$O_{ji} = \mathop {\sum}\limits_{k = 1}^K {\mu _{jk}} p_{ki} + \epsilon _{ji},$$

where pki is the proportion of cell type k in sample i with a natural constraint \(\mathop {\sum}_{k = 1}^K {p_{ki}} = 1\), and \(\epsilon _{ji}\) is a random error.

Let us consider a case-control EWAS. Without loss of generality, we assume that CpG site j is differentially methylated between cases and controls in cell type 1 with a mean shift δj1 and that it is not differentially methylated in the remaining cell types. As a result, for case samples,

$$O_{ji} = (\mu _{j1} + \delta _{j1})p_{1i} + \mathop {\sum}\limits_{k = 2}^K {\mu _{jk}} p_{ki} + \epsilon _{ji} = \delta _{j1}p_{1i} + \mathop {\sum}\limits_{k = 1}^K {\mu _{jk}} p_{ki} + \epsilon _{ji}.$$

If we then use Zi to indicate the case-control status of sample i, the observed methylation level becomes

$$O_{ji} = \delta _{j1}p_{1i}Z_i + \mathop {\sum}\limits_{k = 1}^K {\mu _{jk}} p_{ki} + \epsilon _{ji}.$$
(3)

Therefore, the proportions of cell type 1—p1i, i = 1, …, n—have multiplicative effects rather than additive effects on the mean difference between the case and control samples.

The existing methods, which either estimate the cell type proportions explicitly or approximate them implicitly with surrogate variables, add the estimated proportions and the case-control indicator Zi as the covariates to the regression as follows:

$$O_{ji} = \alpha _j + \tau _jZ_i + \mathop {\sum}\limits_{k = 1}^{K - 1} {b_{jk}} \hat p_{ki} + \epsilon _{ji},$$
(4)

where bjks are the regression coefficients. As a result, CpG site j is called differentially methylated on the basis of hypothesis testing for τj = 0. In general, τj in Eq. (4) is not equal to δj1 in Eq. (3). Please see the Supplementary Notes for a numerical example. Moreover, testing for τj = 0 loses the information regarding the cell type in which CpG site j may be at risk. To account for the multiplicative effects, we propose the HIRE model, which preserves the individual cell-type level information and is introduced in the next section.
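A small simulation can illustrate the gap between τj and δj1. This is a sketch under stated assumptions and is not the numerical example from the Supplementary Notes: data for one CpG site are generated from Eq. (3) with hypothetical values, the true proportions are used in place of estimates, and the case-control status is drawn independently of the compositions. The additive model of Eq. (4) and a model with the multiplicative covariate p1i Zi are then fitted by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 2000, 3
delta = 0.3                                    # true shift in cell type 1 for cases
mu_j = np.array([0.2, 0.5, 0.8])               # baseline levels of one CpG site

P = rng.dirichlet(np.array([4.0, 4.0, 4.0]), size=n)   # (n, K) compositions
Z = rng.integers(0, 2, size=n).astype(float)            # case-control indicator
noise = rng.normal(0.0, 0.01, size=n)
O = delta * P[:, 0] * Z + P @ mu_j + noise               # Eq. (3)

# Additive model of Eq. (4): intercept, Z, and K - 1 proportions.
X_add = np.column_stack([np.ones(n), Z, P[:, : K - 1]])
tau_hat = np.linalg.lstsq(X_add, O, rcond=None)[0][1]

# Multiplicative model: all K proportions (they sum to one, so no intercept)
# plus the product p_1i * Z_i.
X_mul = np.column_stack([P, P[:, 0] * Z])
delta_hat = np.linalg.lstsq(X_mul, O, rcond=None)[0][-1]
```

Under these assumptions the additive coefficient is expected to land near δj1 scaled by the average proportion of cell type 1, well below the true shift, whereas the coefficient on p1i Zi should recover the shift itself.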

The HIRE model

HIRE uses a hierarchical model to closely follow the data generation process for the EWAS data. To begin, we assume that the baseline methylation level for CpG site j in cell type k is μjk. For sample i with phenotypes xi = (xi1, …, xiq), the mean methylation value for CpG site j in cell type k is assumed to be \(\mu _{jk} + \mathop {\sum}_{\ell = 1}^q {\beta _{jk\ell }} x_{i\ell }\). In other words, the phenotypes have linear effects where \(\beta _{jk\ell }\) characterizes the influence of phenotype \(\ell \) on CpG site j in cell type k. Let uijk represent the signal from CpG site j in cell type k for sample i with xi. We assume that uijk follows a normal distribution with mean \(\mu _{jk} + \mathop {\sum}_{\ell = 1}^q {\beta _{jk\ell }} x_{i\ell }\) and standard deviation σjk,

$$u_{ijk}\sim N \left (\mu _{jk} + \mathop {\sum}\limits_{\ell = 1}^q {\beta _{jk\ell }} x_{i\ell },\sigma _{jk}^2\right).$$
(5)

After uijks are generated for all of the K cell types, the observed methylation value Oji is sampled as follows:

$$O_{ji}\sim N\left(\mathop {\sum}\limits_{k = 1}^K {u_{ijk}} p_{ki},\sigma _{\epsilon j}^2\right).$$
(6)
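The two-stage sampling in Eqs. (5) and (6) can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function name `sample_observation` and all parameter values are hypothetical.

```python
import random

def sample_observation(mu_j, B_j, x_i, p_i, sigma_j, sigma_eps, rng):
    """Draw O_ji: first u_ijk for each cell type (Eq. 5), then the bulk value (Eq. 6)."""
    u = []
    for k in range(len(mu_j)):
        mean_k = mu_j[k] + sum(b * x for b, x in zip(B_j[k], x_i))
        u.append(rng.gauss(mean_k, sigma_j[k]))       # sigma_j[k] is a standard deviation
    bulk_mean = sum(u_k * p_k for u_k, p_k in zip(u, p_i))
    return rng.gauss(bulk_mean, sigma_eps)

rng = random.Random(1)
mu_j = [0.2, 0.5, 0.7]                           # assumed baselines, K = 3
B_j = [[0.1, 0.0], [0.0, 0.0], [0.0, 0.005]]     # assumed K x q effects, q = 2
x_i = [1, 30]                                    # a case aged 30
p_i = [0.5, 0.3, 0.2]                            # cell type proportions, summing to one
draws = [sample_observation(mu_j, B_j, x_i, p_i, [0.01] * 3, 0.01, rng)
         for _ in range(2000)]

# Mean implied by the model: sum_k p_ki * (mu_jk + sum_l beta_jkl * x_il)
expected = sum((m + sum(b * x for b, x in zip(bk, x_i))) * p
               for m, bk, p in zip(mu_j, B_j, p_i))
```

Averaging `draws` should give a value close to `expected`, because the noise terms at both stages have mean zero.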

Collectively, O = {Oji : 1 ≤ j ≤ m, 1 ≤ i ≤ n} denote the observed data; u = {(uij1, …, uijK)T : 1 ≤ i ≤ n, 1 ≤ j ≤ m} are the missing data; and μj = (μj1, …, μjK)T, \({\mathbf{B}}^{(j)} = (\beta _{jk\ell })_{K \times q}\), \(\sigma _{\epsilon j}^2\), the diagonal matrix \(\Sigma _j = diag(\sigma _{j1}^2, \ldots ,\sigma _{jK}^2)\) for j = 1, …, m, and pi = (p1i, …, pKi)T for i = 1, …, n are the parameters. With \({\mathbf{\Theta }} = \{ {\mathbf{p}}_i,{\boldsymbol{\mu }}_j,{\mathbf{B}}^{(j)},\Sigma _j,\sigma _{\epsilon ,j}^2:1 \le j \le m,1 \le i \le n\} \), the complete data log-likelihood function, lc, can be expressed as follows:

$$\begin{array}{l}l_c({\mathbf{\Theta }}|{\mathbf{O}},{\mathbf{u}}) = \mathop {\sum}\limits_{i = 1}^n {\mathop {\sum}\limits_{j = 1}^m {\left\{ { - \frac{1}{2}{\mathrm{log}}\sigma _{\epsilon ,j}^2 - \frac{{(O_{ji} - {\mathbf{u}}_{ij}^T{\mathbf{p}}_i)^2}}{{2\sigma _{\epsilon ,j}^2}}} \right.} } - \frac{1}{2}\mathop {\sum}\limits_{k = 1}^K {\mathrm{log}}\sigma _{jk}^2\\ \left. { - \frac{1}{2}({\mathbf{u}}_{ij} - {\boldsymbol{\mu }}_j - {\mathbf{B}}^{(j)}{\mathbf{x}}_i)^T\Sigma _j^{ - 1}({\mathbf{u}}_{ij} - {\boldsymbol{\mu }}_j - {\mathbf{B}}^{(j)}{\mathbf{x}}_i)} \right\} + Constant.\end{array}$$

Accordingly, we develop a generalized expectation-maximization algorithm32 to estimate the parameters. In the expectation-maximization algorithm, a good initialization can lead to faster convergence than random starts. We adopt the cellular composition estimations from the methylation matrix decomposition algorithm16 with slight modifications as the initializations. The initial values for the baseline methylation profiles μjk are accordingly estimated by simple linear regressions. As the number of risk-CpG sites is often small, all of the phenotype effects \(\beta _{jk\ell }\) are set to zero at the beginning. For the standard deviations, the initial values are randomly sampled from inverse gamma distributions with small means. We choose the number of cell types K by using a variant of the penalized Bayesian information criterion (pBIC)20 (see details in Supplementary Methods).

For each phenotype \(\ell \), we can conduct the hypothesis test \(H_0:\beta _{jk\ell } = 0\) versus \(H_1:\beta _{jk\ell } \ \ne \ 0\) for any cell type k and any CpG site j. Combining Eqs. (5) and (6), we obtain the following equations:

$$E\left[ {O_{ji}} \right] = \mu _{j1} + \mathop {\sum}\limits_{k = 2}^K {(\mu _{jk} - \mu _{j1})} p_{ki} + \mathop {\sum}\limits_{k = 1}^K {\mathop {\sum}\limits_{\ell = 1}^q {\beta _{jk\ell }} } x_{i\ell }p_{ki},\:\:\:i = 1, \ldots ,n.$$
(7)

We can then take (Oj1, …, Ojn) as the response vector and concatenate 1n, (pk1, …, pkn) (k = 2, …, K) and \((x_{1\ell }p_{k1}, \ldots ,x_{n\ell }p_{kn})\) \((\ell = 1, \ldots ,q;k = 1, \ldots ,K)\) into an n × (q + 1)K design matrix for the linear regression. We plug in the estimated cellular compositions \(\hat p_{ki}\) and test \(\beta _{jk\ell } = 0\) using two-sided t-tests in the linear model. We claim that CpG site j has an association with phenotype \(\ell \) at the aggregate level if phenotype \(\ell \) affects CpG site j in at least one of the K cell types. Note that in the regression we incorporate the estimated cellular compositions into the linear model as multiplicative effects rather than additive effects.
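The design matrix implied by Eq. (7) has 1 + (K − 1) + qK = (q + 1)K columns. The sketch below builds it from estimated proportions and phenotypes; the function name `design_matrix` and the toy inputs are hypothetical, and the t-tests themselves are left to any standard linear-model routine.

```python
def design_matrix(P_hat, X):
    """P_hat: n x K estimated proportions; X: n x q phenotypes.
    Columns: intercept, p_2..p_K, then x_l * p_k for every phenotype l and cell type k."""
    rows = []
    for p_i, x_i in zip(P_hat, X):
        row = [1.0] + list(p_i[1:])          # intercept absorbs cell type 1 (Eq. 7)
        for x in x_i:
            row.extend(x * p for p in p_i)   # multiplicative phenotype-by-proportion terms
        rows.append(row)
    return rows

# Toy input: n = 2 samples, K = 3 cell types, q = 2 phenotypes (status, age).
P_hat = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
X = [[1, 30.0], [0, 45.0]]
D = design_matrix(P_hat, X)
```

Each row of `D` then has (2 + 1) × 3 = 9 entries, and the coefficient attached to the column x_l · p_k is the quantity tested against zero.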

More technical details of the method and the algorithm are available in the Supplementary Methods.

Data simulation

We compared the performance of HIRE with five previous methods—unadjusted analysis, SVA, RefFreeEWAS, EWASHer, and ReFACTor—in 18 simulation settings. We set the sample size n to 180, 300, and 600 and let the underlying cell type number K be 3, 5, and 7. For each pair of (n, K), we investigated the true null case and the true alternative case. As a result, we have in total 3 (the number of sample sizes) × 3 (the number of cell types) × 2 (the true null case and the true alternative case) = 18 simulation settings. For each setting, we considered 10,000 CpG sites and simultaneously accounted for the following factors.

Cell lineage. We first constructed the baseline methylation matrix μ = (μjk)m×K, in which each column corresponds to the baseline methylation levels of a cell type. To mimic the phenomenon in which cell types from the same lineage have similar methylation profiles, we assumed that Ksim of the total K cell types were similar. Specifically, without loss of generality, we assumed that the first Ksim cell types came from the same cell lineage and that the remaining K − Ksim cell types were unrelated to one another. We set Ksim to 2, 2, and 3 for K = 3, 5, and 7, respectively. We generated μjk for cell types k = 1, Ksim + 1, …, K from the beta distribution beta(3, 6) on each CpG site j independently. For each of the remaining cell types k′ = 2, …, Ksim, we randomly selected 20% of the CpG sites and drew their μjk independently from beta(3, 6); for the remaining 80% of CpG sites, we set μjk to μj1 plus a very small random perturbation, thus inducing the similarities among cell types 1 to Ksim.
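A minimal sketch of this construction is given below. The size of the "very small" perturbation is not stated in the text, so `jitter_sd = 0.01` is purely an assumption, as are the function name and the clipping of perturbed values to [0, 1].

```python
import random

def baseline_profiles(m, K, K_sim, rng, jitter_sd=0.01):
    """Return an m x K baseline matrix mu (0-based columns).
    jitter_sd is an assumed stand-in for the unspecified small perturbation."""
    mu = [[rng.betavariate(3, 6) for _ in range(K)] for _ in range(m)]
    for k in range(1, K_sim):                       # columns that share the lineage of column 0
        fresh = set(rng.sample(range(m), m // 5))   # 20% of sites keep an independent draw
        for j in range(m):
            if j not in fresh:
                noisy = mu[j][0] + rng.gauss(0, jitter_sd)
                mu[j][k] = min(1.0, max(0.0, noisy))
    return mu

mu = baseline_profiles(1000, 5, 2, random.Random(4))
close = sum(abs(row[1] - row[0]) < 0.05 for row in mu)   # sites where types 1 and 2 agree
```

With K = 5 and Ksim = 2, roughly 80% or more of the sites should have nearly identical values in the first two columns, while the other columns remain independent beta(3, 6) draws.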

Discrete and continuous phenotypes. We further generated a discrete and a continuous phenotype x = (x1, x2)T for each individual i (i = 1, …, n). We let the first n/3 individuals be the control samples with xi1 = 0 for i = 1, …, n/3 and the remaining 2n/3 individuals serve as cases with xi1 = 1 for i = n/3 + 1, …, n. The continuous phenotypes x2 = (x12, …, xi2, …, xn2)T were independently drawn from a Unif(20, 50) to act as age.

Phenotype effects with different magnitudes and directions. We then simulated the phenotype effect \(\beta _{jk\ell }\) of each phenotype \(\ell \) on CpG site j in cell type k. For the true null cases, all of the \(\beta _{jk\ell }\)s are zero. For a true alternative setting, we set nonzero phenotype effects as follows.

For phenotype 1 (the case/control status), we let it affect the first 10 CpG sites in all of the cell types: βjk1 ≠ 0 for j = 1, …, 10 and k = 1, …, K. We then assumed that the next 10 CpG sites were influenced by the disease status in the first Ksim cell types, which come from the same lineage, but not in the other cell types: βjk1 ≠ 0 (k = 1, …, Ksim) and βjk1 = 0 (k = Ksim + 1, …, K) for any j = 11, …, 20. Furthermore, for each cell type k ∈ {Ksim + 1, …, K}, we let the disease status affect CpG sites j = 20 + 10(k − Ksim − 1) + 1, …, 20 + 10(k − Ksim) only in cell type k. We generated the cell-type-specific effects of age in a similar fashion for CpG sites 21 to 40 + 10(K − Ksim).

For each nonzero βjk1, we let βjk1 = rjk · ωjk, where ωjk ~ Unif(0.07, 0.15) and rjk takes values of 1 and −1 with equal probabilities. Thus, βjk1s can have both positive and negative effects. In the same spirit, we generated nonzero βjk2s with \(r_{jk}^\prime \)s and \(\omega _{jk}^\prime \)s where \(\omega _{jk}^\prime \sim Unif(0.007,0.015)\).
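The index layout and the sign-times-magnitude construction for the case/control effects can be sketched as follows. Indices are 1-based to match the text; the function name `case_control_effects` is hypothetical, and the age effects (which use the narrower range Unif(0.007, 0.015)) are omitted for brevity.

```python
import random

def case_control_effects(K, K_sim, rng):
    """Return {(j, k): beta_jk1} for the nonzero case/control effects (1-based j and k)."""
    # Sites 1-10: affected in every cell type.
    pairs = [(j, k) for j in range(1, 11) for k in range(1, K + 1)]
    # Sites 11-20: affected only in the K_sim cell types of the shared lineage.
    pairs += [(j, k) for j in range(11, 21) for k in range(1, K_sim + 1)]
    # One block of 10 sites for each remaining cell type, affected in that type only.
    for k in range(K_sim + 1, K + 1):
        start = 20 + 10 * (k - K_sim - 1) + 1
        pairs += [(j, k) for j in range(start, start + 10)]
    # beta = r * omega with r = +1 or -1 and omega ~ Unif(0.07, 0.15).
    return {jk: rng.choice([1, -1]) * rng.uniform(0.07, 0.15) for jk in pairs}

beta1 = case_control_effects(3, 2, random.Random(3))
```

For K = 3 and Ksim = 2 this yields 30 + 20 + 10 = 60 nonzero entries; every other βjk1 is zero.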

Association between phenotypes and cellular compositions. Notice that the phenotypes may be associated with the cellular composition. Therefore, when K = 3, we drew pi = (p1i, …, pKi) from a Dirichlet distribution Dir(4, 4, 2 + 0.1xi2) if sample i is a control and pi ~ Dir(4, 4, 5 + 0.1xi2) if it is a case; when K = 5, we let pi ~ Dir(3, 3, 3, 3, 2 + 0.1xi2) for a control sample and pi ~ Dir(3, 3, 3, 3, 5 + 0.1xi2) for a case sample; and when K = 7, we sampled pi ~ Dir(1, 3, 3, 3, 2, 2, 2 + 0.1xi2) for controls and pi ~ Dir(1, 3, 3, 3, 2, 2, 5 + 0.1xi2) for cases.
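The phenotype-dependent compositions for the K = 3 setting can be sketched with standard-library calls only. A Dirichlet draw is obtained by normalising independent gamma draws; the helper names are hypothetical, and the sample size n = 180 is one of the three sizes used in the text.

```python
import random

def dirichlet(alpha, rng):
    """Sample from Dir(alpha) by normalising independent Gamma(alpha_k, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

def sample_composition(is_case, age, rng):
    """K = 3: Dir(4, 4, 2 + 0.1 * age) for controls, Dir(4, 4, 5 + 0.1 * age) for cases."""
    last = (5 if is_case else 2) + 0.1 * age
    return dirichlet([4, 4, last], rng)

rng = random.Random(2)
n = 180
ages = [rng.uniform(20, 50) for _ in range(n)]      # continuous phenotype x_i2
status = [0] * (n // 3) + [1] * (2 * n // 3)        # first n/3 controls, remaining 2n/3 cases
P = [sample_composition(z, a, rng) for z, a in zip(status, ages)]

ctrl_p3 = sum(p[2] for p, z in zip(P, status) if z == 0) / (n // 3)
case_p3 = sum(p[2] for p, z in zip(P, status) if z == 1) / (2 * n // 3)
```

Because the last concentration parameter is larger for cases and grows with age, the average proportion of the third cell type is higher among cases, which is the association between phenotype and composition that the simulation is designed to include.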

Finally, we generated the observed value Oji for CpG site j of sample i as follows: sample uijk from \(N(\mu _{jk} + \beta _{jk1}x_{i1} + \beta _{jk2}x_{i2},0.01^2)\) for k = 1, …, K; and sample Oji from \(N(\mathop {\sum}_{k = 1}^K {u_{ijk}} p_{ki},0.01^2)\). If Oji lies outside the interval (0, 1), we truncate it to zero if it is lower than zero and to one if it is greater than one.

Semi-simulated dataset including samples with known cell mix proportions

The GEO dataset GSE110554 (ref. 24) contains purified cell-type-specific methylation profiles for six cell types: neutrophils, monocytes, B cells, CD4+ T, CD8+ T, and NK. Moreover, GSE110554 includes mixed samples whose methylation signals were aggregated from the six cell types with predetermined cell mix proportions. Therefore, because of the known cell type and cellular proportion information, GSE110554 is an ideal dataset with which to test HIRE’s performance.

In GSE110554, the number of mixed samples is much smaller than the typical size of an EWAS and, as discussed in the manuscript, HIRE usually requires hundreds of samples to obtain accurate and stable results. Therefore, to increase the sample size, we first generated a simulated methylation dataset with 600 samples using the purified methylation profiles. We focused on 10,000 CpG sites, including the 450 IDOL CpG sites, which were previously identified as the optimal library of CpG sites for estimation of leukocyte subtype proportions24, and another 9550 CpG sites whose methylation values across the purified cell types fell within the range of [0.2, 0.8] and had large standard deviations11. We then combined the 600 samples and six mixed samples (generated by method A)24 available in GSE110554 to compose a semi-simulated dataset.

After applying HIRE to the semi-simulated data, we annotated the estimated cell types based on the methylation profiles from GSE110554. Supplementary Figure 69 shows the heatmap for the Pearson correlation matrix between inferred cell types and the underlying truth. The correlation signals on the diagonal are the strongest in each row. HIRE successfully recovers the six underlying cell types. We also compared the estimated cellular compositions with the underlying true proportions for the six mixed samples. Each panel in Supplementary Fig. 70 displays a scatter plot between the cellular proportion estimates and the true mix proportions for a given cell type; they all indicate that HIRE obtains good estimates for cellular compositions.

Cell type matching protocol

Assume that we have the reference methylation profiles for the H annotated cell types. We denote the methylation profile for cell type h as ϕh = (ϕ1h, …, ϕmh) and aim to annotate μk using the references. Following the previous study33, we first calculate the cosine similarity, the Pearson correlation, and the Spearman correlation between μk and ϕh for each cell type h ∈ {1, …, H}. Notice that the three similarity measures lie between −1 and 1, and a high positive value indicates great similarity between two vectors. Second, for each similarity measure \(\ell \) (\(\ell = 1,2,3\)), we identify the cell type \(h_\ell \) that has the maximal degree of similarity with μk. If at least two out of the three similarity measures identify the same reference cell type \(\tilde h\) and their corresponding similarity values are greater than 0.5, then we annotate μk with the reference cell type \(\tilde h\). Otherwise, μk is believed to belong to a new cell type that is not included in the references. We repeat the above process for each methylation profile μk estimated from HIRE.
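The matching rule can be sketched as below. This is one reading of the protocol, under the assumption that a measure casts a vote for its best-matching reference only when that similarity exceeds 0.5; the function names and the toy five-site profiles are hypothetical, and the sketch assumes no profile is constant across sites.

```python
import math

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / math.sqrt(
        sum(x * x for x in a) * sum(y * y for y in b))

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - ma for x in a], [y - mb for y in b])

def ranks(v):
    """Average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for t in range(i, j + 1):
            r[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(a, b):
    return pearson(ranks(a), ranks(b))

def annotate(mu_k, references, threshold=0.5):
    """Return the label chosen by at least two measures with similarity > threshold, else None."""
    votes = {}
    for measure in (cosine, pearson, spearman):
        label, score = max(((h, measure(mu_k, phi)) for h, phi in references.items()),
                           key=lambda t: t[1])
        if score > threshold:
            votes[label] = votes.get(label, 0) + 1
    return next((h for h, c in votes.items() if c >= 2), None)

refs = {"B": [0.1, 0.9, 0.2, 0.8, 0.5], "NK": [0.9, 0.1, 0.8, 0.2, 0.4]}
known = annotate([0.15, 0.85, 0.25, 0.75, 0.5], refs)
novel = annotate([0.5, 0.5, 0.5, 0.5, 0.9], refs)
```

Here `known` resolves to the B-cell reference, whereas `novel` returns `None` because only the cosine similarity exceeds the threshold, so the profile would be treated as a cell type absent from the references.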

Blood cell references

The two real data sets analyzed in our applications were obtained from whole blood. Therefore, we prepared the references from a whole blood methylation study34 with GEO accession code GSE35069. The study included seven isolated blood cell subpopulations—CD4+ T cells, CD8+ T cells, CD14+ monocytes, CD19+ B cells, CD56+ NK cells, neutrophils, and eosinophils—for six individuals. Accordingly, we define the reference profile ϕh for cell type h as the average methylation profile of these individuals, i.e., \({\boldsymbol{\phi }}_h: = \frac{1}{6}\mathop {\sum}_{i = 1}^6 {{\boldsymbol{\phi }}_{hi}} \).

Data preprocessing

The RA dataset is publicly available in GEO with accession number GSE42861. The dataset measures the methylation levels of the whole blood. The methylation data have been normalized by Illumina’s control probe scaling procedure (see Liu et al.3 “Illumina 450K microarray data preprocessing” section for details). The dataset includes 689 samples, and the RA status, age, gender, smoking history, and batch information are available for each sample. We removed two samples, GSM1051535 and GSM1051691, because their smoking information is missing. CpG sites with a high methylation mean (>0.8) or a low methylation mean (<0.2) were discarded11,13. We adjusted the data for batch effects using COMBAT35. The correction process was justified because we did not observe a high degree of co-linearity between the RA status and the batches (Supplementary Fig. 68). The 10,000 most variable CpG sites were kept. For the RA status, we denoted RA patients with 1 and the normal control subjects with 0; we represented men with 1 and women with 0; for the smoking history, we used (0, 0, 0) to refer to “never,” (1, 0, 0) to “ex,” (0, 1, 0) to “current,” and (0, 0, 1) to “occasional” smokers.

We downloaded the GALA II dataset from Gene Expression Omnibus (GEO) with accession number GSE77716. The dataset contains the whole-blood DNA methylation beta-values from 573 samples. The beta-values have been normalized by SWAN36 and corrected for batch effects by COMBAT35. There are two types of covariates: gender and ethnicity. Ethnicity includes Mexican, Mixed Latino, Puerto Rican, and Other Latino. Out of the 573 samples, one sample “GSM2057284” has no gender information, so we removed it. As suggested by previous studies11,13, CpG sites with a mean methylation value of less than 0.2 or higher than 0.8 were filtered out. We selected the 10,000 most variable of the remaining CpG sites. For gender, we denoted men with 1 and women with 0. For the ethnicity variables, we used three dummy variables to represent the four ethnicity categories. In particular, (0, 0, 0), (1, 0, 0), (0, 1, 0), and (0, 0, 1) corresponded to Mexican, Mixed Latino, Puerto Rican, and Other Latino, respectively.
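The filtering and coding steps shared by both datasets can be sketched as follows. The helper names and the CpG identifiers are hypothetical; the text does not specify the variability measure in detail, so the population variance is used here as an assumption; normalization and batch correction are not shown.

```python
from statistics import mean, pvariance

def select_cpgs(beta, n_keep, low=0.2, high=0.8):
    """beta: dict mapping a CpG id to its beta-values across samples.
    Drop sites whose mean is below `low` or above `high`, then keep the n_keep most variable."""
    kept = {cpg: v for cpg, v in beta.items() if low <= mean(v) <= high}
    ranked = sorted(kept, key=lambda cpg: pvariance(kept[cpg]), reverse=True)
    return ranked[:n_keep]

def dummy_code(value, levels):
    """Encode a category with len(levels) - 1 indicators; levels[0] is the reference level."""
    return tuple(int(value == lvl) for lvl in levels[1:])

ETHNICITY = ["Mexican", "Mixed Latino", "Puerto Rican", "Other Latino"]

toy = {"cgA": [0.10, 0.15], "cgB": [0.40, 0.60], "cgC": [0.50, 0.50], "cgD": [0.90, 0.95]}
top = select_cpgs(toy, 1)
```

In the toy input, `cgA` and `cgD` are removed by the mean filter and `cgB` is retained as the most variable remaining site; `dummy_code("Puerto Rican", ETHNICITY)` gives (0, 1, 0), matching the coding described above.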

For ReFACTor and EWASHer, according to their rules, we first filtered out CpG sites that were consistently hypomethylated or consistently hypermethylated and then regressed out the known covariates. We finally used the residuals to perform their analysis. Note that in their software these steps are processed automatically. For RefFreeEWAS, SVA, and the unadjusted analysis, the phenotypes and the covariates were regarded as the fixed effects in the regression model. In detail, for ReFACTor, in both GALA II and RA datasets, the cell type number “K” was specified to be six, which was the same as in their paper13. For RefFreeEWAS, we fixed the dimensionality of latent space “d” at six in the real data. For SVA, we also fixed the number of surrogate variables to six.

Gene enrichment analysis was carried out on the Broad Institute website http://software.broadinstitute.org/gsea/msigdb/annotate.jsp. The canonical pathways were selected as the basis gene sets, and only pathways with a false discovery rate of less than 0.05 were reported.

Identifiability of HIRE

Although the non-negative matrix factorization (NNMF) O = μP has been widely applied in cell type deconvolution16, where O is the observed methylation matrix, μ is the unknown methylation profile, and P is the unknown cellular compositions, model identifiability is rarely discussed. During the review period of our paper, Rahmani et al.37 provided a setting under which the NNMF model is not identifiable.

Why then does NNMF always provide satisfactory cell type deconvolution results in real practice, and why can HIRE estimate all those parameters well? Here, we show mathematically that the HIRE model is identifiable under mild conditions that are easily met in reality.

Let us first introduce some notation and definitions. In the HIRE model, the whole parameter set is denoted by \({\mathbf{\Theta }}: = \{ {\mathbf{P}}_i,{\boldsymbol{\mu }}_j,{\mathbf{B}}_\ell ^{(j)},\sigma _{jk}^2,\sigma _{\epsilon j}^2:1 \le j \le m,1 \le i \le n,1 \le k \le K,1 \le \ell \le q\} \), where \({\mathbf{P}}_i\) is the cellular composition vector of sample i, μj is the baseline methylation vector of CpG site j, \({\mathbf{B}}_\ell ^{(j)}\) is the phenotype \(\ell \) effect vector on CpG site j, \(\sigma _{jk}^2\) is the cell-type-k noise variance on CpG site j, and \(\sigma _{\epsilon j}^2\) is the overall noise variance on CpG site j.

The observed data in our study are the methylation matrix O = {Oji : 1 ≤ i ≤ n, 1 ≤ j ≤ m} and the covariate matrix \({\mathbf{X}} = ({\mathbf{x}}_1, \ldots ,{\mathbf{x}}_\ell , \ldots ,{\mathbf{x}}_q)\), where \({\mathbf{x}}_\ell \) is the column vector that indicates phenotype-\(\ell \) for the n samples. The observed likelihood function is \(L({\mathbf{\Theta }}|{\mathbf{O}}) = \mathop {\prod}_{i = 1}^n {\mathop {\prod}_{j = 1}^m N } (O_{ji}:{\mathbf{P}}_i^T {\boldsymbol{\mu }}_{j} + \mathop {\sum}_{\ell = 1}^q {x_{i\ell }} {\mathbf{P}}_i^T {\mathbf{B}}_\ell ^{(j)},\mathop {\sum}_{k = 1}^K {\sigma _{jk}^2} \, P_{ki}^2 + \sigma _{\epsilon j}^2)\) (see Eq. (S7) in the Supplementary Methods), where \(N(O:\eta ,\tau ^2)\) indicates the normal density with mean η and variance \(\tau ^2\) at value O.
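For orientation, the marginal mean and variance in the observed likelihood can be recovered from Eqs. (5) and (6) by integrating out the latent signals. The following is a brief sketch of that step, which may differ in presentation from Eq. (S7); it assumes that the uijk are independent across cell types, as implied by the diagonal matrix Σj:

$$\begin{array}{l}E\left[ {O_{ji}} \right] = E\left[ {E\left[ {O_{ji}|{\mathbf{u}}_{ij}} \right]} \right] = \mathop {\sum}\limits_{k = 1}^K {P_{ki}} \left( {\mu _{jk} + \mathop {\sum}\limits_{\ell = 1}^q {\beta _{jk\ell }} x_{i\ell }} \right),\\ Var\left( {O_{ji}} \right) = E\left[ {Var\left( {O_{ji}|{\mathbf{u}}_{ij}} \right)} \right] + Var\left( {E\left[ {O_{ji}|{\mathbf{u}}_{ij}} \right]} \right) = \sigma _{\epsilon j}^2 + \mathop {\sum}\limits_{k = 1}^K {P_{ki}^2} \sigma _{jk}^2.\end{array}$$

Because a normal mixture of normal means is again normal, these two moments fully determine the marginal density of Oji.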

We further define 1K = (1, 1, …, 1)T as a K-dimension column vector with all entries being one, an n by K matrix J1 as \({\mathbf{1}}_n{\mathbf{1}}_K^T\), and an n by K matrix \({\mathbf{J}}_{x_\ell }\) as \({\mathbf{x}}_\ell {\mathbf{1}}_K^T\) for each \(1 \le \ell \le q\). We use \( \odot \) to represent the entry-wise matrix product for two matrices M and N with the same dimension, i.e., \(({\mathbf{M}} \odot {\mathbf{N}})_{ij}: = {\mathbf{M}}_{ij}{\mathbf{N}}_{ij}\).

Theorem 1. If (a) for each cell type k, there exists a CpG site rk such that \({\mathbf{B}}_\ell ^{(r_k)} = 0\) for any phenotype \(\ell \) and \(\mu _{r_kk} = 1\) while \(\mu _{r_kk{\prime}} = 0\) for k′ ≠ k, and (b) the cellular compositions P satisfy \(rank(({\mathbf{J}}_1 \odot {\mathbf{P}}^T,{\mathbf{J}}_{x_1} \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_\ell } \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_q} \odot {\mathbf{P}}^T)) = (q + 1)K\) and \(rank(({\mathbf{1}}_n,{\mathbf{P}}^T) \odot ({\mathbf{1}}_n,{\mathbf{P}}^T)) = K + 1\), then the HIRE model is identifiable. In other words, \(L({\mathbf{\Theta }}|{\mathbf{O}}) = L(\widetilde {\mathbf{\Theta }}|{\mathbf{O}})\) for any O implies \({\mathbf{\Theta }} = \widetilde {\mathbf{\Theta }}\).

Proof: First, by integrating out all O elements except Oji, \(L({\mathbf{\Theta }}|{\mathbf{O}}) = L(\widetilde {\mathbf{\Theta }}|{\mathbf{O}})\) implies \(N(O_{ji}:{\mathbf{P}}_i^T {\boldsymbol{\mu }}_{j} + \mathop {\sum}_{\ell = 1}^q {x_{i\ell }} {\mathbf{P}}_i^T {\mathbf{B}}_\ell ^{(j)},\mathop {\sum}_{k = 1}^K {\sigma _{jk}^2} P_{ki}^2 + \sigma _{\epsilon j}^2)\) = \(N(O_{ji}:\widetilde {\mathbf{P}}_i^T \widetilde {\boldsymbol{\mu }}_j + \mathop {\sum}_{\ell = 1}^q {x_{i\ell }} \widetilde {\mathbf{P}}_i^T\widetilde {\mathbf{B}}_\ell ^{(j)},\mathop {\sum}_{k = 1}^K {\tilde \sigma _{jk}^2} \tilde P_{ki}^2 + \tilde \sigma _{\epsilon j}^2)\). Because the univariate normal distribution is identifiable, we have

$${\mathbf{P}}_i^T{\boldsymbol{\mu }}_j + \mathop {\sum}\limits_{\ell = 1}^q {x_{i\ell }} {\mathbf{P}}_i^T{\mathbf{B}}_\ell ^{(j)} = \widetilde {\mathbf{P}}_i^T\widetilde {\boldsymbol{\mu }}_j + \mathop {\sum}\limits_{\ell = 1}^q {x_{i\ell }} \widetilde {\mathbf{P}}_i^T\widetilde {\mathbf{B}}_\ell ^{(j)},$$
(8)
$$\mathop {\sum}\limits_{k = 1}^K {\sigma _{jk}^2} P_{ki}^2 + \sigma _{\epsilon j}^2 = \mathop {\sum}\limits_{k = 1}^K {\tilde \sigma _{jk}^2} \tilde P_{ki}^2 + \tilde \sigma _{\epsilon j}^2.$$
(9)

Taking j = rk in Eq. (8), we have \(LHS = {\mathbf{P}}_i^T{\boldsymbol{\mu }}_{r_k} + \mathop {\sum}_{\ell = 1}^q {x_{i\ell }} {\mathbf{P}}_i^T{\mathbf{B}}_\ell ^{(r_k)} = {\mathbf{P}}_i^T{\boldsymbol{\mu }}_{r_k} = 0 + P_{ki}\cdot 1 + 0 = P_{ki}\) and similarly \(RHS = \tilde P_{ki}\), so \(P_{ki} = \tilde P_{ki}\), which holds for any i and k. Hence, we obtain \({\mathbf{P}} = \widetilde {\mathbf{P}}\). Next, we rewrite Eq. (8) into a matrix form.

$$({\mathbf{P}}_i^T,x_{i1}{\mathbf{P}}_i^T, \ldots ,x_{iq}{\mathbf{P}}_i^T)\left( {\begin{array}{*{20}{c}} {{\boldsymbol{\mu }}_j} \\ {{\mathbf{B}}_1^{(j)}} \\ \vdots \\ {{\mathbf{B}}_q^{(j)}} \end{array}} \right) = ({\mathbf{P}}_i^T,x_{i1}{\mathbf{P}}_i^T, \ldots ,x_{iq}{\mathbf{P}}_i^T)\left( {\begin{array}{*{20}{c}} {\widetilde {\boldsymbol{\mu }}_j} \\ {\widetilde {\mathbf{B}}_1^{(j)}} \\ \vdots \\ {\widetilde {\mathbf{B}}_q^{(j)}} \end{array}} \right),\;\;i = 1, \ldots ,n.$$

By combining these n equations, it follows that

$$\begin{array}{l}({\mathbf{J}}_1 \odot {\mathbf{P}}^T,{\mathbf{J}}_{x_1} \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_\ell } \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_q} \odot {\mathbf{P}}^T)\left( {\begin{array}{*{20}{c}} {{\boldsymbol{\mu }}_j} \\ {{\mathbf{B}}_1^{(j)}} \\ \vdots \\ {{\mathbf{B}}_q^{(j)}} \end{array}} \right)\\ = ({\mathbf{J}}_1 \odot {\mathbf{P}}^T,{\mathbf{J}}_{x_1} \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_\ell } \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_q} \odot {\mathbf{P}}^T)\left( {\begin{array}{*{20}{c}} {\widetilde {\boldsymbol{\mu }}_j} \\ {\widetilde {\mathbf{B}}_1^{(j)}} \\ \vdots \\ {\widetilde {\mathbf{B}}_q^{(j)}} \end{array}} \right).\end{array}$$
(10)

Because the rank of \(A: = ({\mathbf{J}}_1 \odot {\mathbf{P}}^T,{\mathbf{J}}_{x_1} \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_\ell } \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_q} \odot {\mathbf{P}}^T)\) is (q + 1)K (full column rank), A has a left inverse \(A^{ - 1}\). Multiplying Eq. (10) by \(A^{ - 1}\) from the left on both sides, we obtain \({\boldsymbol{\mu }}_j = \widetilde {\boldsymbol{\mu }}_j\) and \({\mathbf{B}}_\ell ^{(j)} = \widetilde {\mathbf{B}}_\ell ^{(j)}\) for \(1 \le \ell \le q\). Therefore, we have \({\boldsymbol{\mu }} = \widetilde {\boldsymbol{\mu }}\), \({\mathbf{B}} = \widetilde {\mathbf{B}}\).

In addition, because Eq. (9) holds for any i, we can also rewrite it into a matrix form.

$$\left( {\begin{array}{*{20}{c}} 1 & {P_{11}^2} & \ldots & {P_{K1}^2} \\ 1 & {P_{12}^2} & \ldots & {P_{K2}^2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & {P_{1n}^2} & \ldots & {P_{Kn}^2} \end{array}} \right)\left( {\begin{array}{*{20}{c}} {\sigma _{\epsilon j}^2} \\ {\sigma _{j1}^2} \\ \vdots \\ {\sigma _{jK}^2} \end{array}} \right) = \left( {\begin{array}{*{20}{c}} 1 & {P_{11}^2} & \ldots & {P_{K1}^2} \\ 1 & {P_{12}^2} & \ldots & {P_{K2}^2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & {P_{1n}^2} & \ldots & {P_{Kn}^2} \end{array}} \right)\left( {\begin{array}{*{20}{c}} {\tilde \sigma _{\epsilon j}^2} \\ {\tilde \sigma _{j1}^2} \\ \vdots \\ {\tilde \sigma _{jK}^2} \end{array}} \right)$$

The left matrix is equal to \(({\mathbf{1}}_n,{\mathbf{P}}^T) \odot ({\mathbf{1}}_n,{\mathbf{P}}^T)\) which has a full column rank; therefore, it has a left inverse. Consequently, \(\sigma _{\epsilon j}^2 = \tilde \sigma _{\epsilon j}^2\) and \(\sigma _{jk}^2 = \tilde \sigma _{jk}^2\). As a result, \({\mathbf{\Theta }} = \widetilde {\mathbf{\Theta }}\), and we have proven the identifiability of HIRE. \(\square \)

Conditions (a) and (b) are easily met for DNA methylation data. Condition (a) requires that for each cell type k, there exists a CpG site that is not associated with any phenotype and is methylated only in cell type k and in no other cell type. Given the 450K CpG sites assayed by the microarray, we can expect such CpG sites to exist. Moreover, condition (a) can also be relaxed to the condition that for each cell type k, there exists a CpG site rk such that \({\mathbf{B}}_\ell ^{(r_k)} = 0\) for any phenotype \(\ell \) and \(\mu _{r_kk} = 1\) while \(\mu _{r_kk\prime } = 0\) for k′ ≠ k, or there exists a CpG site rk such that \({\mathbf{B}}_\ell ^{(r_k)} = 0\) for any phenotype \(\ell \) and \(\mu _{r_kk} = 0\) while \(\mu _{r_kk{\prime}} = 1\) for k′ ≠ k. The proof follows in a similar manner.

For condition (b), intuitively, the rank requirement of \(({\mathbf{1}}_n,{\mathbf{P}}^T) \odot ({\mathbf{1}}_n,{\mathbf{P}}^T)\) asks the cellular compositions to vary across subjects, which guards against the case in which all the subjects have the same cellular compositions and hence no cell type deconvolution is possible; the rank requirement on \(({\mathbf{J}}_1 \odot {\mathbf{P}}^T,{\mathbf{J}}_{x_1} \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_\ell } \odot {\mathbf{P}}^T, \ldots ,{\mathbf{J}}_{x_q} \odot {\mathbf{P}}^T)\) is the same requirement as those in a standard linear regression, which requires that no collinearity exists among the covariates. Because the sample size n is much larger than the underlying cell type number K and the phenotype number q, the two rank requirements can commonly be satisfied in reality.
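Condition (b) can be checked numerically for a given composition matrix and set of phenotypes. The sketch below is an illustration under assumptions: the helper names are hypothetical, the rank is computed by plain Gaussian elimination with an arbitrary tolerance of 1e-10, and P is stored with one row per sample (that is, as P^T in the notation of Theorem 1).

```python
def matrix_rank(M, tol=1e-10):
    """Rank by Gaussian elimination with partial pivoting."""
    A = [[float(v) for v in row] for row in M]
    n_rows, n_cols, rank = len(A), len(A[0]), 0
    for c in range(n_cols):
        if rank == n_rows:
            break
        piv = max(range(rank, n_rows), key=lambda r: abs(A[r][c]))
        if abs(A[piv][c]) < tol:
            continue                      # no usable pivot in this column
        A[rank], A[piv] = A[piv], A[rank]
        for r in range(rank + 1, n_rows):
            f = A[r][c] / A[rank][c]
            A[r] = [a - f * b for a, b in zip(A[r], A[rank])]
        rank += 1
    return rank

def condition_b_holds(P, X):
    """P: n x K proportions (one row per sample); X: n x q phenotypes."""
    K, q = len(P[0]), len(X[0])
    # Row i of (J_1 . P^T, J_x1 . P^T, ..., J_xq . P^T): p_i followed by x_il * p_i for each l.
    A = [list(p) + [x * pk for x in xs for pk in p] for p, xs in zip(P, X)]
    # Row i of (1_n, P^T) . (1_n, P^T): 1 followed by the squared proportions.
    S = [[1.0] + [pk ** 2 for pk in p] for p in P]
    return matrix_rank(A) == (q + 1) * K and matrix_rank(S) == K + 1

X = [[0], [1], [0], [1], [0], [1]]
P_varied = [[0.2, 0.8], [0.5, 0.5], [0.7, 0.3], [0.3, 0.7], [0.6, 0.4], [0.9, 0.1]]
P_same = [[0.5, 0.5]] * 6
```

With compositions that vary across samples (`P_varied`) both rank requirements are met, whereas identical compositions (`P_same`) fail them, in line with the intuition given above.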

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.