Detection of cell-type-specific risk-CpG sites in epigenome-wide association studies

In epigenome-wide association studies, the measured signals for each sample are a mixture of methylation profiles from different cell types. Current approaches to the association detection claim whether a cytosine-phosphate-guanine (CpG) site is associated with the phenotype or not at aggregate level and can suffer from low statistical power. Here, we propose a statistical method, HIgh REsolution (HIRE), which not only improves the power of association detection at aggregate level as compared to the existing methods but also enables the detection of risk-CpG sites for individual cell types.

Although the existing methods aim to address the cellular heterogeneity problem in EWAS and claim whether a CpG site is associated with phenotypes at the aggregate level, none of them can identify the risk-CpG sites for each individual cell type, thus missing the opportunity to obtain finer-grained results in EWAS.
Here, we propose a method, HIRE, to identify the association in EWAS at a HIgh REsolution: detecting whether a CpG site has any associations with the phenotypes in each cell type (Methods). The keys to HIRE's success are twofold. First, HIRE links the underlying cell-type-specific methylation profiles for each sample to the sample's phenotypes, thus avoiding the bias in estimating the cellular composition by the reference-based and directdecomposition-based methods. Second, HIRE correctly characterizes the cellular compositions as the multiplicative effects, whereas the existing methods inappropriately treat the cell proportions as additive effects (Methods). HIRE is applicable to EWAS with binary phenotypes, continuous phenotypes, or both. By helping researchers understand in which cell types the CpG sites are affected by a disease, HIRE can ultimately facilitate the development of epigenetic therapies by targeting the specifically affected cell types.

Results
Method overview. HIRE is a hierarchical model that closely follows the data generation process. Its elaborate modeling depicts how phenotypes affect the methylation levels of each sample. Here, we briefly introduce the method. The technical details are provided in the Methods section and the Supplementary Methods.
Let us first review the cornerstone in most EWAS approaches. These methods model the observed methylation levels of the m CpG sites for sample i, O i = (O 1i , O 2i , …, O mi ) T , as the weighted average of the methylation profiles of K cell types, u i = (u i1 , u i2 , …, u iK ). The weights are the cellular compositions p i = (p 1i , p 2i , …, p Ki ) T of sample i (see the top panel of Fig. 1a). However, regardless of whether the reference is known a priori or not, the existing methods assume that the cell-type-specific methylation profiles u i s remain the same for all samples: u i = M, for i = 1, …, n. Unfortunately, because the methylation levels can actually change with covariates such as age and disease status, ignoring the covariates' effects and enforcing static reference methylomes can bias the estimation of p i and thus affect all downstream analyses 14 . More importantly, the assumption that cell-typespecific methylation profiles are the same for each sample prevents the detection of cell-type-specific risk-CpG sites.
For association detection at the aggregate level, after estimation of p i using the deconvolution-based approach or its surrogates from principal component-based methods, the existing methods examine a linear model in which the phenotypes x i ¼ ðx i1 ; ; x i' ; ; x iq Þ T and the cellular proportions p i exert additive effects on the methylation level O i : ð1Þ A CpG-site j is then associated with phenotype ' if we reject the null hypothesis that the covariate coefficient T j' equals zero. In contrast, HIRE further models the effect of each phenotype on each cell type as shown in the bottom panel of Fig. 1a. In cell type k, sample i's cell-type-specific methylation profile, u ik , is the summation of the corresponding baseline cell-type-specific methylation levels, μ k , and the phenotype effects B k' x i' on sample i from all the l = 1, …, q phenotypes: u ik ¼ μ k þ P q l¼1 B k' x i' , where x i' is the phenotype ' of sample i and B k' ¼ ðβ 1k' ; ; β mk' Þ T -the kth column of B ' -reflects the association of phenotype ' with each of the m CpG sites in cell type k. Thus, by collecting the baseline cell-type-specific methylation profiles to μ = (μ 1 , …, μ k ) and denoting the m by K phenotype coefficient matrix ðβ jk' : 1 j m; 1 k KÞ by B ' , we now have:

ð2Þ
A comparison of x i in Eq. (1) and x i' p i ; ' ¼ 1; …; q in Eq. (2) reveals that via the two-layer hierarchical model HIRE correctly captures the multiplicative effects of the cellular compositions on the phenotype effects (see also Methods and the Supplementary Methods). As a result, HIRE achieves greater statistical power for association detection at the aggregate level and enables the finescale resolutions that were previously infeasible. We mathematically prove that the HIRE model is identifiable under mild conditions that are easily met in reality (see Theorem 1 and its proof in Methods). Figure 1b summarizes the inputs and outputs of HIRE. Given the methylation measurements at the aggregate level of n samples, HIRE can estimate all parameters of interests -p i (i = 1, …, n), μ, and B ' (' ¼ 1; ; q). HIRE then determines whether any association exists between CpG site j and phenotype ' in each individual cell type by testing the hypotheses H 0 : β jk' ¼ 0 versus H 1 : β jk' ≠ 0. When the null hypothesis H 0 : β jk' ¼ 0 is rejected, HIRE calls CpG site j as a risk-CpG site for phenotype ' in cell type k. The detection of cell-type-specific risk-CpG sites cannot be performed with any of the existing state-of-the-art methods.
Moreover, HIRE allows users to prespecify the number of cell types K. When K is unknown, HIRE selects the number of cell types according to the penalized Bayesian information criterion (pBIC) 20 (Supplementary Methods).
Simulation. As the definition of the gold standard for real data is debatable 21,22 , we designed extensive simulation studies to evaluate the performance of HIRE and compared it with commonly used methods-unadjusted analysis, SVA, RefFreeEWAS, EWASHer, and ReFACTor (Methods). We generated datasets in which the observed methylation was a mixture of several cell types and each sample was accompanied with a diseased or normal status and a continuous age attribute. We deliberately designed some cell types to have similar baseline methylation profiles to mimic cell types from the same cell lineage. We set the sample size n to 180, 300, and 600 and let the underlying cell type number K be 3, 5, and 7. For each pair of (n, K), we investigated two scenarios in which (1) all phenotype effects β jk' s are zerothe true null case-to compare the ability of each method to control false positives; and (2) a small portion of β jk' s are nonzero-the true alternative case-to study each method's power to detect risk-CpG sites. Under the true alternative, both the binary and the continuous phenotypes were assumed to have cell-typespecific risk-CpG sites and to affect the cell-type proportions among the samples 10 . We further simulated phenotype effects with various directions and magnitudes.
Under the true null, HIRE, EWASHer, and ReFACTor control the false positive rates (FPRs) very well: none are greater than 0.05% (Table 1 and Supplementary Figs. 1-9). In comparison, RefFreeEWAS often has FPRs greater than 0.1% and thus does not perform as well as HIRE, and the unadjusted analysis and SVA further suffer from the dramatic inflation of false positives. For the true alternative settings, given that the FPRs are wellcontrolled, with FPRs below 0.05%, HIRE achieves the highest true positive rates (TPR) of all methods in every simulation setting (see also Fig. 2a   In B 1 and B 2 , the white square represents zero, which indicates that the phenotype exerts no influence on the corresponding methylation level in u i . b Inputs and outputs of HIRE. We input the observed methylation matrix O, the phenotype data matrix X, and a predetermined cell type number K into HIRE, and HIRE outputs the estimates for the cellular compositions b p, the baseline methylation profiles b μ, the phenotype effects b B ' , and the penalized BIC value. In addition, HIRE tests whether there is any association between CpG site j and phenotype ' in cell type k For example, when the data include five cell types, HIRE can identify 89.6% of the risk-CpG sites with 300 samples, and HIRE can detect almost all risk-CpG sites when the sample size reaches 600, which is a typical sample size for EWAS. Although EWASHer and ReFACTor have low FPRs, they miss a large proportion of risk-CpG sites. EWASHer's maximum TPR is only 35.33%, and ReFACTor's maximum TPR is slightly over 60%. However, in those cases, HIRE's power is greater than 95%. Consistent with the true null scenario, in the true alternative, RefFreeEWAS has inflated FPRs compared to HIRE, and the unadjusted analysis and SVA always have huge false positives. Therefore, HIRE substantially improves the power of association detection at the aggregate level compared with existing methods. In the multiple hypothesis testing, the p-values from the truly null features should follow a uniform distribution on (0, 1), whereas those for the truly alternative features are concentrated near zero 23 . Both the histograms (Fig. 2d-i) and the Q-Q plots ( Fig. 2j-o) show that the p-value distribution of HIRE is the best fit to the underlying truth-there are only a small proportion of signals, followed by RefFreeEWAS and ReFACTor. EWASHer easily overcorrects signals with its p-value density having a dip near zero (Fig. 2h), thus failing to detect the true associations. In contrast, the unadjusted analysis and SVA generate very small p-values clustered near zero, resulting in inflated type I errors.
In addition to the traditional association detection at the aggregate level, HIRE can identify the association for each CpG site with the phenotypes under each cell type. Table 2 shows the FPR and TPR of HIRE for each cell type in various simulation settings. Such fine analysis is not possible with the other methods. Consistent with association detection at the aggregate level, HIRE always controls the FPR well. When K = 3 and n = 180, HIRE accurately detects the risk-CpG sites associated with disease status a TPR of greater than 83% and an FPR of 0.01% or less in all three cell types. Similarly, most of the CpG sites affected by age are also correctly identified in each cell type. HIRE's learned cell-typespecific association patterns closely matches the underlying true associations (see Fig. 2b, c and Supplementary Figs. [18][19][20][21][22][23][24][25][26]. Once again, HIRE's power decreases with the number of cell types and increases with the sample size. When the samples consist of seven cell types and the proportion of the least abundant cell type is as low as 4.2%, given a typical current EWAS with around 600 samples, HIRE can detect most cell-type-specific risk-CpG sites reasonably well. Moreover, HIRE's estimates for the baseline methylation profiles, cellular compositions, and phenotype effects have little bias (Supplementary Figs. 27-62); therefore, HIRE can provide accurate estimates and is powerful in detecting cell-typespecific risk-CpG sites.
In the HIRE model, we assume that different CpG sites are independent, and we investigate the performance of HIRE when such a model assumption is violated and dependences exist among nearby CpG sites. Specifically, we assume that every 50 consecutive CpG sites belongs to a block. For CpG sites within the same block, their random noises ϵ follow a multivariate normal distribution with mean zero and 50 × 50 covariance matrix Σ, and Σ's corresponding correlation matrix has its (i, j) entry equal to ρ |i−j| . We vary ρ to 0.8, 0.6, and 0.4. A comparison of Supplementary Tables 1-3 with Supplementary Table 4 shows that even when strong correlations exit among nearby CpG sites, HIRE still provides good performances in controlling the FPR and detecting the risk-CpG sites under the model misspecification setting.
To further evaluate HIRE's performance on experimentally mixed samples, we conducted another semi-simulated dataset that includes six samples mixed with six purified cell types in predetermined proportions 24 . Once again, HIRE successfully recovers the six underlying reference cell types and estimates the cellular compositions well (see Methods). Performance of HIRE and other competing methods in simulation studies in detecting risk-CpG sites at the aggregate level. For the true null cases in which no CpG site is at risk, the average of the false positive rates (FPRs) based on five replicates is reported. For the true alternative cases, the averages of the FPRs and the true positive rates (TPRs) based on five replicates are reported. The number of CpG sites at risk is 30, 50, and 60 for the cell type number K = 3, 5, and 7, respectively. HIRE calls a CpG site as significant at the aggregate level if it is at risk in at least one cell type. We used Bonferroni correction for each method to control the family-wise error rate (FWER) below α = 0.01. Because HIRE can provide the p-values of CpG sites for all cell types and phenotypes, the p-value threshold for significance is α/(mKq), where m = 10,000 is the number of CpG sites and q = 2 is the phenotype number. For the other five methods, the p-value threshold is set to α/m. Notice that "0" represents exact zero, and "0.00%" indicates a very small positive number that is rounded down to zero using four decimal places Real data analysis. HIRE also provides greater insight into real data than previous studies. The rheumatoid arthritis (RA) dataset 3 contains methylation profiles collected from the whole blood of 354 patients with RA and 335 normal participants. In addition to the RA status, other attributes such as gender, smoking history, age, and batch information are available. We first corrected the batch effects and then applied HIRE to the dataset (Methods). Expected -log10(P) Expected -log 10 (P) Expected − log 10 (P) Expected -log 10(P) Expected -log10(P) Expected -log10(P) Unadjusted SVA

P-values P-values P-values P-values P-values P-values
Observed -log the number of cell types in the previous study 13 . Despite potential batch effects and biological variability, three of the six cell types can be matched to known blood cell references-cell type 1 was matched to CD4+ T cells, cell types 2 and 4 were matched to neutrophils, and the remaining three cell types cannot be aligned to the references (Methods and Supplementary Fig. 64). HIRE detected 63 risk-CpG sites in cell type 3-the largest number of associations across all cell types-but no risk-CpG sites in cell type 1 (Supplementary Table 5). Therefore, the disease status affected some but not necessarily all cell types. Note that the significant CpG site cg06373940 called by HIRE is located on gene ERCC3. The level of ERCC3's corresponding protein has been reported to increase in RA synovium 25 . Moreover, we found that five CpG sites had a significant association with smoking history (Supplementary Fig. 65 and Supplementary Table 6). One of them is cg05575921, which was recently linked to smoking in two other independent studies of blood samples 26,27 . However, these findings were missed by the association detection at the aggregate level in previous analyses of the same dataset 11,13 . The p-value density plots and Q-Q plots for the commonly used methods are also displayed in Fig. 3c-n; they present patterns similar to those observed in the simulation study except for an obvious overcorrection by ReFACTor.
The high resolution provided by HIRE makes it a powerful tool for EWAS studies. Rahmani et al. used ReFACTor 13 to analyze the GALA II blood methylation dataset 28 , which consists of 573 samples collected from a pediatric Latino population. Each sample includes the gender information and belongs to one of the following four populations: Mexican, Mixed Latino, Puerto Rican, and Other Latino. We applied HIRE to the dataset to investigate whether any cell-type-specific CpG sites were associated with gender and ethnicity. We created three dummy variables to represent the four ethnic groups. By taking the indicators of ethnicity as phenotypes in the model, HIRE automatically and simultaneously accounts for the population differences in cell composition and cell-type-specific methylation levels. HIRE correctly selected the number of cell types as six as reported in the previous study 13 (Supplementary Fig. 63b). According to celltype alignment, cell types 1 and 5 can be annotated as CD4+ T cells; cell types 2, 3, and 4 belong to neutrophils; and cell type 6 was annotated as CD56+ natural killer cell (CD56+ NK) using the references (Supplementary Fig. 66). HIRE found that 1936 Fig. 2 Association detection performance of HIRE and commonly used methods in the true alternative setting with K = 3 and n = 180. Source data are provided as a Source Data file. In all figures, red corresponds to HIRE; yellow indicates the unadjusted analysis; brown represents SVA; purple refers to RefFreeEWAS; dark blue indicates EWASher; and light blue corresponds to ReFACTor. a ROC curves of HIRE and commonly used methods. HIRE has the largest area under the curve among all of the methods. b True cell-type-specific association pattern with disease status for 10,000 simulated CpG sites; columns correspond to cell types, and the rows represent the CpG sites. Dark cells correspond to risk-CpG sites, and grey cells are CpG sites not associated with the disease status. c Detected cell-type-specific association pattern with disease status by HIRE. Darkness represents Àlog 10 ðp À valueÞ d-i The p-value density plots for association with disease status in the simulation dataset for d HIRE, e unadjusted analysis, f SVA, g RefFreeEWAS, h EWASHer, and i ReFACTor. j-o The Q-Q plots for association with disease status for j HIRE, k unadjusted analysis, l SVA, m RefFreeEWAS, n EWASHer, and o ReFACTor   CpG sites were associated with ethnicity across all cell types (Supplementary Fig. 67) and identified 14, 52, 155, 15, 18, and 14 risk-CpG sites for gender in cell types 1-6, respectively (Fig. 3b). Gene set enrichment analysis showed that the genes that harbored risk-CpG sites for gender were significantly enriched in seven canonical pathways (Supplementary Table 7), of which the PID_CMYB_PATHWAY was ranked the highest. The transcription factor c − MYB in the PID_CMYB_PATHWAY enhances the progression of breast cancer 29 ; therefore, the different occurrence rates of breast cancers in men and women may be linked to the differences at the epigenome level. In comparison, only one pathway was found to be enriched with the genes that host the risk-CpG sites claimed by ReFACTor at the aggregate level (Supplementary Table 8). All of these observations highlight the importance of the finer-scale resolutions of HIRE.

Discussion
In reality, the phenotype may affect a risk-CpG site in some but not all of the cell types. HIRE can detect the cell-type-specific association pattern with each phenotype for EWAS. The identification of cell-type-specific risk-CpG sites will help epigenetic therapies to target the affected cell types in a more effective manner.
Statistically, instead of assuming fixed reference methylomes for all samples as the existing methods do 9,13,16 , HIRE allows each sample's cell-type-specific methylation profiles to depend on its phenotypes. As a result, HIRE correctly models the multiplicative effects of the cellular compositions on the observed methylation levels, whereas the existing approaches all misspecify the cellular compositions as additive effects (Methods). As a result, HIRE enables the detection of cell-type-specific risk-CpG sites that cannot be feasibly detected with existing state-of-the-art methods. As a byproduct, HIRE also improves the statistical power of association detection at the aggregate level relative to existing state-of-the-art methods. Computationally, the time complexity of one iteration by HIRE is O(nmKp + nK 3 ), which thus provides fast convergence when K is moderate. The statistical and computational advantages equip HIRE to be scaled up for large-cohort EWAS.
So far, in the EWAS community, no gold-standard exists for the comparison of various methods. Ideally, we would like to have epigenetic spike-in experiments in which purified cell types are isolated, CpG-sites are epigenetically edited on a per-cell-type basis, and cell types are finally mixed in predetermined proportions. Given such experiments, the underlying knowledge of which CpGs are differentially methylated in each cell type and the cell mixing proportions for each sample are known. However, biotechnologies for epigenetic editing, such as CRISPR-Cas, are still not mature at this stage, with many off-target modifications 30 . Therefore, most computational EWAS studies refer to numerical simulation studies rather than to experimental studies when evaluating the performance of their algorithms 12,13 . Here, we follow the example of previous comparative studies and design our simulation studies to serve as the computational counterpart of experimental spike-in studies. With the rapid advances in epigenetic editing, we hope the community can devote greater effort in the near future to the creation of a gold-standard dataset, such as those generated in the early years for gene expression microarray studies 31 .
The beta-values that represent methylation levels always lie between zero and one. As previous approaches to EWAS often assume normal distribution for the beta-values and show good performances in real applications 9,13 , in HIRE, we also assume that the beta-values follow a normal distribution. Consequently, the fitted methylation level may lie outside the range of [0, 1]. Nevertheless, we do in fact constrain the baseline methylation profiles μ jk s to the closed interval [0, 1] and force the cellular compositions p ki s to be non-negative and to add up to one: P K k¼1 p ki ¼ 1. As a result, because the phenotypes have no effect on most CpG sites, most observations, O ji s, have their means P K k¼1 μ jk p ki s in [0, 1]. In fact, for both the RA dataset and the GALA II dataset, more than 99.99% of the fitted methylation valuesÔ ji s based on HIRE estimates lie between zero and one. Therefore, the normal assumption fits the data reasonably well and does not have a large effect on the performance of HIRE.
One major issue for all of the cell-type deconvolution methods is that deconvolution cannot be achieved if the cellular compositions do not vary among samples. For example, assuming that the samples are mixtures of two cell types and p i = p for all of the samples, then the observed methylation profile O i equals u i1 p 1 + u i2 p 2 = (u i1 + p 2 C)p 1 + (u i2 − p 1 C)p 2 := e u i1 p 1 þ e u i2 p 2 for any constant C. As a result, u i1 and u i2 are not estimable. In our paper, we show mathematically that HIRE is identifiable under mild conditions in Theorem 1 and that condition (b) of Theorem 1 formulates the requirement for the variability of the cellular compositions (Methods). HIRE can accurately estimate cellular compositions of tissues with great cellular heterogeneity, such as blood. Although the mild conditions in Theorem 1 are easily met for real DNA methylation data, identification of both sufficient and necessary conditions for model identifiability is a theoretically interesting and challenging statistical problem that we will investigate in a future study.
HIRE requires a moderate sample size to obtain precise estimates because HIRE needs to learn (1 + 2K + qK)m + (K − 1)n parameters with a total of mn observed values. Our simulation studies show that HIRE performs very well at the aggregate level with 180 samples (Table 1). If the sample size drops below 150, say to 120, HIRE can still control the FPR well but begins to lose power (Supplementary Table 9). For small sample sizes, we have also developed a special case of HIRE by reparameterizing all σ 2 gk s as one single parameter σ 2 , and we found that such a variancestabilized approach can achieve even better inflation control (see Supplementary Figs. 71-76) and power comparable to HIRE (see Supplementary Table 10). Like the two datasets analyzed in the real application, a typical sample size for a current EWAS exceeds 500, thus guaranteeing a high TPR for HIRE. Given the decreasing cost of EWAS, we recommend that researchers collect at least 200 samples for their studies for association detection at the aggregate level and 600 samples for identification of cell-typespecific risk-CpG sites. A larger sample size can further boost the power.
With the popularity of EWAS, we believe that HIRE will be widely applied, and we hope that HIRE can motivate more researchers to mine out finer-scale results from EWAS.

Methods
Multiplicative effects of cellular composition on methylation. In this section, we illustrate that the effects of the cell-type composition are actually multiplicative. Let us assume that the beta-values that represent the methylation levels are observed across m CpG sites for n samples. As the measured sample comprises cells of various types, the observed beta-value is a weighted average of the mean methylation levels of distinct cell types, and the weights correspond to the proportions of each cell type. Let O ji denote the measurement at CpG site j for sample i. If we assume that there exist K cell types in all samples and that the mean methylation level for CpG site j in cell type k is μ jk , then where p ki is the proportion of cell type k in sample i with a natural constraint P K k¼1 p ki ¼ 1, and ϵ ji is a random error.
Let us consider a case-control EWAS. Without loss of generality, we assume that CpG site j is differentially methylated between cases and controls in cell type 1 with a mean shift δ j1 and that it is not differentially methylated in the remaining cell types. As a result, for case samples, If we then use Z i to indicate the case-control status of sample i, the observed methylation level becomes Therefore, the proportions of cell type 1-p 1i , i = 1, …, n-have multiplicative effects rather than additive effects on the mean difference between the case and control samples. The existing methods, which either estimate the cell type proportions explicitly or approximate them implicitly with surrogate variables, add the estimated proportions and the case-control indicator Z i as the covariates to the regression as follows: where b jk s are the regression coefficients. As a result, CpG site j is called differentially methylated on the basis of hypothesis testing for τ j = 0. In general, τ j in Eq. (4) is not equal to δ j1 in Eq. (3). Please see the Supplementary Notes for a numerical example. Moreover, testing for τ j = 0 loses the information regarding cell type in which CpG site j may be at risk. To account for the multiplicative effects, we propose the HIRE model that conserves the individual cell-type level information, which is introduced in the next section.
The HIRE model. HIRE uses a hierarchical model to closely follow the data generation process for the EWAS data. To begin, we assume that the baseline methylation level for CpG site j in cell type k is μ jk . For sample i with phenotypes x i = (x i1 , …, x iq ), the mean methylation value for CpG site j in cell type k is assumed to be μ jk þ P q '¼1 β jk' x i' . In other words, the phenotypes have linear effects where β jk' characterizes the influence of phenotype ' on CpG site j in cell type k. Let u ijk represent the signal from CpG site j in cell type k for sample i with x i . We assume that u ijk follows a normal distribution with mean μ jk þ P q '¼1 β jk' x i' and standard deviation σ jk , After u ijk s are generated for all of the K cell types, the observed methylation value O ji is sampled as follows: Collectively, O = {O ji : 1 ≤ j ≤ m, 1 ≤ i ≤ n} denote the observed data; u = {(u ij1 , …, u ijK ) T : 1 ≤ i ≤ n, 1 ≤ j ≤ m} are the missing data; and μ j = (μ j1 , …, μ jK ) T , B ðjÞ ¼ ðβ jk' Þ K q , σ 2 ϵj , the diagonal matrix Σ j ¼ diagðσ 2 j1 ; ; σ 2 jK Þ for j = 1, …, m, and p i = (p 1i , …, p Ki ) T for i = 1, …, n are the parameters. With Θ ¼ fp i ; μ j ; B ðjÞ ; Σ j ; σ 2 ϵ;j : 1 j m; 1 i ng, the complete data log-likelihood function, l c , can be expressed as follows: Accordingly, we develop a generalized expectation-maximization algorithm 32 to estimate the parameters. In the expectation-maximization algorithm, a good initialization can lead to faster convergence than random starts. We adopt the cellular composition estimations from the methylation matrix decomposition algorithm 16 with slight modifications as the initializations. The initial values for the baseline methylation profiles μ jk are accordingly estimated by simple linear regressions. As the number of risk-CpG sites is often small, all of the phenotype effects β jk' are set to zero at the beginning. For the standard deviations, the initial values are randomly sampled from inverse gamma distributions with small means. We choose the number of cell types K by using a variant of the penalized Bayesian information criterion (pBIC) 20 (see details in Supplementary Methods).
For each phenotype ', we can conduct the hypothesis test H 0 : β jk' ¼ 0 versus H 1 : β jk' ≠ 0 for any cell type k and any CpG site j. Combining Eqs. (5) and (6), we obtain the following equations: We can then take (O j1 , …, O jn ) as the response vector and concatenate 1 n , (p k1 , …, p kn ) (k = 2, …, K) and ðx 1' p k1 ; ; x n' p kn Þ ð' ¼ 1; ; q; k ¼ 1; ; KÞ to a n × (p + 1) · K design matrix in the linear regression. We plug in the estimated cellular compositionsp ki' and conduct the hypothesis test for β jk' ¼ 0 using the two-sided t-tests in the linear models. We claim that CpG site j has an association with phenotype ' at the aggregate level if phenotype ' affects CpG site j in at least one of the K cell types. Note that in the regression we incorporate the estimated cellular compositions into the linear model as multiplicative effects rather than additive effects.
More technical details of the method and the algorithm are available in the Supplementary Methods. Data simulation. We compared the performance of HIRE with five previous methods-unadjusted analysis, SVA, RefFreeEWAS, EWASHer, and ReFACTorin 18 simulation settings. We set the sample size n to 180, 300, and 600 and let the underlying cell type number K be 3, 5, and 7. For each pair of (n, K), we investigated the true null case and the true alternative case. As a result, we have in total 3 (the number of sample sizes) × 3 (the number of cell types) × 2 (the true null case and the true alternative case) = 18 simulation settings. For each setting, we considered 10,000 CpG sites and simultaneously accounted for the following factors.
Cell lineage. We first constructed the baseline methylation matrix μ = (μ jk ) m×K , in which each column corresponds to the baseline methylation levels of a cell type. To mimic the phenomenon in which cell types from the same lineage have similar methylation profiles, we assumed that K sim of the total K cell types were similar. Specifically, without loss of generality, we assumed that the first K sim cell types came from the same cell lineage and that the remaining K − K sim cell types are irrelevant to one another. We set K sim to 2, 2, and 3 for K = 3, 5, and 7, respectively. We generated μ jk for cell types k = 1, K sim + 1, …, K from the beta distribution beta(3, 6) on each CpG site j independently. For each of the remaining cell types k ′ = 2, …, K sim , we randomly selected 20% of the CpG sites and drew their μ jk′ s independently from beta (3,6); and for the remaining 80% of CpG sites, we let their μ jk′ be μ j1 plus a very small randomness, thus inducing the similarities among cell types 1 to K sim .
Discrete and continuous phenotypes. We further generated a discrete and a continuous phenotype x = (x 1 , x 2 ) T for each individual i (i = 1, …, n). We let the first n/3 individuals be the control samples with x i1 = 0 for i = 1, …, n/3 and the remaining 2n/3 individuals serve as cases with x i1 = 1 for i = n/3 + 1, …, n. The continuous phenotypes x 2 = (x 12 , …, x i2 , …, x n2 ) T were independently drawn from a Unif(20, 50) to act as age.
Phenotype effects with different magnitudes and directions. We then simulated the phenotype effect β jk' of each phenotype ' on CpG site j in cell type k. For the true null cases, all of the β jk' s are zero. For a true alternative setting, we set nonzero phenotype effects as follows.
For phenotype 1-the case/control status, we let it affect the first 10 CpG sites in all of the cell types: β jk1 ≠ 0 for j = 1, …, 10 and k = 1, …, K. We then assumed that the next 10 CpG sites were influenced by the disease status in the first K sim cell types which come from the same lineage but not the other cell types: β jk1 ≠ 0 (k = 1, …, K sim ) and β jk1 = 0 (k = K sim + 1, …, K) for any j = 11, …, 20. Furthermore, for cell type k ∈ {K sim + 1, …, K}, we let the disease status affect CpG sites j = 20 + 10(k − K sim − 1) + 1, …, 20 + 10(k − K sim ) only in cell type k. We generated the cell-typespecific effects of age in a similar fashion for CpG site loci 21 to 40 + 10(K − K sim ).
Finally, we generated the observed value O ji for CpG site j of sample i as follows: sample u ijk from N(μ jk + β jk1 x i1 + β jk2 x i2 , 0.01 2 ) for k = 1, …, K; and sample O ji from Nð P K k¼1 u ijk p ki ; 0:01 2 Þ. In case O ji lies outside the interval (0, 1), we truncate it to zero if O ji is lower than zero and to one if O ji is greater than one.
Semi-simulated dataset including samples with known cell mix proportions. The GEO dataset GSE110554 24 contains purified cell-type-specific methylation profiles for six cell types: neutrophils, monocytes, B cells, CD4+ T, CD8+ T, and NK. Moreover, GSE110554 includes mixed samples whose methylation signals were aggregated from the six cell types with predetermined cell mix proportions. Therefore, because of the known cell type and cellular proportion information, GSE110554 is an ideal dataset with which to test HIRE's performance.
In GSE110554, the number of mixed samples is much smaller than the typical size of an EWAS and, as discussed in the manuscript, HIRE usually requires hundreds of samples to obtain accurate and stable results. Therefore, to increase the sample size, we first generated a simulated methylation dataset with 600 samples using the purified methylation profiles. We focused on 10k CpG sites, including the 450 IDOL CpG sites, which were previously identified as the optimal library of CpG sites for estimation of leukocyte subtype proportions 24 , and another 9550 CpG sites whose methylation values across the purified cell types fell within the range of [0.2, 0.8] and had large standard deviations 11 . We then combined the 600 samples and six mixed samples (generated by method A) 24 available in GSE110554 to compose a semi-simulated dataset.
After applying HIRE to the semi-simulated data, we annotated the estimated cell types based on the methylation profiles from GSE110554. Supplementary Figure 69 shows the heatmap for the Pearson correlation matrix between inferred cell types and the underlying truth. The correlation signals on the diagonal are the strongest in each row. HIRE successfully recovers the six underlying cell types. We also compared the estimated cellular compositions with the underlying true proportions for the six mixed samples. Each panel in Supplementary Fig. 70 displays a scatter plot between the cellular proportion estimates and the true mix proportions for a given cell type; they all indicate that HIRE obtains good estimates for cellular compositions.
Cell type matching protocol. Assume that we have the reference methylation profiles for the H annotated cell types. We first denote the methylation profile for cell type h as ϕ h = (ϕ 1h , …, ϕ mh ). We aim to annotate μ k using the references. Following the previous study 33 , first, we calculate the cosine similarity, the Pearson correlation, and the Spearman correlation between μ k and ϕ h for each cell type h ∈ {1, …, H}. Notice that the three similarity measures lie between −1 and 1, and a high positive value indicates great similarity between two vectors. Second, for each similarity measure ' (' ¼ 1; 2; 3), we identify the cell type h ' that has the maximal degree of similarity with μ k . If at least two out of the three similarity measures identify the same reference cell typeh and their corresponding similarity values are greater than 0.5, then we annotate μ k with the reference cell typeh. Otherwise, μ k is believed to belong to a new cell type that is not included in the references. We repeat the above process for each methylation profile μ k estimated from HIRE.
Blood cell references. The two real data sets analyzed in our applications were obtained from whole blood. Therefore, we prepared the references from a whole blood methylation study 34 with GEO accession code GSE35069. The study included seven isolated blood cell subpopulations-CD4+ T cells, CD8+ T cells, CD14 + monocytes, CD19+ B cells, CD56+ NK cells, neutrophils, and eosinophils-for six individuals. Accordingly, we define the reference profile ϕ h for cell type h as the average methylation profile of these individuals, i.e., ϕ h :¼ 1 Data preprocessing. The RA dataset is publicly available in GEO with accession number GSE42861. The dataset measures the methylation levels of the whole blood. The methylation data have been normalized by Illumina's control probe scaling procedure (see Liu et al. 3 "Illumina 450K microarray data preprocessing" section for details). The dataset includes 689 samples, and the RA status, age, gender, smoking history, and batch information are available for each sample. We removed two samples GSM1051535 and GSM1051691 because their smoking information is missing. CpG sites with a high methylation mean (>0.8) and a low methylation mean (<0.2) were discarded 11,13 . We adjusted the data for batch effects using COMBAT 35 . The correction process was justified because we did not observe a high degree of co-linearity between the RA status and the batches (Supplementary Fig. 68). The 10,000 most variable CpG sites were kept. For the RA status, we denoted RA patients with 1 and the normal control subjects with 0; we represented men with 1 and women with 0; for the smoking history, we used (0, 0, 0) to refer to "never," (1, 0, 0) to "ex," (0, 1, 0) to "current," and (0, 0, 1) to "occasional" smokers. We downloaded the GALA II dataset from Gene Expression Omnibus (GEO) with accession number GSE77716. The dataset contains the whole-blood DNA methylation beta-values from 573 samples. The beta-values have been normalized by SWAN 36 and corrected for batch effects by COMBAT 35 . There are two types of covariates: gender and ethnicity. Ethnicity includes Mexican, Mixed Latino, Puerto Rican, and Other Latino. Out of the 573 samples, one sample "GSM2057284" has no gender information, so we removed it. As suggested by previous studies 11,13 , CpG sites with a mean methylation value of less than 0.2 or higher than 0.8 were filtered out. We selected the 10,000 most variable of the remaining CpG sites. For gender, we denoted men with 1 and women with 0. For the ethnicity variables, we used three dummy variables to represent the four ethnicity categories. In particular, (0, 0, 0), (1, 0, 0), (0, 1, 0), and (0, 0, 1) corresponded to Mexican, Mixed Latino, Puerto Rican, and Other Latino, respectively.
For ReFACTor and EWASHer, according to their rules, we first filtered out CpG sites that were consistently hypomethylated or consistently hypermethylated and then regressed out the known covariates. We finally used the residuals to perform their analysis. Note that in their software these steps are processed automatically. For RefFreeEWAS, SVA, and the unadjusted analysis, the phenotypes and the covariates were regarded as the fixed effects in the regression model. In detail, for ReFACTor, in both GALA II and RA datasets, the cell type number "K" was specified to be six, which was the same as in their paper 13 . For RefFreeEWAS, we fixed the dimensionality of latent space "d" at six in the real data. For SVA, we also fixed the number of surrogate variables to six.
Gene enrichment analysis was carried out on the Broad Institute website http:// software.broadinstitute.org/gsea/msigdb/annotate.jsp. The canonical pathways were selected as the basis gene sets, and only pathways with a false discovery rate of less than 0.05 were reported.
Identifiability of HIRE. Although the non-negative matrix factorization (NNMF) O = μP has been widely applied in cell type deconvolution 16 , where O is the observed methylation matrix, μ is the unknown methylation profile, and P is the unknown cellular compositions, model identifiability is rarely discussed. During the review period of our paper, Rahmani et al. 37 provided a setting under which the NMMF model is not identifiable.
Why then does NNMF always provide satisfactory cell type deconvolution results in real practice, and why can HIRE estimate all those parameters well? Here, we show mathematically that the HIRE model is identifiable under mild conditions that are easily met in reality.
Let us first introduce some notations and definitions. In the HIRE model, the whole parameter set is denoted by Θ :¼ fP i ; μ j ; B ðjÞ ' ; σ 2 jk ; σ 2 ϵj : 1 j m; 1 i n; 1 k K; 1 ' qg, where p i is the cellular composition vector of sample i, μ j is the baseline methylation vector of CpG site j, B ðjÞ ' is the phenotype ' effect vector on CpG site j, σ 2 jk is the cell-type-k noise variance on CpG site j, and σ 2 ϵj is the overall noise variance on CpG site j. The observed data in our study are the methylation matrix O = {O ij :1 ≤ i ≤ n, 1 ≤ j ≤ m} and the covariate matrix X ¼ ðx 1 ; ; x ' ; ; x q Þ, where x ' is the column vector that indicates phenotype-' for the n samples. The observed likelihood function We further define 1 K = (1, 1, …, 1) T as a K-dimension column vector with all entries being one, an n by K matrix J 1 as 1 n 1 T K , and an n by K matrix J x ' as x ' 1 T K for each 1 ' q. We use ⊙ to represent the entry-wise matrix product for two matrices M and N with the same dimension, i.e., ðM NÞ ij :¼ M ij N ij . Theorem 1. If (a) for each cell type k, there exists a CpG site r k such that B ðr k Þ ' ¼ 0 for any phenotype ' and μ r k k ¼ 1 while μ r k k′ ¼ 0 for k′ ≠ k, and (b) the cellular compositions P satisfies that rankððJ 1 P T ; J x 1 P T ; ; J x ' P T ; ; J x q P T ÞÞ ¼ ðq þ 1ÞK and rankðð1 n ; P T Þ ð1 n ; P T ÞÞ ¼ K þ Taking j = r k in Eq. (8), we have LHS ¼ P T i μ r k þ P q '¼1 x i' P T i B ðr k Þ ' ¼ P T i μ r k ¼ 0 þ P ki Á 1 þ 0 ¼ P ki and similarly RHS ¼P ki , so P ki ¼P ki , which holds for any i and k. Hence, we obtain P ¼ e P. Next, we rewrite Eq. (8) into a matrix form.

ARTICLE
Because the rank of A :¼ ðJ 1 P T ; J x 1 P T ; ; J x ' P T ; ; J x q P T Þ is (q + 1)K (full column rank), A has a left inverse A −1 . Multiplying Eq. (10) by A −1 from the left on both sides, we obtain μ j ¼ e μ j and B ðjÞ ' ¼ e B ðjÞ ' for 1 ' q. Therefore, we have μ ¼ e μ, B ¼ e B. In addition, because Eq. (9) holds for any i, we can also rewrite it into a matrix form. . . .
The left matrix is equal to ð1 n ; P T Þ ð1 n ; P T Þ which has a full column rank; therefore, it has a left inverse. Consequently, σ 2 ϵj ¼σ 2 ϵj and σ 2 jk ¼σ 2 jk . As a result, Θ ¼ e Θ, and we have proven the identifiability of HIRE. & Conditions (a) and (b) are easily met for DNA methylation data. Condition (a) requires that for each cell type k, there exists a CpG site that is not associated with any phenotype and is only methylated in cell type k but not methylated in any other cell type. Given the 450K CpG sites assayed by the microarray, we can expect that such CpG sites are not absent at all. Moreover, condition (a) can also be relaxed to the condition that for each cell type k, there exists a CpG site r k such that B ðr k Þ ' ¼ 0 for any phenotype ' and μ r k k ¼ 1 while μ r k k′ ¼ 0 for k′ ≠ k or there exists a CpG site r k such that B ðr k Þ ' ¼ 0 for any phenotype ' and μ r k k ¼ 0 while μ r k k′ ¼ 1 for k′ ≠ k. The proof follows in a similar manner.
For condition (b), intuitively, the rank requirement of ð1 n ; P T Þ ð1 n ; P T Þ asks the cellular compositions to vary across subjects, which guards against the case in which all the subjects have the same cellular compositions and hence no cell type deconvolution is possible; the rank requirement on ðJ 1 P T ; J x 1 P T ; ; J x ' P T ; ; J x q P T Þ is the same requirement as those in a standard linear regression, which requires that no collinearity exists among the covariates. Because the sample size n is much larger than the underlying cell type number K and the phenotype number q, the two rank requirements can commonly be satisfied in reality.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The RA whole blood methylation dataset is available in the Gene Expression Omnibus (GEO) with the accession number GSE42861. The GALA II whole blood methylation dataset can be downloaded from GEO with the accession number GSE77716. The accession number for the blood cell references is GSE35069. The purified methylation data and mixed samples used to generate the semisimulated dataset are taken from GSE110554.

Code availability
The software and detailed documentations are available on Bioconductor with the software HIREewas page [http://www.bioconductor.org/packages/release/bioc/html/ HIREewas.html].