  • Correspondence

Correcting for cell-type heterogeneity in epigenome-wide association studies: revisiting previous analyses

References

  1. Jaffe, A.E. & Irizarry, R.A. Genome Biol. 15, R31 (2014).

  2. Houseman, E.A. et al. BMC Bioinformatics 13, 86 (2012).

  3. Houseman, E.A., Molitor, J. & Marsit, C.J. Bioinformatics 30, 1431–1439 (2014).

  4. Leek, J.T. & Storey, J.D. PLoS Genet. 3, e161 (2007).

  5. Zou, J., Lippert, C., Heckerman, D., Aryee, M. & Listgarten, J. Nat. Methods 11, 309–311 (2014).

  6. Gagnon-Bartsch, J.A. & Speed, T.P. Biostatistics 13, 539–552 (2012).

  7. Teschendorff, A.E., Zhuang, J. & Widschwendter, M. Bioinformatics 27, 1496–1505 (2011).

  8. Rahmani, E. et al. Nat. Methods 13, 443–445 (2016).

  9. Liu, Y. et al. Nat. Biotechnol. 31, 142–147 (2013).

Acknowledgements

A.E.T. was supported by a Royal Society Newton Advanced Fellowship (164914), the National Natural Science Foundation of China (31571359) and the Chinese Academy of Sciences. S.B. was supported by the EU 7th Framework Program BLUEPRINT Project (282510). D.C.K. was supported by the US National Institutes of Health (1KL2TR000119) and Kansas IDeA Network of Biomedical Research Excellence (K-INBRE) Bioinformatics Core, supported in part by the National Institute of General Medical Science award P20GM103418.

Author information

Corresponding author

Correspondence to Andrew E Teschendorff.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 SVD analysis in a breast cancer-tissue EWAS.

Left panel: fraction of data variation (fracV, y-axis) explained by each of the top 35 principal components (x-axis) in the Illumina 450k dataset of 50 normal breast and 305 breast cancer samples¹. Right panel: heatmap of P-values for associations between each of the top 35 principal components and various factors, including normal/cancer (N/C) status and the proportions of cell subtypes: breast epithelial cells (BrEpi), adipocytes (Adip), B cells (B), natural killer cells (NK), CD4+ and CD8+ T cells (CD4T, CD8T), monocytes and neutrophils. Colors indicate the level of statistical significance, as shown, obtained from a multivariate model of each component against all factors together as covariates. Note that the top two components of variation correlate more strongly with N/C status than with cell-type composition. Only PC-3 correlates more strongly with cell-type composition (in this case, changes in adipose/breast epithelial cell content).
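As a rough illustration of the left-panel quantity, the sketch below computes fracV for each principal component by SVD of a feature-centered matrix and checks how closely the top component tracks a binary status label. It is a minimal sketch on synthetic data, not the authors' code: the function name `pca_fraction_of_variance`, the simulated matrix and the use of a plain correlation (in place of the multivariate model described in the caption) are assumptions made for illustration.

```python
import numpy as np


def pca_fraction_of_variance(data):
    """Return (fracV per component, per-sample weights of each component).

    data is a features x samples matrix (hypothetical layout for this sketch).
    """
    # Center each feature across samples before the decomposition.
    centered = data - data.mean(axis=1, keepdims=True)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    return variance / variance.sum(), vt


# Synthetic example: 200 features, 20 "normal" and 20 "cancer" samples,
# with a strong status effect added on top of unit-variance noise.
rng = np.random.default_rng(0)
status = np.repeat([0.0, 1.0], 20)
loadings = rng.normal(scale=3.0, size=200)
data = rng.normal(size=(200, 40)) + np.outer(loadings, status)

frac_v, sample_weights = pca_fraction_of_variance(data)

# Correlation between the sample weights of PC-1 and the status label.
pc1_status_corr = abs(np.corrcoef(sample_weights[0], status)[0, 1])
```

Here `frac_v` plays the role of the y-axis in the left panel, and `pc1_status_corr` is a simplified stand-in for one cell of the heatmap.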

Supplementary Figure 2 Modelling cell-type composition.

A) R² values for the linear model predicting flow-cytometric (FC) estimates of blood cell subtypes, as a function of the number of successive ReFACTor components included in the model (red data points and line), for the GALA2 dataset used in Rahmani et al.². ReFACTor was run with k=6 and numcomp=10, as in the Rahmani study. The horizontal brown line indicates the R² for the ReFACTor model that includes only components with significant likelihood ratio test (LRT) P-values (P<0.05), as indicated. The horizontal blue line denotes the performance of the reference-based constrained projection (CP) method of Houseman. The last panel shows the average performance over the blood cell subtypes. B) As A), but for the 18 whole blood samples of Koestler et al.³, running ReFACTor with k=6 and numcomp=6. For 12 of these samples the mixing proportions of 6 blood cell subtypes are exactly known, whereas for the other 6 there were FC estimates. C) Mean R² values for estimating cell-type composition in five different in silico mixture experiments, where 100 random in silico mixtures were generated by mixing together purified blood cell subtypes from a pool of 1573 such samples drawn from over 6 different cohorts. For each in silico experiment a total of 30 Monte Carlo runs were performed, and the P-values shown are from a two-tailed paired Wilcoxon rank sum test. The two methods being compared are ReFACTor (k=6) and CP.
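The curves in panels A and B report R² as successive components are added to a linear model. A minimal sketch of that calculation is given below, assuming ordinary least squares with an intercept; the function name `r2_by_num_components` and the random inputs are hypothetical and stand in for ReFACTor components and cell-proportion estimates. The toy example uses 6 samples and 5 components to show why such curves need care in small datasets: once the number of fitted parameters reaches the number of samples, the in-sample R² reaches 1 even for unrelated random inputs.

```python
import numpy as np


def r2_by_num_components(components, target):
    """In-sample R^2 of target regressed on the first m components, m = 1..K.

    components: samples x K matrix; target: one value per sample.
    """
    n_samples, n_comp = components.shape
    total_ss = np.sum((target - target.mean()) ** 2)
    r2 = []
    for m in range(1, n_comp + 1):
        # Design matrix: intercept plus the first m components.
        design = np.column_stack([np.ones(n_samples), components[:, :m]])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2.append(1.0 - np.sum(residual ** 2) / total_ss)
    return np.array(r2)


# Random "components" and a random "proportion", unrelated by construction.
rng = np.random.default_rng(1)
components = rng.normal(size=(6, 5))
target = rng.uniform(size=6)
r2_curve = r2_by_num_components(components, target)
```

Because the models are nested, `r2_curve` cannot decrease as components are added, which is why an out-of-sample check or a principled choice of the number of components is needed before interpreting high values.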

Supplementary Figure 3 Evaluation of cell-type composition estimates from ReFACTor, with and without the RMT estimate for the numcomp parameter, and comparison to the CP method.

Mean R² values (averaged over the R² values for each of 6 blood cell subtypes), as inferred with one of 3 different methods: ReFACTor (with k=6 and numcomp=6), ReFACTor (with k=6 and numcomp from RMT) and reference-based constrained projection (RefBased-CP), in the data from Koestler et al.³. A) For all 18 samples (12 experimentally generated mixtures + 6 whole blood samples). B) For the 12 experimentally generated mixtures. C) For the 6 whole blood samples with flow-cytometric estimates. ReFACTor was run using k=6 and default parameters as specified in the text. To avoid overfitting and upwardly biased R² values, the parameter numcomp was also estimated using RMT⁴, resulting in numcomp=2 for A), numcomp=1 for B) and numcomp=2 for C). Validation of these RMT estimates is given in Supplementary Fig. 4.

Supplementary Figure 4 Validation of RMT estimates in small datasets.

A) Density plots of folded eigenvalues from the application of RMT⁴ to the reduced data matrices obtained by applying ReFACTor to the data of Koestler et al. ReFACTor was run with k=6, selecting 500 informative features, so from left to right the matrices were of size 500 × 18, 500 × 12 and 500 × 6. Green curves show the RMT estimate; red curves show the observed eigenvalue distribution. The number of peaks (indicated by vertical red lines) above the theoretical RMT maximum is the number of significant components of variation. Thus, RMT estimates 2, 1 and 2 components, respectively. B) Validation of the RMT estimates using the Buja-Eyuboglu algorithm⁵, which compares the observed eigenvalue distribution (black) to the corresponding distribution obtained by randomly permuting the data matrix (green). The number of components with a fraction of variation (fracV) above the empirical null is the Buja-Eyuboglu estimate, which agrees with the analytical RMT estimate.
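The two ways of counting significant components described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' EvalReFACTor-RMT.R script: the analytical bound used here is the textbook Marchenko-Pastur upper edge for row-standardized data, the permutation null takes the maximum over a small number of permutations, and the function names and the synthetic 500 × 18 matrix (two planted components) are hypothetical.

```python
import numpy as np


def frac_v(data):
    """Fraction of variation per component of a features x samples matrix."""
    centered = data - data.mean(axis=1, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    return s ** 2 / np.sum(s ** 2)


def rmt_num_components(data):
    """Count eigenvalues above the Marchenko-Pastur upper edge (assumed form)."""
    centered = data - data.mean(axis=1, keepdims=True)
    z = centered / centered.std(axis=1, keepdims=True)  # unit-variance rows
    n_features, n_samples = z.shape
    eigenvalues = np.linalg.eigvalsh(z.T @ z / n_features)
    upper_edge = (1.0 + np.sqrt(n_samples / n_features)) ** 2
    return int(np.sum(eigenvalues > upper_edge))


def permutation_num_components(data, n_perm=20, seed=0):
    """Count leading components whose fracV exceeds a permutation null."""
    rng = np.random.default_rng(seed)
    observed = frac_v(data)
    null = np.empty((n_perm, observed.size))
    for b in range(n_perm):
        # Permute each feature independently across samples.
        permuted = np.array([rng.permutation(row) for row in data])
        null[b] = frac_v(permuted)
    threshold = null.max(axis=0)
    count = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:
            break
        count += 1
    return count


# Synthetic 500 x 18 matrix with two planted components plus unit noise.
rng = np.random.default_rng(0)
factor_a = np.repeat([-1.0, 1.0], 9)
factor_b = np.tile([-1.0, 1.0], 9)
loadings = rng.normal(scale=3.0, size=(500, 2))
demo = loadings @ np.vstack([factor_a, factor_b]) + rng.normal(size=(500, 18))

n_rmt = rmt_num_components(demo)
n_perm_based = permutation_num_components(demo)
```

On this toy matrix both counts should recover the two planted components. For very small sample sizes the analytical bound is only approximate, which is one reason for cross-checking it against a permutation null as in panel B.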

Supplementary Figure 5 Overfitting of ReFACTor in modelling cell-type composition.

Barplots show average R² values from ReFACTor using a training/test set strategy. Each value is the average over 50 runs, each run with a different training/test set partition. Each training set was made up of 50 randomly selected samples, with the remaining samples (n=28) making up the test set. ReFACTor was run on the training set using k=6 and numcomp=10, selecting the top 500 informative features. Multivariate regression with the 10 inferred components as predictors and the FACS cell-type estimates as response yielded an estimate of the R² value for each training set. For the test set, we inferred the 10 components by using the test data over the 500 informative features (as inferred from the training set) and the PCA projection matrix inferred in the training set. The R² value in the test set was obtained by first constructing the cell-type composition estimate as the linear combination of these 10 components, with the linear regression coefficients estimated from the training set, and then correlating this estimate with the FACS estimates. The figure also shows the average values over training and test sets combined.
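The training/test projection described above can be sketched as follows. This is a simplified illustration, assuming plain PCA on all features in place of ReFACTor's selection of 500 informative features; the 50/28 split follows the caption, while the function names, the simulated data and the choice of 3 components are hypothetical.

```python
import numpy as np


def fit_pca_regression(train, target, n_comp):
    """Learn feature means, a PCA projection and regression weights on training data.

    train: features x samples; target: one proportion per training sample.
    """
    feature_means = train.mean(axis=1, keepdims=True)
    u, _, _ = np.linalg.svd(train - feature_means, full_matrices=False)
    projection = u[:, :n_comp]                        # features x n_comp
    scores = (train - feature_means).T @ projection   # samples x n_comp
    design = np.column_stack([np.ones(scores.shape[0]), scores])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return feature_means, projection, coef


def predict(model, data):
    """Project new samples with the training projection and apply training weights."""
    feature_means, projection, coef = model
    scores = (data - feature_means).T @ projection
    return coef[0] + scores @ coef[1:]


def r_squared(predicted, observed):
    """Squared Pearson correlation between predicted and observed values."""
    return float(np.corrcoef(predicted, observed)[0, 1] ** 2)


# Synthetic data: one cell-type proportion drives 300 features in 78 samples.
rng = np.random.default_rng(0)
proportion = rng.uniform(0.2, 0.8, size=78)
effect = rng.normal(scale=0.5, size=300)
data = np.outer(effect, proportion) + rng.normal(scale=0.05, size=(300, 78))

# One random partition into 50 training and 28 test samples.
order = rng.permutation(78)
train_idx, test_idx = order[:50], order[50:]

model = fit_pca_regression(data[:, train_idx], proportion[train_idx], n_comp=3)
r2_train = r_squared(predict(model, data[:, train_idx]), proportion[train_idx])
r2_test = r_squared(predict(model, data[:, test_idx]), proportion[test_idx])
```

Repeating the partition many times and averaging `r2_train` and `r2_test` separately would give the kind of comparison shown in the barplots; a training value well above the test value is the signature of overfitting.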

Supplementary Figure 6 Evaluation of cell-type composition in the 18 whole blood samples of Koestler et al.

R² values for the linear model predicting cell proportion estimates of 6 blood cell subtypes, as a function of the number of successive ReFACTor components included in the model (red data points and line), running ReFACTor with k=6 and numcomp=10. The horizontal brown line is for the ReFACTor model that includes only components with significant likelihood ratio test (LRT) P-values (P<0.05), as indicated. The horizontal blue line denotes the performance of the reference-based CP method. The last panel shows the average performance over all blood cell subtypes.

Supplementary Figure 7 Evaluation of cell-type composition estimates in 12 reconstructed whole blood samples of Koestler et al.

R² values for the linear model predicting exact cell proportions of 6 blood cell subtypes for the 12 experimentally reconstructed mixtures, as a function of the number of successive ReFACTor components included in the model (red data points and line), running ReFACTor with k=6 and numcomp=6. The horizontal brown line is for the ReFACTor model that includes only components with significant likelihood ratio test (LRT) P-values (P<0.05), as indicated. The horizontal blue line denotes the performance of the reference-based CP method. The last panel shows the average performance over the blood cell subtypes.

Supplementary Figure 8 Evaluation of cell-type composition estimates in the 6 whole blood samples of Koestler et al.

R² values for the linear model predicting flow-cytometric estimates of 6 blood cell subtypes for the 6 whole blood samples with FACS estimates, as a function of the number of successive ReFACTor components included in the model (red data points and line), running ReFACTor with k=6 and numcomp=6. The horizontal brown line is for the ReFACTor model that includes only components with significant likelihood ratio test (LRT) P-values (P<0.05), as indicated. The horizontal blue line denotes the performance of the reference-based CP method. The last panel shows the average performance over all blood cell subtypes.

Supplementary Figure 9 SVD analysis on the 18 samples of Koestler et al.

A) Fraction of variation (fVAR) carried by each of the principal components from the application of PCA to the data of Koestler et al. (18 samples in total, with 12 representing experimental whole blood mixtures (MIX) of 6 purified blood cell subtypes and 6 representing true whole blood (WB) specimens for which flow-cytometric estimates were available). B) The weight in PC-2 correlates with the nature of the whole blood sample, i.e. whether it is an experimental mixture (MIX) or a true whole blood (WB) sample. The P-value is from a Wilcoxon rank sum test. C) Heatmap of R² values between the top 10 PCs and the known cell-type proportions of the samples and the nature of the sample (MIX/WB). Note how PC-2, which carries substantially more variation than lower-ranked PCs, is strongly correlated with the type of whole blood sample. This is a scenario where ReFACTor may break down, as it assumes that all variation in PCs is associated with changes in cell-type composition (see e.g. Supplementary Fig. 2B).

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–9, Supplementary Tables 1–2 and Supplementary Methods (PDF 1133 kb)

Supplementary Data 1

refbreast. (XLS 97 kb)

Supplementary Data 2

goldstandard-Breast-NC. (XLS 82 kb)

Supplementary Data 3

refblood. (XLS 82 kb)

Supplementary Software 1

DoCP.R: R-script to run the reference-based CP algorithm

Supplementary Software 2

DoReFACTor.R

Supplementary Software 3

EvalReFACTor-LRT.R

Supplementary Software 4

EvalReFACTor-RMT.R

Cite this article

Zheng, S., Beck, S., Jaffe, A. et al. Correcting for cell-type heterogeneity in epigenome-wide association studies: revisiting previous analyses. Nat Methods 14, 216–217 (2017). https://doi.org/10.1038/nmeth.4187

  • DOI: https://doi.org/10.1038/nmeth.4187
