  • Review Article

Machine learning for deciphering cell heterogeneity and gene regulation

Abstract

Epigenetics studies heritable and reversible modifications of DNA that allow cells to control gene expression throughout their development and in response to environmental conditions. In computational epigenomics, machine learning is applied to study various epigenetic mechanisms genome-wide. The aim is to expand our understanding of cell differentiation, that is, the specialization of cells, in health and disease. Thus far, most efforts have focused on understanding the functional encoding of the genome and on unraveling cell-type heterogeneity. Here, we provide an overview of state-of-the-art computational methods and their underlying statistical concepts, which range from matrix factorization and regularized linear regression to deep learning methods. We further show how the rise of single-cell technology leads to new computational challenges and creates opportunities to further our understanding of epigenetic regulation.

Fig. 1: Chromatin organization and epigenomic readouts.
Fig. 2: Deconvolution of complex DNA methylation data.
Fig. 3: Feature generation and modeling options for gene expression prediction using epigenomics data.
Fig. 4: Workflow of single-cell epigenomics methods.

Acknowledgements

This work was supported by the German Federal Ministry of Education and Research (BMBF) within the framework of the e:Med research and funding concept (grant Sys_CARE [01ZX1908A]) to M.L. and J.B.; VILLUM Young Investigator Grant 13154 and European Union’s Horizon 2020 project RepoTrial 777111 to J.B.; the Bavarian State Ministry of Science and the Arts as part of the Bavarian Research Institute for Digital Transformation (bidt) to O.L.; BMBF project de.NBI-epi (031L0101D) to M.S. and J.W.; DZHK (German Centre for Cardiovascular Research, 81Z0200101) and the Cardio-Pulmonary Institute (CPI) EXC 2026 to M.H.S.; and the Agency for Science, Technology and Research, Singapore (Enabling Data Analytics Technologies for Next-Generation Pathology from 3D Transcriptomics – SERC Data Analytics, 1727600056) to F.S.

Author information

Contributions

M.L. conceived this Review, and supervised and contributed to the writing of the manuscript. M.S., F.S. and O.L. collected information about the tools and databases and wrote the corresponding parts of the manuscript. M.H.S., J.B. and J.W. provided critical feedback on the manuscript. All authors discussed the results and contributed to the final manuscript.

Corresponding author

Correspondence to Markus List.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Computational Science thanks Lukas Chavez and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Fernando Chirigati was the primary editor on this Review and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Scherer, M., Schmidt, F., Lazareva, O. et al. Machine learning for deciphering cell heterogeneity and gene regulation. Nat Comput Sci 1, 183–191 (2021). https://doi.org/10.1038/s43588-021-00038-7

  • DOI: https://doi.org/10.1038/s43588-021-00038-7
