Fast intratumor heterogeneity inference from single-cell sequencing data

Kızılkale, Can; Rashidi Mehrabadi, Farid; Sadeqi Azer, Erfan; Pérez-Guijarro, Eva; Marie, Kerrie L.; Lee, Maxwell P.; Day, Chi-Ping; Merlino, Glenn; Ergün, Funda; Buluç, Aydın; Sahinalp, S. Cenk; Malikić, Salem

doi:10.1038/s43588-022-00298-x

Brief Communication
Published: 08 September 2022

Nature Computational Science volume 2, pages 577–583 (2022)

This article has been updated

Abstract

We introduce HUNTRESS, a computational method for mutational intratumor heterogeneity inference from noisy genotype matrices derived from single-cell sequencing data, the running time of which is linear with the number of cells and quadratic with the number of mutations. We prove that, under reasonable conditions, HUNTRESS computes the true progression history of a tumor with high probability. On simulated and real tumor sequencing data, HUNTRESS is demonstrated to be faster than available alternatives with comparable or better accuracy. Additionally, the progression histories of tumors inferred by HUNTRESS on real single-cell sequencing datasets agree with the best known evolution scenarios for the associated tumors.
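To make the setting concrete, the sketch below tests whether a binary genotype matrix (rows are cells, columns are mutations) is conflict-free, that is, whether it admits a perfect phylogeny. It uses the classical three-gametes rule and is not the HUNTRESS algorithm; the function name `has_conflict` and the toy matrices are illustrative assumptions. This pairwise column check takes time linear in the number of cells and quadratic in the number of mutations.

```python
from itertools import combinations


def has_conflict(matrix):
    """Return True if some pair of columns violates the three-gametes rule.

    matrix: list of rows (cells), each a list of 0/1 entries (mutations).
    Two columns conflict when the row patterns (1,1), (1,0) and (0,1)
    all occur, which rules out a perfect phylogeny.
    """
    if not matrix:
        return False
    n_mut = len(matrix[0])
    for a, b in combinations(range(n_mut), 2):
        seen = {(row[a], row[b]) for row in matrix}
        if {(1, 1), (1, 0), (0, 1)} <= seen:
            return True
    return False


# Hypothetical toy data: 4 cells, 3 mutations, consistent with a tree.
clean = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
]
noisy = [row[:] for row in clean]
noisy[1][0] = 0  # a single false negative (a true 1 observed as 0)

print(has_conflict(clean))  # False
print(has_conflict(noisy))  # True
```

A single flipped entry is enough to break the tree structure here, which is why inference methods of this kind must correct the noisy matrix rather than read the tree off it directly.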

**Fig. 1: Analysis of the HGSOC dataset¹⁶.**

**Fig. 2: Analysis of the AML dataset.**

Data availability

Our study makes use of two publicly available datasets introduced in previous studies¹⁶,¹⁷. For the leukemia dataset, single-cell and bulk sequencing data have been deposited at NCBI BioProject ID PRJNA648656 and SNP array data at NCBI GEO ID GSE156934. For the HGSOC dataset, the single-cell FASTQs have been deposited in the European Genome-phenome Archive under accession no. EGA: EGAS00001003190. The OV2295 datasets are available at Zenodo²⁰. Source data for the performance results for Figs. 1 and 2 are available in Supplementary Table 1. Simulated data generated and used in this study for obtaining the results shown in Extended Data Figs. 1–10 are available at Zenodo²¹. Source data are provided with this paper.

Code availability

The open-source implementation of HUNTRESS is available at Zenodo²².

Change history

16 September 2022
In the version of this article initially published, the email address shown for Salem Malikić was incorrect and has been amended in the HTML and PDF versions of the article.

References

1. Kuipers, J., Jahn, K. & Beerenwinkel, N. Advances in understanding tumour evolution through single-cell sequencing. Biochim. Biophys. Acta 1867, 127–138 (2017).
2. Schwartz, R. & Schäffer, A. A. The evolution of tumour phylogenetics: principles and practice. Nat. Rev. Genet. 18, 213–229 (2017).
3. Jahn, K., Kuipers, J. & Beerenwinkel, N. Tree inference for single-cell data. Genome Biol. 17, 86 (2016).
4. Ross, E. M. & Markowetz, F. OncoNEM: inferring tumor evolution from single-cell sequencing data. Genome Biol. 17, 69 (2016).
5. Zafar, H., Tzen, A., Navin, N., Chen, K. & Nakhleh, L. SiFit: inferring tumor trees from single-cell sequencing data under finite-sites models. Genome Biol. 18, 178 (2017).
6. Zafar, H., Navin, N., Chen, K. & Nakhleh, L. Siclonefit: Bayesian inference of population structure, genotype and phylogeny of tumor clones from single-cell genome sequencing data. Genome Res. 29, 1847–1859 (2019).
7. Wu, Y. Accurate and efficient cell lineage tree inference from noisy single cell data: the maximum likelihood perfect phylogeny approach. Bioinformatics 36, 742–750 (2020).
8. Malikic, S., Jahn, K., Kuipers, J., Sahinalp, S. C. & Beerenwinkel, N. Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data. Nat. Commun. 10, 2750 (2019).
9. Malikić, S., Mehrabadi, F. R., Azer, E. S., Ebrahimabadi, M. H. & Sahinalp, S. C. Studying the history of tumor evolution from single-cell sequencing data by exploring the space of binary matrices. J. Comput. Biol. 28, 857–879 (2021).
10. El-Kebir, M. SPhyR: tumor phylogeny estimation from single-cell sequencing data under loss and error. Bioinformatics 34, i671–i679 (2018).
11. Malikic, S. et al. PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data. Genome Res. 29, 1860–1877 (2019).
12. Edrisi, M., Zafar, H. & Nakhleh, L. in 19th International Workshop on Algorithms in Bioinformatics (WABI 2019), Vol. 143 of Leibniz International Proceedings in Informatics (LIPIcs) (eds Huber, K. T. & Gusfield, D.) 22:1–22:13 (National Science Foundation, 2019).
13. Sadeqi Azer, E. et al. PhISCS-BnB: a fast branch and bound algorithm for the perfect tumor phylogeny reconstruction problem. Bioinformatics 36, i169–i176 (2020).
14. Ciccolella, S. et al. gpps: an ILP-based approach for inferring cancer progression with mutation losses from single cell data. BMC Bioinformatics 21, 413 (2020).
15. Azer, E. S., Ebrahimabadi, M. H., Malikić, S., Khardon, R. & Sahinalp, S. C. Tumor phylogeny topology inference via deep learning. iScience 23, 101655 (2020).
16. Laks, E. et al. Clonal decomposition and DNA replication states defined by scaled single-cell genome sequencing. Cell 179, 1207–1221 (2019).
17. Morita, K. et al. Clonal evolution of acute myeloid leukemia revealed by high-throughput single-cell genomics. Nat. Commun. 11, 5327 (2020).
18. Singer, J., Kuipers, J., Jahn, K. & Beerenwinkel, N. Single-cell mutation identification via phylogenetic inference. Nat. Commun. 9, 5144 (2018).
19. Gusfield, D. Efficient algorithms for inferring evolutionary trees. Networks 21, 19–28 (1991).
20. McPherson, A. W. Clonal decomposition and DNA replication states defined by scaled single cell genome sequencing. Zenodo (2019); https://doi.org/10.5281/zenodo.3445364
21. Malikic, S., Mehrabadi, F. R. & Kizilkale, C. Fast intratumor heterogeneity inference from single-cell sequencing data (simulated data – extended data figures). Zenodo (2022); https://doi.org/10.5281/zenodo.6829082
22. Kizilkale, C., Buluc, A. & Rashidi, F. PASSIONLab/HUNTRESS: HUNTRESS. Zenodo (2022); https://doi.org/10.5281/zenodo.6803392

Acknowledgements

This work is supported in part by the Intramural Research Program of the National Institutes of Health, National Cancer Institute (to F.R.M., E.P.-G., K.L.M., M.P.L., C.-P.D., G.M., S.C.S. and S.M.) and utilized the computational resources of the NIH Biowulf high-performance computing cluster (http://hpc.nih.gov) and Gurobi (http://www.gurobi.com) to solve some optimization problems. Additionally, F.R.M. was supported in part by Indiana U. Grand Challenges Precision Health Initiative. C.K. and A.B. were supported by the Advanced Scientific Computing Research (ASCR) Program of the Department of Energy Office of Science under contract no. DE-AC02-05CH11231.

Author information

Erfan Sadeqi Azer
Present address: Google LLC, Sunnyvale, CA, USA
These authors contributed equally: Can Kızılkale, Farid Rashidi Mehrabadi.

Authors and Affiliations

Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA, USA
Can Kızılkale & Aydın Buluç
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Can Kızılkale & Aydın Buluç
Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
Farid Rashidi Mehrabadi, S. Cenk Sahinalp & Salem Malikić
Department of Computer Science, Indiana University, Bloomington, IN, USA
Farid Rashidi Mehrabadi, Erfan Sadeqi Azer & Funda Ergün
Laboratory of Cancer Biology and Genetics, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
Eva Pérez-Guijarro, Kerrie L. Marie, Maxwell P. Lee, Chi-Ping Day & Glenn Merlino

Contributions

S.C.S. and A.B. jointly initiated and supervised the project. The algorithmic approach was developed by C.K. HUNTRESS was implemented by C.K. and was tested by F.R.M. Theorem 1, related lemmas and their proofs are by S.C.S., S.M., F.E. and C.K. Experimental results on real data and external simulators are by F.R.M. The internal simulator was developed and the simulated data were generated by S.M. The experimental results on the internal simulator are by C.K. and F.R.M. The bulk of the paper was written by S.M., S.C.S., F.R.M., C.K. and F.E. with feedback from all co-authors. E.P.-G., K.L.M., M.P.L., E.S.A., C.-P.D. and G.M. contributed to the interpretation of the results and biological implications.

Corresponding authors

Correspondence to S. Cenk Sahinalp or Salem Malikić.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Yufeng Wu, Hamim Zafar and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Fernando Chirigati, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 A running time assessment of HUNTRESS on simulated data with no false positives, in comparison to ScisTree, SPhyR, and PhISCS-BnB, as well as its slower but more general variants, PhISCS-I, and PhISCS-B.

All values on the y-axis are on a log₁₀ scale. Here n, m and fn denote, respectively, the number of cells, the number of mutations and the false negative error rate in single-cell data. For each setting of n and m, we report the distribution of the running time for each tool over 10 distinct trees of tumor progression, with false negative error rates of 0.05, 0.1 and 0.2. Each tool was run with a time limit of 8 hours (cases that exceed the time limit are not included here). The corresponding accuracy measures for each setting are shown in Extended Data Fig. 2 (ancestor-descendant accuracy measure) and Extended Data Fig. 3 (different-lineages accuracy measure).

Extended Data Fig. 2 Comparison of ancestor-descendant (AD) accuracy measures for HUNTRESS, PhISCS-BnB, ScisTree and SPhyR on simulated data with no false positives.

Here n, m and fn denote, respectively, the number of cells, the number of mutations and the false negative error rate in single-cell sequencing. For each setting of n and m, we report the ancestor-descendant (AD) accuracy measure for each tool with respect to the ground truth. The experiments were performed over 10 distinct trees of tumor progression, using a false negative error rate of 0.05, 0.1 or 0.2. Each tool was run with a time limit of 8 hours (cases that exceed the time limit are not included here). Note that we have not included results of PhISCS-I and PhISCS-B, as their accuracy values are identical to those of PhISCS-BnB (on the instances on which they completed the task) because all three share the same underlying objective function and optimality guarantee.
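The caption does not define the AD measure, so the following is only a minimal sketch under one commonly used definition, assumed here: mutation a is an ancestor of mutation b when the cells carrying b form a non-empty proper subset of the cells carrying a, and AD accuracy is the fraction of ground-truth ancestor-descendant pairs that the inferred matrix preserves. The function names and toy matrices are hypothetical.

```python
def column_sets(matrix):
    """For each mutation (column), the set of cells (row indices) carrying it."""
    n_mut = len(matrix[0])
    return [{i for i, row in enumerate(matrix) if row[j] == 1} for j in range(n_mut)]


def ad_pairs(matrix):
    """Ordered pairs (a, b) where mutation a is a strict ancestor of mutation b."""
    cols = column_sets(matrix)
    return {
        (a, b)
        for a in range(len(cols))
        for b in range(len(cols))
        if cols[b] and cols[b] < cols[a]  # non-empty proper subset
    }


def ad_accuracy(truth, inferred):
    """Fraction of ground-truth AD pairs that are also AD pairs in `inferred`."""
    true_pairs = ad_pairs(truth)
    if not true_pairs:
        return 1.0  # assumption: nothing to recover counts as perfect
    return len(true_pairs & ad_pairs(inferred)) / len(true_pairs)


# Hypothetical conflict-free matrices: rows are cells, columns are mutations.
truth = [[1, 0, 0], [1, 1, 0], [1, 1, 0], [1, 0, 1]]
inferred = [[1, 0, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1]]

print(ad_accuracy(truth, inferred))  # 0.5
```

In the toy case the inferred matrix keeps mutation 0 as an ancestor of mutation 1 but loses its ancestry over mutation 2, so one of the two true pairs survives.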

Extended Data Fig. 3 Comparison of different-lineages (DL) accuracy measure distributions for HUNTRESS, PhISCS-BnB, ScisTree and SPhyR on simulated data with no false positives.

Here n, m and fn denote, respectively, the number of cells, the number of mutations and the false negative error rate for single-cell sequencing. For each setting of n and m, we report the different-lineages (DL) accuracy measure for each tool with respect to the ground truth. The experiments were performed over 10 distinct trees of tumor progression, with false negative error rates of 0.05, 0.1 and 0.2. Each tool was run with a time limit of 8 hours (cases that exceed the time limit are not included here). Note that we have not included results of PhISCS-I and PhISCS-B separately, as their accuracy values match those of PhISCS-BnB (on the instances on which they completed the task) because all three share the same underlying objective function and optimality guarantee.
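As with the AD measure, the DL measure is not defined in the caption. The sketch below assumes one common reading: two mutations lie in different lineages when both occur in at least one cell and no cell carries both, and DL accuracy is the fraction of ground-truth different-lineage pairs preserved in the inferred matrix. The names `dl_pairs` and `dl_accuracy` and the example matrices are illustrative assumptions.

```python
from itertools import combinations


def dl_pairs(matrix):
    """Unordered mutation pairs (a, b), a < b, with disjoint non-empty cell sets."""
    n_mut = len(matrix[0])
    cols = [{i for i, row in enumerate(matrix) if row[j] == 1} for j in range(n_mut)]
    return {
        (a, b)
        for a, b in combinations(range(n_mut), 2)
        if cols[a] and cols[b] and cols[a].isdisjoint(cols[b])
    }


def dl_accuracy(truth, inferred):
    """Fraction of ground-truth DL pairs that are also DL pairs in `inferred`."""
    true_pairs = dl_pairs(truth)
    if not true_pairs:
        return 1.0  # assumption: nothing to recover counts as perfect
    return len(true_pairs & dl_pairs(inferred)) / len(true_pairs)


# Hypothetical conflict-free matrices: 4 cells, 4 mutations.
truth = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 1],
]
# Mutation 3 is misplaced below mutation 1 instead of below mutation 2.
inferred = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]

print(dl_accuracy(truth, inferred))  # 0.5
```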

Extended Data Fig. 4 Ancestor-Descendant (AD) accuracy measure distributions for HUNTRESS, ScisTree and SPhyR on simulated data with false positives, false negatives and missing entries.

Here n, m, fn, fp and na denote, respectively, the number of cells, the number of mutations, the false negative rate, the false positive rate and the missing entry rate in single-cell sequencing data. For each setting we report the distribution for each tool over 10 distinct trees of tumor progression. Each tool was allowed to run with a time limit of 48 hours (cases that exceed the time limit are not included here). Note that each (violin) plot shows the maximum, center and minimum value of the data depicted, together with the probability density.
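The role of the three noise parameters can be illustrated with a small sketch that corrupts a ground-truth genotype matrix. This is not the simulator used in the study: the order in which noise types are applied, the function name `add_noise`, and the use of `3` to mark missing entries are all assumptions made for illustration.

```python
import random

MISSING = 3  # assumed marker for a missing entry


def add_noise(matrix, fn, fp, na, seed=0):
    """Return a noisy copy of a 0/1 genotype matrix.

    Each entry is first dropped to MISSING with probability `na`;
    otherwise a 1 flips to 0 with probability `fn` (false negative)
    and a 0 flips to 1 with probability `fp` (false positive).
    """
    rng = random.Random(seed)
    noisy = []
    for row in matrix:
        new_row = []
        for value in row:
            if rng.random() < na:
                new_row.append(MISSING)
            elif value == 1 and rng.random() < fn:
                new_row.append(0)
            elif value == 0 and rng.random() < fp:
                new_row.append(1)
            else:
                new_row.append(value)
        noisy.append(new_row)
    return noisy


truth = [[1, 0, 0], [1, 1, 0], [1, 1, 0], [1, 0, 1]]
observed = add_noise(truth, fn=0.2, fp=0.001, na=0.05, seed=42)
for row in observed:
    print(row)
```

Fixing the seed keeps a simulated instance reproducible, which matters when several tools are compared on the same noisy matrix.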

Extended Data Fig. 5 Different Lineages (DL) accuracy measure distributions for HUNTRESS, ScisTree and SPhyR on simulated data with false positives, false negatives and missing entries.

Here n, m, fn, fp and na denote, respectively, the number of cells, the number of mutations, the false negative rate, the false positive rate and the missing entry rate in single-cell sequencing data. For each setting we report the distribution for each tool over 10 distinct trees of tumor progression. Each tool was allowed to run with a time limit of 48 hours (cases that exceed the time limit are not included here). Note that each (violin) plot shows the maximum, center and minimum value of the data depicted, together with the probability density.

Extended Data Fig. 6 Running time distributions for HUNTRESS, ScisTree and SPhyR on simulated data with false positives, false negatives and missing entries.

Here n, m, fn, fp and na denote, respectively, the number of cells, the number of mutations, the false negative rate, the false positive rate and the missing entry rate in single-cell sequencing data. For each setting we report the distribution for each tool over 10 distinct trees of tumor progression. For any given task each tool was allowed to run with a time limit of 48 hours (cases that exceed the time limit are not included here). Note that each (violin) plot shows the maximum, center and minimum value of the data depicted, together with the probability density.

Extended Data Fig. 7 Ancestor-Descendant (AD) and Different Lineages (DL) accuracy measures as well as running time distributions for HUNTRESS, ScisTree and SPhyR on simulated datasets with doublets.

Here n=1000, m=300, fn=0.2 or 0.05, fp=0.001, na=0.05, and the doublet rate is set to 0.03. Each distribution is over 10 distinct trees of tumor progression. For any given task each tool was allowed to run with a time limit of 48 hours. Note that each (violin) plot shows the maximum, center and minimum value of the data depicted, together with the probability density.
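A doublet arises when two cells are sequenced as one, so the observed genotype mixes both. The sketch below models this, as an assumption, by replacing a cell's row with the element-wise union of its genotype and that of a randomly chosen partner cell at the given doublet rate; it is not the procedure used to generate the data in this figure, and `add_doublets` is a hypothetical name.

```python
import random


def add_doublets(matrix, rate, seed=0):
    """Return a copy of a 0/1 genotype matrix with simulated doublets.

    With probability `rate`, a row is replaced by the element-wise OR
    of itself and a randomly chosen row (the assumed partner cell).
    """
    rng = random.Random(seed)
    out = []
    for row in matrix:
        if rng.random() < rate:
            partner = rng.choice(matrix)
            out.append([a | b for a, b in zip(row, partner)])
        else:
            out.append(list(row))
    return out


truth = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 1, 1]]
with_doublets = add_doublets(truth, rate=0.03, seed=7)
print(with_doublets)
```

Merging cells from different lineages produces rows that carry mutations from both branches, which can introduce conflicts even when the error rates are otherwise low.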

Extended Data Fig. 8 Ancestor-Descendant (AD) and Different Lineages (DL) accuracy measures, as well as the running time distributions for HUNTRESS on large simulated datasets.

In these simulations we set n=5000, m=500, fn=0.05 or 0.2, fp=0.001 and na=0.05. For each setting, the distributions are reported over 10 distinct trees of tumor progression. Note that because ScisTree could not finish any of the tasks within the time limit of 48 hours and SPhyR failed to generate any output, they are not presented in the figure. Note that each (violin) plot shows the maximum, center and minimum value of the data depicted, together with the probability density.

Extended Data Fig. 9 Ancestor-Descendant (AD) and Different Lineages (DL) accuracy measures as well as running time distributions for HUNTRESS, ScisTree and SPhyR on simulated datasets with a high(er) false positive rate of fp=0.003.

Here n=1000, m=300, fn=0.2 or 0.05, and na=0.05. Each distribution is over 10 distinct trees of tumor progression. For any given task each tool was allowed to run with a time limit of 48 hours. Note that each (violin) plot shows the maximum, center and minimum value of the data depicted, together with the probability density.

Extended Data Fig. 10 Running time, ancestor-descendant and different lineages accuracy measure distributions for HUNTRESS and SPhyR on simulations with parameters similar to those observed in the AML dataset (the Tapestri platform).

Here n=5000, m=50, fn=0.2 or 0.05, fp=0.01 or 0.003 and na=0.1. All simulated data used in this figure were generated by the simulator developed for OncoNEM. For any given task each tool was allowed to run with a time limit of 48 hours. Note that each (box) plot shows the maximum, center and minimum value, as well as the median half of the data depicted.

Supplementary information

Supplementary Information

Supplementary Figs. 1 and 2, Tables 1–7, Algorithms 1–5, proofs and discussions.

Source data

Source Data Extended Data Figs. 1–10

Statistical source data.

About this article

Cite this article

Kızılkale, C., Rashidi Mehrabadi, F., Sadeqi Azer, E. et al. Fast intratumor heterogeneity inference from single-cell sequencing data. Nat Comput Sci 2, 577–583 (2022). https://doi.org/10.1038/s43588-022-00298-x

Received: 09 June 2021
Accepted: 14 July 2022
Published: 08 September 2022
Issue Date: September 2022
DOI: https://doi.org/10.1038/s43588-022-00298-x
