Haplotype-aware analysis of somatic copy number variations from single-cell transcriptomes

Abstract

Genome instability and aberrant alterations of transcriptional programs both play important roles in cancer. Single-cell RNA sequencing (scRNA-seq) has the potential to investigate both genetic and nongenetic sources of tumor heterogeneity in a single assay. Here we present a computational method, Numbat, that integrates haplotype information obtained from population-based phasing with allele and expression signals to enhance detection of copy number variations from scRNA-seq. Numbat exploits the evolutionary relationships between subclones to iteratively infer single-cell copy number profiles and tumor clonal phylogeny. Analysis of 22 tumor samples, including multiple myeloma, gastric, breast and thyroid cancers, shows that Numbat can reconstruct the tumor copy number profile and precisely identify malignant cells in the tumor microenvironment. We identify genetic subpopulations with transcriptional signatures relevant to tumor progression and therapy resistance. Numbat requires neither sample-matched DNA data nor a priori genotyping, and is applicable to a wide range of experimental settings and cancer types.

Fig. 1: Population-based haplotype phasing enables sensitive detection of subclonal allelic imbalances in single-cell transcriptomes.
Fig. 2: Numbat achieves accurate copy number inference via joint evaluation of gene expression, allele fraction and previous haplotype phasing information.
Fig. 3: Iterative strategy to identify tumor subclones.
Fig. 4: Numbat reveals additional complexity in tumor subclones through allele-specific copy number analysis.
Fig. 5: Tracking clonal evolution of a therapy-resistant MM using Numbat.

Data availability

The scRNA-seq and WGS validation data from the WASHU MM study can be accessed through SRA (PRJNA694128). The scRNA-seq data from the MDA CopyKAT study can be accessed through GEO (GSE148673) and SRA (PRJNA625321). The NCI-N87 scDNA-seq and scRNA-seq datasets are available on GEO (GSE142750) and SRA (PRJNA498809). The HCA collection of reference expression profiles can be obtained from Synapse under ID syn21041850. The 1000G phasing panel can be downloaded from the IGSR FTP site (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release). The TOPMed phasing panel can be accessed through the TOPMed Imputation Server (https://imputation.biodatacatalyst.nhlbi.nih.gov/).

Code availability

The Numbat algorithm is available at https://github.com/kharchenkolab/numbat. The analysis scripts and notebooks used to reproduce results included in the paper are available at https://github.com/kharchenkolab/NumbatAnalysis.

Acknowledgements

P.V.K., R.S. and T.G. were supported by Synergy grant no. 85629 (KILL-OR-DIFFERENTIATE) from the European Research Council. P.-R.L. was supported by NIH grant no. DP2 ES030554, a Burroughs Wellcome Fund Career Award at the Scientific Interfaces and the Next Generation Fund at the Broad Institute of MIT and Harvard.

Author information

Contributions

P.V.K. and T.G. formulated the study and the overall approach. A.K. carried out proof-of-concept tests of population-based phasing. T.G. developed the detailed algorithms with advice from P.V.K., R.S., H.S. and P.-R.L. T.G. implemented the Numbat package with help from E.B. T.G. and P.V.K. drafted the manuscript. All authors provided suggestions and corrections on the manuscript text.

Corresponding author

Correspondence to Peter V. Kharchenko.

Ethics declarations

Competing interests

P.V.K. is an employee of Altos Labs and serves on the Scientific Advisory Board to Celsius Therapeutics, Inc. and Biomage, Inc. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Woong-Yang Park and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Haplotype-aware Hidden Markov models.

a, Phase switch probability as a function of genetic distance, estimated from alleles phased from LoH regions in TNBC4. Genetic distance is measured in centimorgans (cM). Error bars represent 95% CIs derived from a binomial test; the center of each error bar is the observed fraction of phase switches. b, Schematic of the conventional and haplotype-aware allele HMMs. \(t\), copy number state transition probability. \(p_s\), phase transition probability. c, Schematic of the Numbat joint HMM. Only three copy number states (neutral, deletion, amplification) are included for illustrative purposes.
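
As a rough illustration of panels a and b, the sketch below shows one way a phase switch probability that grows with genetic distance could enter the transition matrix of a haplotype-aware allele HMM. The functional form of `phase_switch_prob` (saturating at 0.5, with a hypothetical rate `nu` and lower bound `floor`) and the three-state layout are assumptions made for illustration; they are not taken from the Numbat implementation.

```python
import math

def phase_switch_prob(d_cm, nu=1.0, floor=1e-10):
    """Assumed form: probability that population-based phasing flips between
    two adjacent SNPs separated by d_cm centimorgans. Saturates at 0.5."""
    p = (1.0 - math.exp(-2.0 * nu * d_cm)) / 2.0
    return max(p, floor)

def transition_matrix(t, ps):
    """Transition matrix between two adjacent SNPs.

    States (illustrative): 0 = neutral,
                           1 = imbalance, major haplotype phased 'up',
                           2 = imbalance, major haplotype phased 'down'.
    t  : copy number state transition probability
    ps : phase switch probability between the two SNPs
    """
    return [
        [1 - t, t / 2, t / 2],                   # leave neutral into either phase
        [t, (1 - t) * (1 - ps), (1 - t) * ps],   # stay in CNV, phase kept or flipped
        [t, (1 - t) * ps, (1 - t) * (1 - ps)],
    ]

if __name__ == "__main__":
    for d in (0.001, 0.1, 1.0):
        ps = phase_switch_prob(d)
        print(d, round(ps, 4), transition_matrix(1e-5, ps)[1])
```

Under these assumptions, closely spaced SNPs almost never flip phase, so long runs of consistently phased alleles support an imbalance call, while distant SNPs approach a coin flip and contribute little haplotype information.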

Extended Data Fig. 2 Probabilistic model of gene expression and allele counts from transcriptome sequencing experiments.

\(c_m\), number of maternal chromosome copies. \(c_p\), number of paternal chromosome copies. \(\lambda_i\), observed gene expression magnitude of gene \(i\). \(\lambda_i^\ast\), reference gene expression magnitude of gene \(i\). \(\mu\) and \(\sigma^2\), global bias and variance in gene expression. \(\pi_j\), fraction of paternal alleles of SNP \(j\). \(\gamma\), global inverse overdispersion of allele-specific detection. \(l\), library size. \(m_j\), total allele count of SNP \(j\). \(X_i\), observed molecule counts for gene \(i\). \(Y_j\), observed paternal allele count for SNP \(j\).
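
To make the roles of these symbols concrete, here is a minimal sketch of how the copy numbers could map to an expected paternal allele fraction and an expected expression fold-change, together with an allele count log-likelihood. It assumes dosage-proportional expression relative to a diploid reference and a beta-binomial distribution for the paternal allele count (suggested by the overdispersion parameter in the caption, but not spelled out in it). The function names are hypothetical and the expression noise terms (bias and variance) are left out.

```python
import math

def paternal_fraction(c_m, c_p):
    """Expected fraction of paternal alleles for c_m maternal and c_p paternal copies."""
    return c_p / (c_m + c_p)

def expression_fold_change(c_m, c_p, ploidy=2):
    """Expected expression fold-change versus a diploid reference
    (assumes expression scales linearly with total copy number)."""
    return (c_m + c_p) / ploidy

def log_betabinom(y, m, pi, gamma):
    """log P(Y = y | m, pi, gamma) for a beta-binomial with
    alpha = pi * gamma and beta = (1 - pi) * gamma. Requires 0 < pi < 1.
    Larger gamma means less overdispersion (closer to a plain binomial)."""
    a, b = pi * gamma, (1.0 - pi) * gamma
    log_choose = math.lgamma(m + 1) - math.lgamma(y + 1) - math.lgamma(m - y + 1)
    log_num = math.lgamma(y + a) + math.lgamma(m - y + b) - math.lgamma(m + a + b)
    log_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_choose + log_num - log_den

if __name__ == "__main__":
    # Single-copy gain of the paternal chromosome: c_m = 1, c_p = 2.
    pi = paternal_fraction(1, 2)
    print("paternal fraction:", pi)
    print("fold-change:", expression_fold_change(1, 2))
    # Compare how well the gain and the neutral state explain 7 paternal reads out of 10.
    print("gain:   ", log_betabinom(7, 10, pi, gamma=20))
    print("neutral:", log_betabinom(7, 10, 0.5, gamma=20))
```

States with a lost haplotype (expected fraction of exactly 0 or 1) would need special handling, for example a small error rate, which this sketch does not attempt.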

Extended Data Fig. 3 WGS validation of Numbat CNV calls from scRNA-seq data.

For each sample, the DNA profile (top) is juxtaposed with the copy number profile inferred by the Numbat joint HMM (bottom). Gray vertical bars represent centromeres and gap regions. logR, log coverage ratio. BAF, B-allele frequency. logFC, log expression fold-change. pHF, paternal haplotype frequency. BAMP, balanced amplification.

Extended Data Fig. 4 Tumor versus normal cell classification accuracy of Numbat joint model, Numbat expression-only model, and CopyKAT.

Each dot represents a distinct sample (TNBC, n = 5; ATC, n = 4; MM, n = 8). Center line, mean. ATC5 was excluded from the benchmark due to lack of clear expression of tumor marker KRT8.

Extended Data Fig. 5 Numbat reliably distinguishes tumor and normal cells (TNBC series).

The aneuploidy probability is shown as a color gradient (red: high, blue: low). For each sample (row), the series of figures (columns) respectively show the aneuploidy probabilities by expression evidence, those by allele evidence, those by combined evidence, CopyKAT prediction (binary 0 or 1), and marker gene expression in a t-SNE embedding of gene expression profiles.
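
The "combined evidence" column can be read as merging two per-cell evidence sources. The sketch below shows one simple way to do that, assuming the expression and allele signals are conditionally independent given the cell state, so that their log-likelihood ratios (aneuploid versus diploid) add. The function name, the inputs and the default prior are hypothetical and are not taken from the Numbat package.

```python
import math

def posterior_aneuploid(llr_expr, llr_allele, prior=0.5):
    """Posterior probability that a cell is aneuploid.

    llr_expr, llr_allele : log-likelihood ratios (aneuploid vs. diploid)
                           from expression and allele evidence
    prior                : prior probability of aneuploidy (assumed value)
    """
    logit = math.log(prior / (1.0 - prior)) + llr_expr + llr_allele
    # Numerically stable logistic function.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)

if __name__ == "__main__":
    # Weak expression evidence alone versus the same evidence backed by alleles.
    print(posterior_aneuploid(1.0, 0.0))
    print(posterior_aneuploid(1.0, 4.0))
```

With this kind of combination, a cell whose expression signal is ambiguous can still be classified confidently when the allele signal is strong, and vice versa.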

Extended Data Fig. 6 Numbat reliably distinguishes tumor and normal cells (ATC series).

The aneuploidy probability is shown as a color gradient (red: high, blue: low). For each sample (row), the series of figures (columns) respectively show the aneuploidy probabilities by expression evidence, those by allele evidence, those by combined evidence, CopyKAT prediction (binary 0 or 1), and marker gene expression in a t-SNE embedding of gene expression profiles.

Extended Data Fig. 7 Numbat reliably distinguishes tumor and normal cells (MM series).

The aneuploidy probability is shown as a color gradient (red: high, blue: low). For each sample (row), the series of figures (columns) respectively show the aneuploidy probabilities by expression evidence, those by allele evidence, those by combined evidence, CopyKAT prediction (binary 0 or 1), and marker gene expression in a t-SNE embedding of gene expression profiles.

Extended Data Fig. 8 CNV detection performance as a function of tumor cell fraction.

At each tumor cell fraction, tumor cells were subsampled and mixed with randomly sampled normal cells at the corresponding proportion. Precision, recall and F1 scores were calculated based on the detected segments from scRNA-seq data and the ground truth copy number profiles (from WGS) in 5 multiple myeloma samples. For Numbat, two methods are compared: pseudobulk joint HMM (Numbat-HMM) and iterative optimization (Numbat-iterative) with no minimum pseudobulk size limit. a, Performance for all event types (amplification, deletion, and CNLoH). b, Performance for amplifications. c, Performance for deletions.
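
The benchmark design described above can be sketched as follows: mix tumor and normal cells at a target fraction, then score called segments against ground truth segments. The caption does not say how segment agreement was scored, so scoring by base-pair overlap on a single chromosome is an assumption here, and all function names are hypothetical.

```python
import random

def mix_cells(tumor, normal, tumor_fraction, n_total, seed=0):
    """Subsample tumor cells and mix them with randomly sampled normal cells."""
    rng = random.Random(seed)
    n_tumor = round(tumor_fraction * n_total)
    return rng.sample(tumor, n_tumor) + rng.sample(normal, n_total - n_tumor)

def total_bp(segs):
    """Total length of a list of half-open (start, end) segments."""
    return sum(end - start for start, end in segs)

def overlap_bp(a, b):
    """Base pairs shared by two segment lists (each list internally non-overlapping)."""
    return sum(max(0, min(e1, e2) - max(s1, s2)) for s1, e1 in a for s2, e2 in b)

def precision_recall_f1(called, truth):
    """Overlap-based precision, recall and F1 for one chromosome and one event type."""
    tp = overlap_bp(called, truth)
    precision = tp / total_bp(called) if called else 0.0
    recall = tp / total_bp(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    tumor = [f"T{i}" for i in range(100)]
    normal = [f"N{i}" for i in range(100)]
    cells = mix_cells(tumor, normal, tumor_fraction=0.3, n_total=50)
    print(sum(c.startswith("T") for c in cells), "tumor cells of", len(cells))

    called = [(0, 100), (200, 300)]
    truth = [(50, 150), (200, 250)]
    print(precision_recall_f1(called, truth))
```

Repeating the scoring per event type (amplification, deletion, CNLoH) would give the breakdown shown in panels a to c.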

Extended Data Fig. 9 Numbat analysis of gastric cell line (NCI-N87) scRNA-seq data and validation by scDNA-seq.

a, Single-cell copy number landscape and subclonal structure reconstructed from scDNA-seq data. Gray vertical bars represent gap regions. A rooted hierarchical clustering tree is shown on the left. Three subclones were defined by cutting the tree with k = 3. Red asterisks denote salient subclonal events. b, Single-cell CNV landscape and subclonal structure inferred from the paired scRNA-seq data by Numbat. The original prediction was composed of four subclones. The uppermost two clones were merged and denoted as the ‘major’ clone. Red asterisks denote validated subclonal events. c, Subclone-specific copy number profiles. For each subclone, the top track shows CNV calls made by the clone-specific Numbat HMM; the bottom track shows the DNA copy number profile of a representative cell from that subclone. Gray vertical bars represent gap regions. d, Numbat recapitulates clonal fractions measured by scDNA-seq. e, Stability and accuracy of Numbat CNV calls for each subclone with respect to parameter variations. F1 scores were computed by comparing DNA profiles for each subclone with the best-matching subclone CNV profiles predicted by Numbat. Circles denote F1 scores from initialization with a random tree. Red triangles mark default parameter values.

Extended Data Fig. 10 Single-cell copy number profile and phylogeny reconstructed by Numbat (TNBC and ATC).

Branch lengths correspond to the number of CNV events. Blue dashed line separates predicted tumor and normal cells. Confident subclones are highlighted and marked by red dashed rectangles. The vertical bar on the left of each panel shows cell type ground truth. In TNBC5 and ATC2, the second vertical bar on the left of the panel shows variant allele frequency of a clone-associated mitochondrial mutation. For ATC2, results from the subsampled dataset (including aneuploid cells and 50 randomly sampled normal cells) are shown. In ATC5, some tumor cells were likely mis-annotated as normal in the original annotation.

Supplementary information

Supplementary Information

Supplementary Figs. 1–17, Tables 1–3 and Methods.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gao, T., Soldatov, R., Sarkar, H. et al. Haplotype-aware analysis of somatic copy number variations from single-cell transcriptomes. Nat Biotechnol (2022). https://doi.org/10.1038/s41587-022-01468-y

  • DOI: https://doi.org/10.1038/s41587-022-01468-y
