Abstract
Genome instability and aberrant alterations of transcriptional programs both play important roles in cancer. Single-cell RNA sequencing (scRNA-seq) has the potential to investigate both genetic and nongenetic sources of tumor heterogeneity in a single assay. Here we present a computational method, Numbat, that integrates haplotype information obtained from population-based phasing with allele and expression signals to enhance detection of copy number variations from scRNA-seq. Numbat exploits the evolutionary relationships between subclones to iteratively infer single-cell copy number profiles and tumor clonal phylogeny. Analysis of 22 tumor samples, including multiple myeloma, gastric, breast and thyroid cancers, shows that Numbat can reconstruct the tumor copy number profile and precisely identify malignant cells in the tumor microenvironment. We identify genetic subpopulations with transcriptional signatures relevant to tumor progression and therapy resistance. Numbat requires neither sample-matched DNA data nor a priori genotyping, and is applicable to a wide range of experimental settings and cancer types.
Data availability
The scRNA-seq and WGS validation data from the WASHU MM study can be accessed through SRA (PRJNA694128). The scRNA-seq data from the MDA CopyKAT study can be accessed through GEO (GSE148673) and SRA (PRJNA625321). The NCI-N87 scDNA-seq and scRNA-seq datasets are available on GEO (GSE142750) and SRA (PRJNA498809). The HCA collection of reference expression profiles can be obtained from Synapse under ID syn21041850. The 1000G phasing panel can be downloaded from the IGSR FTP site (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release). The TOPMed phasing panel can be accessed through the TOPMed Imputation Server (https://imputation.biodatacatalyst.nhlbi.nih.gov/).
Code availability
The Numbat algorithm is available at https://github.com/kharchenkolab/numbat. The analysis scripts and notebooks used to reproduce results included in the paper are available at https://github.com/kharchenkolab/NumbatAnalysis.
Acknowledgements
P.V.K., R.S. and T.G. were supported by Synergy grant no. 85629 (KILL-OR-DIFFERENTIATE) from the European Research Council. P.-R.L. was supported by NIH grant no. DP2 ES030554, a Burroughs Wellcome Fund Career Award at the Scientific Interfaces and the Next Generation Fund at the Broad Institute of MIT and Harvard.
Author information
Contributions
P.V.K. and T.G. formulated the study and the overall approach. A.K. carried out proof-of-concept tests of population-based phasing. T.G. developed the detailed algorithms with advice from P.V.K., R.S., H.S. and P.-R.L. T.G. implemented the Numbat package with help from E.B. T.G. and P.V.K. drafted the manuscript. All authors provided suggestions and corrections on the manuscript text.
Ethics declarations
Competing interests
P.V.K. is an employee of Altos Labs and serves on the Scientific Advisory Board to Celsius Therapeutics, Inc. and Biomage, Inc. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks Woong-Yang Park and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Extended data
Extended Data Fig. 1 Haplotype-aware Hidden Markov models.
a, Phase switch probability as a function of genetic distance, estimated from alleles phased from LoH regions in TNBC4. Genetic distance is measured in centimorgans (cM). Error bars represent 95% confidence intervals derived from a binomial test; the center of each error bar is the observed fraction of phase switches. b, Schematic of the conventional and haplotype-aware allele HMMs. t, copy number state transition probability. ps, phase transition probability. c, Schematic of the Numbat joint HMM. Only three copy number states (neutral, deletion, amplification) are included for illustrative purposes.
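As a rough illustration of how a phase switch probability can depend on genetic distance, the sketch below uses a Haldane-type map function. This functional form, the function name phase_switch_prob, and the parameters nu and min_p are assumptions made for illustration; the legend above does not specify the exact parameterization used by Numbat.

```python
import math


def phase_switch_prob(d_cm: float, nu: float = 1.0, min_p: float = 1e-10) -> float:
    """Illustrative phase switch probability between two adjacent SNPs.

    Assumption: switch errors arrive as a Poisson process with rate `nu`
    per centimorgan, so the probability of an odd number of switches over
    a distance of `d_cm` cM is (1 - exp(-2 * nu * d_cm)) / 2.
    `min_p` (hypothetical) keeps the probability strictly positive.
    """
    if d_cm < 0:
        raise ValueError("genetic distance must be non-negative")
    p = (1.0 - math.exp(-2.0 * nu * d_cm)) / 2.0
    return max(p, min_p)


if __name__ == "__main__":
    # The probability rises with distance and saturates at 0.5,
    # at which point the phase of the two SNPs is effectively independent.
    for d in (0.0, 0.01, 0.1, 1.0, 10.0):
        print(f"{d:>6} cM -> {phase_switch_prob(d):.4f}")
```

Under this assumed form, nearby SNPs are almost always in phase, while SNPs far apart carry little information about each other's phase. This is the behavior a haplotype-aware HMM relies on when it combines allele evidence along a chromosome.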
Extended Data Fig. 2 Probabilistic model of gene expression and allele counts from transcriptome sequencing experiments.
\(c_m\), number of maternal chromosome copies. \(c_p\), number of paternal chromosome copies. \(\lambda_i\), observed gene expression magnitude of gene \(i\). \(\lambda_i^{\ast}\), reference gene expression magnitude of gene \(i\). \(\mu\) and \(\sigma^2\), global bias and variance in gene expression. \(\pi_j\), fraction of paternal alleles of SNP \(j\). \(\gamma\), global inverse overdispersion of allele-specific detection. \(l\), library size. \(m_j\), total allele count of SNP \(j\). \(X_i\), observed molecule counts for gene \(i\). \(Y_j\), observed paternal allele count for SNP \(j\).
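To make the link between copy number state and the two observed signals concrete, the sketch below computes the idealized paternal allele fraction and expression fold change implied by the maternal and paternal copy numbers defined above. It is a simplification under stated assumptions: the bias, variance and overdispersion terms are left out, expression is assumed to scale linearly with total copy number, and the function names are hypothetical.

```python
import math


def expected_paternal_fraction(c_m: int, c_p: int) -> float:
    """Idealized paternal allele fraction for a segment.

    Assumption: both haplotypes are expressed in proportion to copy number.
    """
    total = c_m + c_p
    if c_m < 0 or c_p < 0 or total == 0:
        raise ValueError("copy numbers must be non-negative with a positive total")
    return c_p / total


def expected_log2_fold_change(c_m: int, c_p: int) -> float:
    """Idealized log2 expression fold change relative to a diploid reference.

    Assumption: expression scales linearly with total copy number.
    """
    total = c_m + c_p
    if c_m < 0 or c_p < 0 or total == 0:
        raise ValueError("copy numbers must be non-negative with a positive total")
    return math.log2(total / 2)


if __name__ == "__main__":
    states = {
        "neutral": (1, 1),
        "deletion (paternal lost)": (1, 0),
        "amplification (paternal gained)": (1, 2),
        "CNLoH (paternal lost, maternal duplicated)": (2, 0),
        "balanced amplification": (2, 2),
    }
    for name, (c_m, c_p) in states.items():
        frac = expected_paternal_fraction(c_m, c_p)
        lfc = expected_log2_fold_change(c_m, c_p)
        print(f"{name}: paternal fraction={frac:.3f}, log2FC={lfc:.3f}")
```

The example shows why the two signals complement each other. In this idealized picture, copy-neutral LoH shifts the allele fraction without changing expression, while a balanced amplification changes expression without shifting the allele fraction.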
Extended Data Fig. 3 WGS validation of Numbat CNV calls from scRNA-seq data.
For each sample, the DNA profile (top) is juxtaposed with the copy number profile inferred by the Numbat joint HMM (bottom). Gray vertical bars represent centromeres and gap regions. logR, log coverage ratio. BAF, B-allele frequency. logFC, log expression fold-change. pHF, paternal haplotype frequency. BAMP, balanced amplification.
Extended Data Fig. 4 Tumor versus normal cell classification accuracy of Numbat joint model, Numbat expression-only model, and CopyKAT.
Each dot represents a distinct sample (TNBC, n = 5; ATC, n = 4; MM, n = 8). Center line, mean. ATC5 was excluded from the benchmark due to lack of clear expression of tumor marker KRT8.
Extended Data Fig. 5 Numbat reliably distinguishes tumor and normal cells (TNBC series).
The aneuploidy probability is shown as a color gradient (red: high, blue: low). For each sample (row), the series of figures (columns) respectively show the aneuploidy probabilities by expression evidence, those by allele evidence, those by combined evidence, CopyKAT prediction (binary 0 or 1), and marker gene expression in a t-SNE embedding of gene expression profiles.
Extended Data Fig. 6 Numbat reliably distinguishes tumor and normal cells (ATC series).
The aneuploidy probability is shown as a color gradient (red: high, blue: low). For each sample (row), the series of figures (columns) respectively show the aneuploidy probabilities by expression evidence, those by allele evidence, those by combined evidence, CopyKAT prediction (binary 0 or 1), and marker gene expression in a t-SNE embedding of gene expression profiles.
Extended Data Fig. 7 Numbat reliably distinguishes tumor and normal cells (MM series).
The aneuploidy probability is shown as a color gradient (red: high, blue: low). For each sample (row), the series of figures (columns) respectively show the aneuploidy probabilities by expression evidence, those by allele evidence, those by combined evidence, CopyKAT prediction (binary 0 or 1), and marker gene expression in a t-SNE embedding of gene expression profiles.
Extended Data Fig. 8 CNV detection performance as a function of tumor cell fraction.
At each tumor cell fraction, tumor cells were subsampled and mixed with randomly sampled normal cells at the corresponding proportion. Precision, recall and F1 scores were calculated based on the detected segments from scRNA-seq data and the ground truth copy number profiles (from WGS) in 5 multiple myeloma samples. For Numbat, two methods are compared: pseudobulk joint HMM (Numbat-HMM) and iterative optimization (Numbat-iterative) with no minimum pseudobulk size limit. a, Performance for all event types (amplification, deletion, and CNLoH). b, Performance for amplifications. c, Performance for deletions.
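For readers unfamiliar with segment-level benchmarking, the sketch below shows one way to score detected CNV segments against a ground truth profile. It assumes that agreement is measured as overlapping genomic length on a single chromosome and for a single event type; the exact matching rule used for this figure is not stated in the legend, and the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates on one chromosome


def total_length(intervals: List[Interval]) -> int:
    """Summed length of a list of intervals (assumed not to overlap each other)."""
    return sum(end - start for start, end in intervals)


def overlap_length(a: List[Interval], b: List[Interval]) -> int:
    """Total length shared between two interval lists."""
    return sum(
        max(0, min(e1, e2) - max(s1, s2))
        for s1, e1 in a
        for s2, e2 in b
    )


def precision_recall_f1(called: List[Interval], truth: List[Interval]):
    """Length-based precision, recall and F1 of called segments versus truth."""
    tp = overlap_length(called, truth)
    called_len = total_length(called)
    truth_len = total_length(truth)
    precision = tp / called_len if called_len else 0.0
    recall = tp / truth_len if truth_len else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    called = [(0, 100)]   # segment detected from scRNA-seq (toy coordinates)
    truth = [(0, 50)]     # segment in the WGS profile (toy coordinates)
    print(precision_recall_f1(called, truth))
```

In the toy example, the call covers the whole true event but extends past it, so recall is perfect while precision is halved.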
Extended Data Fig. 9 Numbat analysis of gastric cell line (NCI-N87) scRNA-seq data and validation by scDNA-seq.
a, Single-cell copy number landscape and subclonal structure reconstructed from scDNA-seq data. Gray vertical bars represent gap regions. A rooted hierarchical clustering tree is shown on the left. Three subclones were defined by cutting the tree with k=3. Red asterisks denote salient subclonal events. b, Single-cell CNV landscape and subclonal structure inferred from the paired scRNA-seq data by Numbat. The original prediction was composed of four subclones. The uppermost two clones were merged and denoted as the ‘major’ clone. Red asterisks denote validated subclonal events. c, Subclone-specific copy number profiles. For each subclone, the top track shows CNV calls made by the clone-specific Numbat HMM; the bottom track shows the DNA copy number profile of a representative cell from that subclone. Gray vertical bars represent gap regions. d, Numbat recapitulates clonal fractions measured by scDNA-seq. e, Stability and accuracy of Numbat CNV calls for each subclone with respect to parameter variations. F1 scores were computed by comparing the DNA profile of each subclone with the best-matching subclone CNV profile predicted by Numbat. Circles denote F1 scores from initialization with a random tree. Red triangles mark default parameter values.
Extended Data Fig. 10 Single-cell copy number profile and phylogeny reconstructed by Numbat (TNBC and ATC).
Branch lengths correspond to the number of CNV events. Blue dashed line separates predicted tumor and normal cells. Confident subclones are highlighted and marked by red dashed rectangles. The vertical bar on the left of each panel shows cell type ground truth. In TNBC5 and ATC2, the second vertical bar on the left of the panel shows variant allele frequency of a clone-associated mitochondrial mutation. For ATC2, results from the subsampled dataset (including aneuploid cells and 50 randomly sampled normal cells) are shown. In ATC5, some tumor cells were likely mis-annotated as normal in the original annotation.
Supplementary information
Supplementary Information
Supplementary Figs. 1–17, Tables 1–3 and Methods.
About this article
Cite this article
Gao, T., Soldatov, R., Sarkar, H. et al. Haplotype-aware analysis of somatic copy number variations from single-cell transcriptomes. Nat Biotechnol 41, 417–426 (2023). https://doi.org/10.1038/s41587-022-01468-y
DOI: https://doi.org/10.1038/s41587-022-01468-y