Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes


Single-cell transcriptomic analysis is widely used to study human tumors. However, it remains challenging to distinguish normal cell types in the tumor microenvironment from malignant cells and to resolve clonal substructure within the tumor. To address these challenges, we developed an integrative Bayesian segmentation approach called copy number karyotyping of aneuploid tumors (CopyKAT) to estimate genomic copy number profiles at an average genomic resolution of 5 Mb from read depth in high-throughput single-cell RNA sequencing (scRNA-seq) data. We applied CopyKAT to 46,501 single cells from 21 tumors, including triple-negative breast cancer, pancreatic ductal adenocarcinoma, anaplastic thyroid cancer, invasive ductal carcinoma and glioblastoma, and distinguished cancer cells from normal cell types with 98% accuracy. In three breast tumors, CopyKAT resolved clonal subpopulations that differed in the expression of cancer genes, such as KRAS, and of signatures including epithelial-to-mesenchymal transition, DNA repair, apoptosis and hypoxia. These data show that CopyKAT can aid in the analysis of scRNA-seq data in a variety of solid human tumors.
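
As a rough illustration of the read-depth idea behind expression-based copy number inference, the sketch below orders genes by genomic position, centers each cell's log-transformed counts against a set of presumed normal cells, and averages over sliding windows of neighboring genes. It is not CopyKAT's Bayesian segmentation; the function name `relative_copy_profile`, the window size, the scaling factor and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def relative_copy_profile(counts, normal_idx, window=25):
    """Smoothed log-ratio profile per cell (hypothetical helper, not CopyKAT).

    counts     : genes x cells matrix, rows already sorted by genomic position
    normal_idx : column indices of cells assumed to be diploid/normal
    window     : number of neighboring genes averaged together
    """
    counts = np.asarray(counts, dtype=float)
    # Depth-normalize each cell, then log-transform to stabilize variance.
    depth = counts.sum(axis=0, keepdims=True)
    log_expr = np.log2(counts / depth * 1e4 + 1.0)
    # Use the mean of the presumed-normal cells as a per-gene baseline.
    baseline = log_expr[:, normal_idx].mean(axis=1, keepdims=True)
    rel = log_expr - baseline
    # Running mean along the genome to suppress gene-level noise.
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 0, rel
    )


# Toy data: 400 genes, 3 normal cells, 3 tumor cells with a gain on genes 100-199.
rng = np.random.default_rng(0)
mean_expr = np.full((400, 6), 20.0)
mean_expr[100:200, 3:] *= 2.0  # simulated copy number gain in tumor cells
toy_counts = rng.poisson(mean_expr)

profile = relative_copy_profile(toy_counts, normal_idx=[0, 1, 2], window=25)
gain_signal = profile[100:176, 3:].mean()  # windows fully inside the gain
flank_signal = profile[:76, 3:].mean()     # windows fully outside the gain
print(profile.shape, gain_signal > flank_signal)
```

In this toy setting the tumor cells show a positive smoothed log-ratio over the gained region and a slightly negative one elsewhere, because depth normalization redistributes the extra reads. A real analysis would also need gene filtering, a principled way to identify the normal baseline, and a segmentation step.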

Fig. 1: Overview of the CopyKAT analysis workflow.
Fig. 2: Comparison of bulk DNA and single-cell RNA copy number profiles.
Fig. 3: Classification of cancer cells and normal cells in human tumors.
Fig. 4: Classification of tumor and normal cells sequenced by different scRNA-seq technologies.
Fig. 5: Clonal substructure of three triple-negative breast tumors.

Data availability

scRNA-seq data from this study were deposited in the Gene Expression Omnibus (GEO; GSE148673).

Code availability

Software is available at GitHub (




This work was supported by grants to N.E.N. from the American Cancer Society (129098-RSG-16-092-01-TBG), the National Cancer Institute (RO1CA240526, RO1CA236864), the Emerson Collective Cancer Research Fund (20200619153514) and the CPRIT Single Cell Genomics Center (RP180684). N.E.N. is an AAAS Wachtel Scholar, AAAS Fellow, Andrew Sabin Family Fellow and Jack & Beverly Randall Innovator. This study was supported by the MD Anderson Breast Cancer Moonshot Program. This study was supported by the MD Anderson Sequencing Core Facility Grant (CA016672). This project was also supported by a Susan Komen Postdoctoral Fellowship to R.G. (PDF17487910). Other grant support includes the Anaplastic Thyroid Cancer Research Fund (S.Y.L. and J.R.W.) and an institutional multi-investigator research program grant to S.Y.L.

Author information




R.G. and N.E.N. designed the research project. R.G. developed and implemented the computational methods with contributions from N.E.N., Y.Y., A.D., F.W. and K.C. M.H. preprocessed the data. S.F.S. and S.M. provided clinical samples. J.R.W. and S.Y.L. collected thyroid tumor samples. S.B., Y.C.H., Y.L., A.S., T.K. and E.S. performed single-cell sequencing experiments. R.G. and N.E.N. wrote the manuscript with input from all authors.

Corresponding author

Correspondence to Nicholas E. Navin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Biotechnology thanks Elana Fertig, Jan Korbel and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–4 and Tables 1 and 2.

Reporting Summary

About this article

Cite this article

Gao, R., Bai, S., Henderson, Y.C. et al. Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes. Nat Biotechnol 39, 599–608 (2021).
