Cardelino: computational integration of somatic clonal substructure and single-cell transcriptomes

Abstract

Bulk and single-cell DNA sequencing have enabled the reconstruction of clonal substructures of somatic tissues from the frequency and co-occurrence patterns of somatic variants. However, approaches to characterize phenotypic variations between clones are not established. Here we present cardelino (https://github.com/single-cell-genetics/cardelino), a computational method for inferring the clonal tree configuration and the clone of origin of individual cells assayed using single-cell RNA-seq (scRNA-seq). Cardelino flexibly integrates information from imperfect clonal trees inferred from bulk exome-seq data and sparse variant alleles expressed in scRNA-seq data. We apply cardelino to a published cancer dataset and to newly generated matched scRNA-seq and exome-seq data from 32 human dermal fibroblast lines, identifying hundreds of differentially expressed genes between cells from different somatic clones. These genes are frequently enriched for cell cycle and proliferation pathways, indicating a role for cell division genes in somatic evolution in healthy skin.

Fig. 1: Overview and validation of the cardelino model.
Fig. 2: Parallel deep-exome sequencing and scRNA-seq profiling of 32 human dermal fibroblast lines.
Fig. 3: Clone-specific transcriptome profiles reveal gene expression differences for joxm, one example line.
Fig. 4: Signatures of transcriptomic clone-to-clone variation across 31 lines.

Data availability

scRNA-seq data have been deposited in the ArrayExpress database at EMBL-EBI (www.ebi.ac.uk/arrayexpress) under accession number E-MTAB-7167. WES data are available through the HipSci portal (www.hipsci.org). The lines used in this study have the identifiers: euts, fawm, feec, fikt, garx, gesg, heja, hipn, ieki, joxm, kuco, laey, lexy, naju, nusw, oaaz, oilg, pipw, puie, qayj, qolg, qonc, rozh, sehl, ualf, vass, vils, vuna, wahn, wetu, xugn, zoxy. Metadata, processed data and large results files are available at https://doi.org/10.5281/zenodo.1403510.

Code availability

The cardelino methods are implemented in an open-source, publicly available R package (github.com/single-cell-genetics/cardelino). The code used to process and analyse the data is available (github.com/davismcc/fibroblast-clonality), with a reproducible workflow implemented in Snakemake (ref. 64). Descriptions of how to reproduce the data processing and analysis workflows, with html output showing code and figures presented in this paper, are available at davismcc.github.io/fibroblast-clonality. Docker images providing the computing environment and software used for data processing (hub.docker.com/r/davismcc/fibroblast-clonality/) and data analyses in R (hub.docker.com/r/davismcc/r-singlecell-img/) are publicly available.

References

  1. Burnet, F. M. Intrinsic mutagenesis: a genetic basis of ageing. Pathology 6, 1–11 (1974).

  2. Martincorena, I. & Campbell, P. J. Somatic mutation in cancer and normal cells. Science 349, 1483–1489 (2015).

  3. Stransky, N. et al. The mutational landscape of head and neck squamous cell carcinoma. Science 333, 1157–1160 (2011).

  4. Hodis, E. et al. A landscape of driver mutations in melanoma. Cell 150, 251–263 (2012).

  5. Huang, K.-L. et al. Pathogenic germline variants in 10,389 adult cancers. Cell 173, 355–370.e14 (2018).

  6. Nik-Zainal, S. et al. Mutational processes molding the genomes of 21 breast cancers. Cell 149, 979–993 (2012).

  7. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).

  8. Forbes, S. A. et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 45, D777–D783 (2017).

  9. Bailey, M. H. et al. Comprehensive characterization of cancer driver genes and mutations. Cell 173, 371–385.e18 (2018).

  10. Ding, L. et al. Perspective on oncogenic processes at the end of the beginning of cancer genomics. Cell 173, 305–320.e10 (2018).

  11. Roth, A. et al. PyClone: statistical inference of clonal population structure in cancer. Nat. Methods 11, 396 (2014).

  12. Deshwar, A. G. et al. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 16, 35 (2015).

  13. Jiang, Y., Qiu, Y., Minn, A. J. & Zhang, N. R. Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing. Proc. Natl Acad. Sci. USA 113, E5528–E5537 (2016).

  14. Navin, N. et al. Tumour evolution inferred by single-cell sequencing. Nature 472, 90–94 (2011).

  15. Wang, Y. et al. Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature 512, 155–160 (2014).

  16. Navin, N. E. The first five years of single-cell cancer genomics and beyond. Genome Res. 25, 1499–1507 (2015).

  17. Kim, K. I. & Simon, R. Using single cell sequencing data to model the evolutionary history of a tumor. BMC Bioinf. 15, 27 (2014).

  18. Navin, N. E. & Chen, K. Genotyping tumor clones from single-cell data. Nat. Methods 13, 555–556 (2016).

  19. Jahn, K., Kuipers, J. & Beerenwinkel, N. Tree inference for single-cell data. Genome Biol. 17, 86 (2016).

  20. Kuipers, J., Jahn, K., Raphael, B. J. & Beerenwinkel, N. Single-cell sequencing data reveal widespread recurrence and loss of mutational hits in the life histories of tumors. Genome Res. 27, 1885–1894 (2017).

  21. Roth, A. et al. Clonal genotype and population structure inference from single-cell tumor sequencing. Nat. Methods 13, 573–576 (2016).

  22. Salehi, S. et al. ddClone: joint statistical inference of clonal populations from single cell and bulk tumour sequencing data. Genome Biol. 18, 44 (2017).

  23. Malikic, S. et al. Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data. Nat. Commun. 10, 2750 (2019).

  24. Müller, S. et al. Single‐cell sequencing maps gene expression to mutational phylogenies in PDGF‐ and EGF‐driven gliomas. Mol. Syst. Biol. 12, 889 (2016).

  25. Tirosh, I. et al. Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma. Nature 539, 309–313 (2016).

  26. Fan, J. et al. Linking transcriptional and genetic tumor heterogeneity through allele analysis of single-cell RNA-seq data. Genome Res. 28, 1217–1227 (2018).

  27. Campbell, K. R. et al. clonealign: statistical integration of independent single-cell RNA and DNA sequencing data from human cancers. Genome Biol. 20, 54 (2019).

  28. Giustacchini, A. et al. Single-cell transcriptomics uncovers distinct molecular signatures of stem cells in chronic myeloid leukemia. Nat. Med. 23, 692–702 (2017).

  29. Cheow, L. F. et al. Single-cell multimodal profiling reveals cellular epigenetic heterogeneity. Nat. Methods 13, 833–836 (2016).

  30. Saikia, M. et al. Simultaneous multiplexed amplicon sequencing and transcriptome profiling in single cells. Nat. Methods 16, 59–62 (2019).

  31. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).

  32. Kilpinen, H. et al. Common genetic variation drives molecular heterogeneity in human iPSCs. Nature 546, 370–375 (2017).

  33. Williams, M. J. et al. Quantification of subclonal selection in cancer from bulk sequencing data. Nat. Genet. 50, 895–903 (2018).

  34. Martincorena, I. et al. Universal patterns of selection in cancer and somatic tissues. Cell 173, 1823 (2018).

  35. Simons, B. D. Deep sequencing as a probe of normal stem cell fate and preneoplasia in human epidermis. Proc. Natl Acad. Sci. USA 113, 128–133 (2016).

  36. Williams, M. J., Werner, B., Barnes, C. P., Graham, T. A. & Sottoriva, A. Identification of neutral tumor evolution across cancer types. Nat. Genet. 48, 238 (2016).

  37. Ramaker, R. C. et al. RNA sequencing-based cell proliferation analysis across 19 cancers identifies a subset of proliferation-informative cancers with a common survival signature. Oncotarget. 8, 38668–38681 (2017).

  38. Kowalczyk, M. S. et al. Single-cell RNA-seq reveals changes in cell cycle and differentiation programs upon aging of hematopoietic stem cells. Genome Res. 25, 1860–1872 (2015).

  39. Tsang, J. C. H. et al. Single-cell transcriptomic reconstruction reveals cell cycle and multi-lineage differentiation defects in Bcl11a-deficient hematopoietic stem cells. Genome Biol. 16, 178 (2015).

  40. Kolodziejczyk, A. A. et al. Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell 17, 471–485 (2015).

  41. Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189–196 (2016).

  42. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).

  43. Guo, H. et al. Single-cell methylome landscapes of mouse embryonic stem cells and early embryos analyzed using reduced representation bisulfite sequencing. Genome Res. 23, 2126–2135 (2013).

  44. Smallwood, S. A. et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat. Methods 11, 817–820 (2014).

  45. Picelli, S. et al. Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181 (2014).

  46. Streeter, I. et al. The human-induced pluripotent stem cell initiative—data resources for cellular genetics. Nucleic Acids Res. 45, 691–697 (2016).

  47. Church, D. M. et al. Modernizing reference genome assemblies. PLoS Biol. 9, e1001091 (2011).

  48. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at arXiv [q-bio.GN] (2013).

  49. Li, H. et al. The sequence alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  50. Karczewski, K. J. et al. The ExAC browser: displaying reference data information from over 60 000 exomes. Nucleic Acids Res. 45, D840–D845 (2017).

  51. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).

  52. Fisher, R. A. On the interpretation of χ2 from contingency tables, and the calculation of P. J. R. Stat. Soc. 85, 87–94 (1922).

  53. Gori, K. & Baez-Ortega, A. sigfit: flexible Bayesian inference of mutational signatures. Preprint at bioRxiv https://doi.org/10.1101/372896 (2018).

  54. Flicek, P. et al. Ensembl 2014. Nucleic Acids Res. 42, D749–D755 (2014).

  55. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).

  56. McCarthy, D. J., Campbell, K. R., Lun, A. T. L. & Wills, Q. F. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).

  57. Lun, A. T. L., Bach, K. & Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).

  58. Hoffman, G. E. & Schadt, E. E. variancePartition: interpreting drivers of variation in complex gene expression studies. BMC Bioinf. 17, 483 (2016).

  59. Lund, S. P., Nettleton, D., McCarthy, D. J. & Smyth, G. K. Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Stat. Appl. Genet. Mol. Biol. 11, https://doi.org/10.1515/1544-6115.1826 (2012).

  60. Soneson, C. & Robinson, M. D. Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods 15, 255–261 (2018).

  61. Wu, D. & Smyth, G. K. Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic Acids Res. 40, e133 (2012).

  62. Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).

  63. Ignatiadis, N., Klaus, B., Zaugg, J. B. & Huber, W. Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nat. Methods 13, 577–580 (2016).

  64. Köster, J. & Rahmann, S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012).

  65. Smyth, G. K. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3, 1–25 (2004).

  66. Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).

Acknowledgements

We thank D. Jörg for highly constructive discussions and P. Qiao for valuable comments on the manuscript. We acknowledge the Wellcome Sanger Institute Cellular Genetics and Phenotyping teams (in particular, A. Alderton, C. Gomez, R. Boyd, S. Patel and S. Barnett) and DNA pipelines for their invaluable assistance in generating the data for this study. We thank G. Kildisiute for assisting in CNV analysis of the fibroblast lines. This project was supported by Wellcome Sanger core funding (WT206194) and the Human Induced Pluripotent Stem Cell Initiative. Research in the Stegle laboratory is supported by the BMBF, the Volkswagen Foundation, the Chan Zuckerberg Initiative and the EU (ERC project DECODE, grant agreement 732546). D.J.M. is supported by the National Health and Medical Research Council of Australia (grants APP1112681 and APP1162829), seed funding from the Baker Foundation and the Holyoake Research Fellowship at St Vincent's Institute of Medical Research and the University of Melbourne. R.R. is supported by the BBSRC Doctoral Training Programme. Y.H. is supported by the University of Cambridge and EMBL-EBI through an EBPOD postdoctoral fellowship. D.J.K. is supported by the Wellcome Trust under grants 203828/Z/16/A and 203828/Z/16/Z. T.H. is supported by a Human Frontier Science Program Fellowship, an EMBO Long-term Fellowship and an EMBO Advanced Fellowship.

Author information

Contributions

R.R., T.H. and S.A.T. conceived and planned the experiments. R.R. and T.H. carried out the experiments. Y.H., D.J.M. and O.S. developed the computational methods. Y.H. developed the statistical model and the implementation. Y.H. and D.J.M. wrote the software. Y.H. carried out all simulation experiments and benchmarked alternative methods. The HipSci Consortium provided the cell lines and exome sequencing data. P.D. conducted somatic variant calling from exome sequencing data. D.J.G. advised on somatic variant calling approaches and the mutational signatures analysis carried out by R.R. D.J.M. and M.J.B. developed data processing workflows and D.J.M. processed the fibroblast scRNA-sequencing data. R.L. and D.J.M. processed the melanoma scRNA-sequencing data. D.J.K. conducted the selection analyses, supervised by B.D.S. D.J.M. and Y.H. carried out clonal inference and cell-assignment analyses. D.J.M. conducted differential gene and pathway expression analyses and integrated the computational analyses into a reproducible workflow. D.J.M. and R.R. took the lead in writing the manuscript. D.J.M., R.R. and Y.H. drafted the manuscript and designed the figures. W.W. suggested improvements to somatic variant calling and DE analyses. S.A.T. and O.S. conceived of the study, planned and supervised the work. All authors contributed to the interpretation of results and commented on and approved the final manuscript. The HipSci Consortium generated and provided early access to the fibroblast lines used in this work (see Supplementary Note for a full list of consortium members).

Corresponding authors

Correspondence to Oliver Stegle or Sarah A. Teichmann.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nicole Rusk and Lin Tang were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Graphical representation of the cardelino model.

The clonal tree configuration matrix C is a random variable and follows a Bernoulli distribution encoded by an input tree configuration Ω that is provided to the model (for example estimated from bulk or single-cell DNA-seq data using existing methods such as Canopy) as well as an error rate ξ, which follows a beta prior distribution with hyper parameters 𝜅. The indicator matrix I defines the assignment of cells to clones, which is another unknown variable, and assumed to follow a multinomial prior with fixed parameter 𝜋 for each cell. The clone configuration C and cell identity I together encode the genotype c_{i,I_j} of each variant i in each cell j. If c_{i,I_j} is 1, the alternative allelic read count will follow a binomial distribution with gene specific parameter 𝜃_i, otherwise with error related parameter 𝜃_0. Both 𝜃_i and 𝜃_0 have a beta prior distribution, but with different parameters. Shaded nodes represent observed variables; unshaded nodes represent unknown variables; yellow circled nodes represent fixed hyper parameters.
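
The generative structure in this caption can be illustrated with a small forward simulation. This is a minimal sketch and not the cardelino implementation: the function name and arguments are hypothetical, the error rate and the binomial parameters are passed in as fixed numbers instead of being drawn from their beta priors, and the read depth is held constant across variants and cells.

```python
import random

def simulate_clonal_reads(omega, xi, pi, theta, theta0, depth, n_cells, seed=0):
    """Draw one data set from a simplified version of the model (hypothetical helper).

    omega  : guide configuration, omega[i][k] in {0, 1} for variant i and clone k
    xi     : probability that an entry of C differs from the guide (fixed here)
    pi     : prior clone probabilities, shared by all cells
    theta  : per-variant alternative-allele rate when the variant is present
    theta0 : alternative-allele rate when the variant is absent (error term)
    depth  : total reads per variant per cell (constant, an assumption)
    """
    rng = random.Random(seed)
    n_var, n_clone = len(omega), len(omega[0])
    # C: each entry keeps the guide value with probability 1 - xi
    C = [[o if rng.random() >= xi else 1 - o for o in row] for row in omega]
    # I: clone of origin for each cell, drawn from the multinomial prior
    I = rng.choices(range(n_clone), weights=pi, k=n_cells)
    # A: alternative read counts, binomial given the genotype c_{i,I_j}
    A = [[sum(rng.random() < (theta[i] if C[i][I[j]] else theta0)
              for _ in range(depth))
          for j in range(n_cells)]
         for i in range(n_var)]
    return C, I, A
```

Inference in the model runs in the opposite direction, from the observed read counts back to C and I; the sketch only shows how the variables depend on one another.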

Supplementary Figure 2 Distribution of key data characteristics from experimental scRNA-seq data from 32 fibroblast cell lines, used as the basis of parameter settings in simulations.

(a) Number of clones inferred from bulk exome-seq data. (b) The median number of variants per clonal branch; (c) The overall coverage of variants, namely the fraction of variants with at least one read. (d) Scatter plot between the mean number of reads per variant per cell and the overall coverage of variants in the same line. The default parameters used in simulations are highlighted with the red line.

Supplementary Figure 3 Simulation results evaluating the inferred relax (error) rate in the configuration of variants in the guide clonal tree.

(a) The estimated relax rate as a function of the simulated error rates. Errors are simulated by uniformly swapping the mutation states in the configuration matrix of the guide clonal tree, which means that a clone may contain false mutations in the guide clonal tree provided to cardelino (except in the case of the base clone which has no mutations under any simulation conditions). (b) The estimated relax rate across different fractions of variants that have wrong branch configuration. Errors are added by swapping branches for variants.
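
The error simulation in panel (a) can be sketched as independent flips of mutation states. This is a sketch under stated assumptions: the function name is hypothetical, the base clone is taken to be column 0, and each remaining entry is flipped independently with the given error rate.

```python
import random

def flip_mutation_states(config, error_rate, seed=0):
    """Return a noisy copy of a variant-by-clone 0/1 matrix (hypothetical helper)."""
    rng = random.Random(seed)
    noisy = []
    for row in config:
        new_row = [row[0]]  # column 0 = base clone (assumption); never altered
        for state in row[1:]:
            # flip 0 <-> 1 with probability error_rate
            new_row.append(1 - state if rng.random() < error_rate else state)
        noisy.append(new_row)
    return noisy
```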

Supplementary Figure 4 Additional results from assessing cardelino and alternative methods using simulated data.

Assessment of cell assignment to clones across a variety of simulation settings, considering SingleCellGenotyper (SCG), Demuxlet, cardelino and its two versions: cardelino-free without any informative clone configuration prior and cardelino-fixed assuming that the clone configuration prior is correct (Methods; Supplementary Note). All methods were applied to simulated data with known ground truth, varying (a) the number of informative variants per clonal branch, (b) the fraction of informative variants covered (that is, nonzero scRNA-seq read coverage), (c) the total number of clones, (d) the precision (i.e., inverse variance) of allelic ratio across genes; lower precision means more genes with high allelic imbalance, (e) the rate of general errors of mutation states in the clone configuration matrix, (f) the fraction of wrongly clustered variants in the input clonal tree branch. Default parameter values are marked with an asterisk and are retained when varying other parameters. (g) The effects of the tree topology on the cell assignment accuracy. In the simulations there are 50 repeats for each parameter, where one of the tree topology candidates is randomly selected in each repeat. For the four-clone configurations, there are four different tree topologies (upper panel), and the performance (area under the precision-recall curve) of the five methods is shown separately for each topology (bottom panel).

Supplementary Figure 5 Estimated mutational signature exposures based upon the tri-nucleotide context of somatic SNVs called from whole-exome sequencing (WES) data for n=32 HipSci human fibroblast lines.

The x-axis shows 30 COSMIC mutational signatures, in order, and the y-axis shows estimated exposures (mutation fraction) using the sigfit package (Methods), with significant signatures highlighted in blue. Across lines, the only significant signatures are Signature 7 (UV mutagenic process) and Signature 11.

Supplementary Figure 6 Variant allele frequency (VAF) distributions for somatic variants called from whole exome sequencing data for the 32 fibroblast lines.

The grey lines indicate the minimum allele-frequency threshold (0.05) used for variants for this analysis (Methods). The blue lines indicate the model (neutral/selected) inferred by SubClonalSelection (shading indicates the 95% confidence interval). Donors with a selection probability below 0.3 are classified as ‘neutral’, above 0.7 as ‘selected’. Donors that are neither ‘selected’ nor ‘neutral’ remain ‘undetermined’. High confidence ‘selected’ lines (selection probability >0.7 and >100 somatic variants) are: joxm, wahn, garx, vass, ualf, euts, pipw, oilg, feec, fikt, qolg, and puie.
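
The classification rule stated in this caption can be written out directly. The function names below are hypothetical; the thresholds (0.3, 0.7 and 100 variants) are those given in the caption, and values exactly on a threshold fall into ‘undetermined’ because the caption uses strict inequalities.

```python
def classify_line(selection_prob):
    """Map a selection probability to the caption's three categories."""
    if selection_prob < 0.3:
        return "neutral"
    if selection_prob > 0.7:
        return "selected"
    return "undetermined"

def is_high_confidence_selected(selection_prob, n_somatic_variants):
    """High-confidence 'selected': probability > 0.7 and more than 100 variants."""
    return selection_prob > 0.7 and n_somatic_variants > 100
```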

Supplementary Figure 7 Comparison of five methods on simulated data matching 32 fibroblast cell lines and estimated error rate and cell assignability with cardelino from experimental data for 32 fibroblast lines.

(a) Assessment of cell assignment to clones across a variety of simulation settings, considering SingleCellGenotyper (SCG), Demuxlet, cardelino, cardelino-free and cardelino-fixed (Methods; Supplementary Note). Considered are simulated data based on empirical characteristics observed in 32 fibroblast lines. For each line, the sequence coverage, clone configuration (i.e., number of clones, variants on each branch), and allelic imbalance parameters were obtained to derive simulation parameters. For each line, 200 cells are synthesised and a guide clonal tree with 10% errors in the allocation of variants to clones is used. (b) Estimated error rate in the clonal tree configuration derived from bulk exome-seq data (based on cardelino) for each of 32 lines versus fraction of confidently assigned cells (>90% of cells assigned for 23 lines; at cardelino posterior probability P>0.5 for most-probable clone).
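
The "fraction of confidently assigned cells" in panel (b) can be computed from a cell-by-clone matrix of posterior probabilities. A minimal sketch, assuming each row holds one cell's posterior over clones; the function names are hypothetical and the default threshold of 0.5 is the one quoted in the caption.

```python
def assign_cells(posterior, threshold=0.5):
    """Return the most probable clone per cell, or None if P <= threshold."""
    assignments = []
    for probs in posterior:
        best = max(range(len(probs)), key=probs.__getitem__)
        assignments.append(best if probs[best] > threshold else None)
    return assignments

def assigned_fraction(posterior, threshold=0.5):
    """Fraction of cells whose highest posterior exceeds the threshold."""
    assignments = assign_cells(posterior, threshold)
    return sum(a is not None for a in assignments) / len(assignments)
```

Raising the threshold trades assignment rate for confidence, so the same function can be evaluated over a range of thresholds.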

Supplementary Figure 8 Comparison of cell assignment between five methods on experimental data across 32 fibroblast lines.

(a) The fraction of assignable cells (i.e., highest P > threshold) when varying the threshold from 0.5 to 0.95. Shown are box plots depicting the median and the first and third quartiles of the 32 lines. (b) The adjusted Rand index of cell assignment to clones between the five considered methods. The values are averaged across 32 fibroblast lines. (c) Scatter plot between the uncertainty of the inferred tree from cardelino-free (x-axis) and the mean absolute difference of the assignment probability between cardelino-free and cardelino (y-axis). The output posterior clonal configuration matrix from cardelino-free consists of the probability of each variant being present in each clone. A completely uninformative clonal tree would have all entries equal to 0.5. Thus, we measure the uncertainty of the output tree from cardelino-free by taking 0.5 minus the mean absolute difference between the posterior probability configuration matrix and the uninformative configuration probability matrix with all entries equal to 0.5. With this measure, a value of 0.5 indicates a posterior configuration indistinguishable from the uninformative configuration and a value of 0 indicates very high confidence from the model in the posterior configuration. (d) The comparison of cell assignment for one representative line (feec) when using different guide clonal trees sampled from Canopy’s posterior distribution as input. Each violin plot shows the adjusted Rand index of cell assignment between each of 435 tree pairs combining the 30 most probable trees from bulk exome-seq for the feec line. (e) Cell assignment similarity for each of the 32 lines when using different guide clonal trees, quantified with adjusted Rand index values between different pairs of guide clonal trees. For each line, we take the 30 most probable posterior trees from Canopy, and then each dot in the box plot denotes the average adjusted Rand index value for one line, calculated from 435 of these pairwise comparisons.
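
The uncertainty measure defined in panel (c), and the count of 435 tree pairs in panels (d) and (e), can be checked with a few lines. A minimal sketch with hypothetical function names; the input is a variant-by-clone matrix of posterior probabilities.

```python
def tree_uncertainty(post_config):
    """0.5 minus the mean absolute deviation of all entries from 0.5."""
    entries = [p for row in post_config for p in row]
    return 0.5 - sum(abs(p - 0.5) for p in entries) / len(entries)

def n_unordered_pairs(n):
    """Number of distinct pairs among n items, n * (n - 1) / 2."""
    return n * (n - 1) // 2
```

A matrix with every entry at 0.5 gives 0.5 (uninformative), a matrix of only 0s and 1s gives 0 (fully confident), and 30 trees yield 30 × 29 / 2 = 435 pairs.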

Supplementary Figure 9 Cell-clone assignment rates from cardelino.

(a) Scatter plot of the fraction of cells assigned in each cell line using cardelino (at posterior probability > 0.5) as a function of the minimum number of clone-specific variants for the corresponding line (minimum Hamming distance between clones for a given donor), for 32 fibroblast lines. The total number of cells considered for this analysis (QC passed) per line is indicated by colour. (b) Scatter plot of recall (assignment rate) versus precision (assignment accuracy) when assigning cells using cardelino (at posterior probability > 0.5). Shown are data for 32 simulated lines, using parameters that match the observed data characteristics in the set of 32 real fibroblast lines (Methods). The average number of variants per clonal branch (i.e., #variant/(#clone - 1)) is shown by point colour (slightly different from Supplementary Fig. 4 which uses the minimum number of variants distinguishing between pairs of clones, as shown in Fig. 3a). Lines with fewer informative variants per branch tend to have lower assignment rates, but the precision remains high.
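
The two summary quantities used in this figure, the minimum Hamming distance between clones and the average number of variants per clonal branch, can be computed from a variant-by-clone 0/1 configuration matrix. A minimal sketch with hypothetical function names, assuming at least two clones.

```python
from itertools import combinations

def min_hamming_between_clones(config):
    """Smallest number of variants that differ between any two clones."""
    n_clones = len(config[0])
    return min(
        sum(row[a] != row[b] for row in config)
        for a, b in combinations(range(n_clones), 2)
    )

def mean_variants_per_branch(config):
    """#variant / (#clone - 1), as defined in the caption."""
    return len(config) / (len(config[0]) - 1)
```

For a three-variant, three-clone matrix such as `[[0, 1, 1], [0, 1, 1], [0, 0, 1]]`, the closest pair of clones differs at a single variant and there are 1.5 variants per branch.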

Supplementary Figure 10 Clone prevalence estimates from WES data (x-axis; using Canopy) versus the fraction of single-cell transcriptomes assigned to the clone (y-axis; using cardelino), for each clone across lines.

Points are coloured by the overall fraction of single-cell transcriptomes assigned for a given line (i.e. cells with posterior P>0.5 for assignment).

Supplementary Figure 11 Direct effects of somatic variants on genes overlapping the variant.

Volcano plot showing negative log P values versus log2-fold change from testing differential expression for genes with a somatic mutation between cells with the mutation and cells without the mutation, faceted by VEP annotation category (Methods). Each point represents a gene, and box plots show the overall log2-fold change distribution for each annotation category. DE tests (two-sided QL F test in edgeR) are conducted within each line (donor) separately, and the results shown here are aggregated across n=32 lines. Genes are categorised by simplified functional annotations from VEP of the somatic mutation, and genes significantly DE at an FDR threshold of 20% are shown in red.
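
Selecting genes at an FDR threshold requires adjusting the per-gene P values for multiple testing. The caption does not name the adjustment procedure; the sketch below assumes the Benjamini-Hochberg method, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest P value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose adjusted value is at or below 0.2 would then be the ones highlighted at a 20% FDR threshold.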

Supplementary Figure 12 Gene set enrichment results for fibroblast data from n=32 lines.

(a) Heatmap showing Spearman correlation between gene set enrichment results for the 16 most frequently enriched MSigDB Hallmark gene sets across 31 lines. Colour indicates the correlation between pairs of gene sets and is only shown if the correlation is significant (P < 0.05). (b) Heatmap showing proportion of overlap in genes between pairs of gene sets (matching those in left panel). (c) Heatmap showing the direction (first listed clone relative to second listed clone; in colour) and strength of enrichment (-log10(P) as degree of shading) for Hallmark gene sets tested with camera (Methods) for all pairwise comparisons between clones across n=31 lines. Gene sets that are significantly enriched at an FDR threshold of 5% are indicated with dots. Gene sets are shown if significant in at least one line and are ordered by number of lines in which they are significant.

Supplementary Figure 13 Results from five human melanoma samples.

(a) Number of cells assigned by cardelino to each inferred clone for five melanoma patients, stratified by cell type identified using gene expression of marker genes as in the original publication 37. (b) Gene set enrichment analysis results when comparing gene expression in clone1 cells to cells in other clones, within each patient, including cells from all cell types. Given that immune cells and cancer-associated fibroblast (CAF) cells are almost all assigned to clone1, this comparison effectively reflects expression differences between melanoma and immune cells. (c) Gene set enrichment analysis results when considering all pairwise comparisons between clones consisting of melanoma cells only. The heatmaps in (b) and (c) depict signed P-values of gene set enrichment (n=31 cell lines; two-sided test using camera) for Hallmark gene sets found to be significantly enriched (FDR<0.05) in at least one comparison. Dots denote significant enrichments. For details on the cell assignment and gene set enrichment analyses see Supplementary Note. (d) Heatmap showing correlations between gene set enrichment results when using all cells (across melanoma, immune and cancer-associated fibroblast cell types) assigned to clones across five melanoma patients and comparing expression of cells assigned to clone1 to those assigned to other clones. (e) Heatmap showing correlations between gene set enrichment results when using all melanoma cells assigned to clones across five melanoma patients and comparing expression of cells between all pairs of clones (for which the clones have sufficiently many cells assigned). For both (d) and (e), the heatmap shows Spearman correlation between gene set enrichment results for the 16 most frequently enriched MSigDB Hallmark gene sets across n=5 patients. Colour indicates the correlation between pairs of gene sets and is only shown if the correlation is significant (P < 0.05).

Supplementary information

Supplementary Figs. 1–13, Tables 1 and 2 and Note.

Cite this article

McCarthy, D.J., Rostom, R., Huang, Y. et al. Cardelino: computational integration of somatic clonal substructure and single-cell transcriptomes. Nat Methods 17, 414–421 (2020). https://doi.org/10.1038/s41592-020-0766-3

  • DOI: https://doi.org/10.1038/s41592-020-0766-3
