Clonal genotype and population structure inference from single-cell tumor sequencing

Journal name: Nature Methods
Volume: 13
Pages: 573–576
DOI: 10.1038/nmeth.3867

Single-cell DNA sequencing has great potential to reveal the clonal genotypes and population structure of human cancers. However, single-cell data suffer from missing values, biased allelic counts, and false genotype measurements that arise when multiple cells are sequenced together. We describe the Single Cell Genotyper (https://bitbucket.org/aroth85/scg), open-source software based on a statistical model coupled with a mean-field variational inference method, which addresses these problems and robustly infers clonal genotypes.

Figures

  1. Figure 1: Overview of the SCG model.

    (a,b) Histograms of variant allele frequencies of diploid heterozygous loci from (a) bulk sequencing and (b) single-cell sequencing of a 184-hTert cell line sample. The same set of loci is used for single-cell and bulk sequencing. (c) Schematic workflow of a single-cell sequencing experiment. The SCG model is applied to the discrete data input matrix to cluster the data, predict clonal genotypes, and infer the prevalence of clones. (d) Probabilistic graphical model representing the basic SCG model. Shaded nodes represent observed or fixed values; a posterior distribution over the values of the unshaded nodes is approximated using a variational Bayesian method. π, clone prevalence; Zn, variable indicating clone of origin for cell n; Gkm, variable indicating genotype of locus m for clone k; εs, error profile for genotype state s; Xnm, observed data from cell n and locus m; γs, parameter of Dirichlet distribution prior for the error profile for genotype state s; and κ, parameter of Dirichlet distribution prior for the clone prevalence.

  2. Figure 2: Comparison of clustering performance on real data with doublets.

    SCG3, D-SCG3, and CMM3 models were used to identify clones in single-cell sequence data from a patient with childhood acute lymphoblastic leukemia. The same data were included in all plots, and doublet cells predicted by D-SCG3 were arbitrarily assigned to clusters. (a–c) Raw data ordered by cluster for (a) CMM3, (b) SCG3, and (c) D-SCG3 models. (d–f) Predicted genotypes for each cluster for (d) CMM3, (e) SCG3, and (f) D-SCG3 models. (g–i) Maximum-parsimony trees relating clonal genotypes for predicted genotypes from (g) CMM3, (h) SCG3, and (i) D-SCG3 models. Clusters are annotated to the left of each heat map and at each cladogram branch terminus. Branch lengths are expressed as the normalized proportion of change (i.e., the fraction of loci that change).

  3. Figure 3: The D-SCG3 model identified clonal cell populations in multiple samples from an HGSOC patient.

    The data set contained both SNV and breakpoint events. (a) Input data ordered by cluster (inner left bar). Originating samples for each cell are annotated on the far left. (b) Predicted clonal genotypes. (c) Maximum-parsimony tree relating clonal genotypes. (d) Estimated prevalence of clones across samples. Error bars indicate ±1 s.d. from the posterior mean, based on posterior distribution estimates from the model. Samples consisted of 84 cells each. By cluster, n = 123 (cluster 0), 102 (cluster 1), 75 (cluster 2), 37 (cluster 3), 35 (cluster 4), and 20 (cluster 5). LOv, left ovary; Om, omentum; ROv, right ovary.

  4. Supplementary Fig. 1: Performance comparison using 90 synthetic data sets without doublets.

    (a) Example synthetic data used for benchmarking. (b) V-measure metric used to assess clustering performance (higher is better). (c,d) Mean Hamming distance between the predicted and true genotypes of each cell in the (c) two-state and (d) three-state representations (lower is better).

  5. Supplementary Fig. 2: Performance comparison using 80 synthetic data sets with doublets.

    (a) F-measure of the B-cubed metric used to assess feature allocation performance (higher is better). (b) Clone accuracy assessed by the maximum Hamming distance of a predicted clonal genotype to its nearest true clonal genotype in the three-state representation (lower is better).

  6. Supplementary Fig. 3: Difference between the number of true clusters and the number of clusters predicted by the D-SCG3 model.

    Data were simulated from the D-SCG3 model, with 100 data points and 10 replicate data sets per parameter setting, across a range of doublet probabilities and numbers of clusters.

  7. Supplementary Fig. 4: Copy number profile for the high grade serous ovarian cancer dataset.

    Red lines indicate major copy number and blue lines indicate minor copy number. Note that this tumour likely underwent a genome doubling early in its evolutionary history.

  8. Supplementary Fig. 5: Missing data in high grade serous ovarian cancer dataset.

    Proportion of missing values per cell for SNV events in the high grade serous ovarian cancer data set. Cells are grouped by cluster.
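
The graphical model described in the Figure 1d caption can be read as a generative recipe: draw clone prevalences π, assign each cell n a clone Zn, and emit each observation Xnm from the error profile ε of the genotype state Gkm of that clone. The Python sketch below simulates data under that reading. It is an illustration under stated assumptions, not the authors' implementation: the function name `simulate_scg`, the uniform prior over genotype states, the diagonal-heavy values of γ, the use of -1 for missing entries, and all default parameter values are hypothetical, and doublets are not modeled.

```python
import numpy as np


def simulate_scg(num_cells=100, num_loci=20, num_clones=3, num_states=3,
                 kappa=1.0, missing_rate=0.1, seed=0):
    """Draw one synthetic data set from a basic SCG-style generative model."""
    rng = np.random.default_rng(seed)

    # pi ~ Dirichlet(kappa): clone prevalences.
    pi = rng.dirichlet(np.full(num_clones, kappa))

    # G[k, m]: genotype state of locus m in clone k.
    # Assumption: states are drawn uniformly at random.
    G = rng.integers(0, num_states, size=(num_clones, num_loci))

    # gamma[s]: Dirichlet prior for the error profile of genotype state s.
    # Assumption: extra mass on the diagonal, so an observation usually
    # matches the underlying genotype state.
    gamma = np.ones((num_states, num_states)) + 9.0 * np.eye(num_states)
    eps = np.vstack([rng.dirichlet(gamma[s]) for s in range(num_states)])

    # Z[n] ~ Categorical(pi): clone of origin for cell n.
    Z = rng.choice(num_clones, size=num_cells, p=pi)

    # X[n, m] ~ Categorical(eps[G[Z[n], m]]): observed discrete state.
    X = np.empty((num_cells, num_loci), dtype=int)
    for n in range(num_cells):
        for m in range(num_loci):
            X[n, m] = rng.choice(num_states, p=eps[G[Z[n], m]])

    # Mark a random subset of entries as missing (encoded as -1 here).
    mask = rng.random((num_cells, num_loci)) < missing_rate
    X[mask] = -1

    return {"pi": pi, "G": G, "eps": eps, "Z": Z, "X": X}
```

Calling `simulate_scg(seed=1)` returns the latent quantities alongside the observed matrix `X`, which makes it possible to check a clustering method against known clone assignments.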

Accession codes

Primary accessions

Sequence Read Archive

Author information

Affiliations

  1. Department of Molecular Oncology, BC Cancer Agency, Vancouver, British Columbia, Canada.

    • Andrew Roth,
    • Andrew McPherson,
    • Emma Laks,
    • Justina Biele,
    • Damian Yap,
    • Adrian Wan,
    • Maia A Smith,
    • Cydney B Nielsen,
    • Samuel Aparicio &
    • Sohrab P Shah
  2. Graduate Bioinformatics Training Program, University of British Columbia, Vancouver, British Columbia, Canada.

    • Andrew Roth
  3. School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada.

    • Andrew McPherson
  4. Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, British Columbia, Canada.

    • Damian Yap,
    • Cydney B Nielsen,
    • Samuel Aparicio &
    • Sohrab P Shah
  5. Department of Gynecology and Obstetrics, University of British Columbia, Vancouver, British Columbia, Canada.

    • Jessica N McAlpine
  6. Department of Statistics, University of British Columbia, Vancouver, British Columbia, Canada.

    • Alexandre Bouchard-Côté
  7. Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, British Columbia, Canada.

    • Sohrab P Shah

Contributions

A.R., project conception, algorithm development, software implementation, and data analysis; S.A., A.M., E.L., J.B., D.Y., and A.W., single-nucleus sequencing; M.A.S. and C.B.N., data visualization; J.N.M., surgery, sample acquisition, and tumor banking; A.R., S.A., A.B.-C., and S.P.S., manuscript writing; S.P.S., project oversight and senior responsible author.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Figures

  1. Supplementary Figure 1: Performance comparison using 90 synthetic data sets without doublets. (96 KB)

    (a) Example synthetic data used for benchmarking. (b) V-measure metric used to assess clustering performance (higher is better). (c,d) Mean Hamming distance between the predicted and true genotypes of each cell in the (c) two-state and (d) three-state representations (lower is better).

  2. Supplementary Figure 2: Performance comparison using 80 synthetic data sets with doublets. (38 KB)

    (a) F-measure of the B-cubed metric used to assess feature allocation performance (higher is better). (b) Clone accuracy assessed by the maximum Hamming distance of a predicted clonal genotype to its nearest true clonal genotype in the three-state representation (lower is better).

  3. Supplementary Figure 3: Difference between the number of true clusters and number of clusters predicted by the D-SCG3 model. (51 KB)

    Data were simulated from the D-SCG3 model, with 100 data points and 10 replicate data sets per parameter setting, across a range of doublet probabilities and numbers of clusters.

  4. Supplementary Figure 4: Copy number profile for the high grade serous ovarian cancer dataset. (108 KB)

    Red lines indicate major copy number and blue lines indicate minor copy number. Note that this tumour likely underwent a genome doubling early in its evolutionary history.

  5. Supplementary Figure 5: Missing data in high grade serous ovarian cancer dataset. (47 KB)

    Proportion of missing values per cell for SNV events in the high grade serous ovarian cancer data set. Cells are grouped by cluster.
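
The two benchmark metrics named in the Supplementary Figure 1 caption, V-measure for clustering and mean Hamming distance for genotype prediction, can be sketched in a few lines of Python. This is a minimal illustration using the standard definition of V-measure (harmonic mean of homogeneity and completeness, equal weighting), not the evaluation code used in the paper. The function names are hypothetical, and the Hamming distance here is an unnormalized count of mismatched loci per cell; whether the authors normalize by the number of loci is not stated in the caption.

```python
from collections import Counter
from math import log


def _entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) for two equal-length label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g])
                for (g, _), c in joint.items())


def v_measure(true_labels, pred_labels):
    """Harmonic mean of homogeneity and completeness (higher is better)."""
    h_true = _entropy(true_labels)
    h_pred = _entropy(pred_labels)
    # By convention each term is 1 when its normalizing entropy is zero.
    homogeneity = (1.0 if h_true == 0
                   else 1.0 - _conditional_entropy(true_labels, pred_labels) / h_true)
    completeness = (1.0 if h_pred == 0
                    else 1.0 - _conditional_entropy(pred_labels, true_labels) / h_pred)
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


def mean_hamming_distance(true_genotypes, pred_genotypes):
    """Average number of loci per cell where prediction and truth differ."""
    per_cell = [sum(t != p for t, p in zip(t_row, p_row))
                for t_row, p_row in zip(true_genotypes, pred_genotypes)]
    return sum(per_cell) / len(per_cell)
```

V-measure depends only on how cells are grouped, so relabeling the predicted clusters leaves the score unchanged, while collapsing all cells into one cluster yields perfect completeness but zero homogeneity.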

PDF files

  1. Supplementary Text and Figures (1,833 KB)

    Supplementary Figures 1–5, Supplementary Notes 1–3, Supplementary Results and Supplementary Discussion.

Excel files

  1. Supplementary Table 1 (5,632 KB)

    Parameters used to generate synthetic data sets.

  2. Supplementary Table 2 (5,632 KB)

    P-values from Nemenyi test comparing clustering accuracy using V-measure metric.

  3. Supplementary Table 3 (5,632 KB)

    P-values from Nemenyi test comparing performance of genotype prediction using mean Hamming distance in two-state representation.

  4. Supplementary Table 4 (5,632 KB)

    P-values from Nemenyi test comparing performance of genotype prediction using mean Hamming distance in three-state representation.

  5. Supplementary Table 5 (79,872 KB)

    Clustering performance of methods on synthetic data sets without doublets.

  6. Supplementary Table 6 (50,688 KB)

    Genotyping performance of methods on synthetic data sets without doublets.

  7. Supplementary Table 7 (5,632 KB)

    P-values from Nemenyi test comparing feature allocation accuracy using B-cubed metric.

  8. Supplementary Table 8 (71,680 KB)

    Feature allocation performance of methods on data sets with doublets.

  9. Supplementary Table 9 (5,632 KB)

    P-values from Nemenyi test comparing maximum Hamming distance to nearest clone.

  10. Supplementary Table 10 (22,016 KB)

    Accuracy of predicted clonal genotypes of methods on data sets with doublets.

  11. Supplementary Table 11 (46,592 KB)

    Input data for CMM and SCG models for the childhood leukemia data set.

  12. Supplementary Table 12 (13,824 KB)

    Cluster assignments predicted by CMM3 model for the childhood leukemia data set.

  13. Supplementary Table 13 (13,824 KB)

    Cluster assignments predicted by SCG3 model for the childhood leukemia data set.

  14. Supplementary Table 14 (13,824 KB)

    Cluster assignments predicted by D-SCG3 model for the childhood leukemia data set.

  15. Supplementary Table 15 (5,632 KB)

    Predicted genotypes from CMM3 model of clusters with cells assigned for the childhood leukemia data set.

  16. Supplementary Table 16 (5,632 KB)

    Predicted genotypes from SCG3 model of clusters with cells assigned for the childhood leukemia data set.

  17. Supplementary Table 17 (5,632 KB)

    Predicted genotypes from D-SCG3 model of clusters with cells assigned for the childhood leukemia data set.

  18. Supplementary Table 18 (166 KB)

    Input data for D-SCG3 model for the HGSOC data set.

  19. Supplementary Table 19 (30,208 KB)

    Cluster assignments for the HGSOC data set using D-SCG3 model.

  20. Supplementary Table 20 (9,728 KB)

    Predicted genotypes of clusters with cells assigned for the HGSOC data set using D-SCG3 model.

  21. Supplementary Table 21 (9,728 KB)

    Predicted clone prevalences for the HGSOC data set using D-SCG3 model.
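
The Figure 3 caption reports clone prevalences as a posterior mean with error bars of ±1 s.d. Because the prior on the clone prevalence is a Dirichlet distribution, one plausible way to obtain such summaries is from the closed-form mean and variance of a Dirichlet posterior. The Python sketch below shows that calculation. It assumes the approximate posterior over π is itself Dirichlet with parameter vector `alpha`, which is the usual form under a conjugate mean-field update but is not confirmed by the text here; the function name and example values are hypothetical.

```python
import numpy as np


def dirichlet_mean_sd(alpha):
    """Mean and standard deviation of each component of Dirichlet(alpha)."""
    alpha = np.asarray(alpha, dtype=float)
    total = alpha.sum()
    mean = alpha / total
    # Var[pi_k] = alpha_k * (total - alpha_k) / (total^2 * (total + 1))
    var = alpha * (total - alpha) / (total ** 2 * (total + 1.0))
    return mean, np.sqrt(var)


# Illustration only: treat the cluster sizes from the Figure 3 caption as
# hard counts and add a hypothetical prior parameter of 1 per clone.
cluster_sizes = np.array([123, 102, 75, 37, 35, 20])
mean, sd = dirichlet_mean_sd(cluster_sizes + 1.0)
```

The standard deviation shrinks as the total count grows, so prevalence estimates from samples with few cells carry wider error bars.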

Zip files

  1. Supplementary Software (69,149 KB)

    Single cell genotyper model and simulation code.
