Abstract
Single-cell DNA sequencing has great potential to reveal the clonal genotypes and population structure of human cancers. However, single-cell data suffer from missing values and biased allelic counts as well as false genotype measurements owing to the sequencing of multiple cells. We describe the Single Cell Genotyper (https://bitbucket.org/aroth85/scg), open-source software based on a statistical model coupled with a mean-field variational inference method, which can be used to address these problems and robustly infer clonal genotypes.
Acknowledgements
We acknowledge generous long-term funding support from the BC Cancer Foundation. In addition, the S.P.S. and S.A. groups receive operating funds from the Canadian Breast Cancer Foundation, the Canadian Cancer Society Research Institute (impact grant 701584 to S.A. and S.P.S.), the Terry Fox Research Institute (PPG program on forme fruste tumors), Canadian Institutes for Health Research (CIHR) (grant MOP-115170 to S.A. and S.P.S.), CIHR Foundation (grant FDN-143246 to S.P.S.), and a CIHR new investigator grant (MSH-261515 to J.N.M.). A.R. is supported by a Frederick Banting and Charles Best CIHR doctoral scholarship. S.P.S. and S.A. are supported by Canada Research Chairs. S.P.S. is a Michael Smith Foundation for Health Research scholar. We thank V. Earle for artwork depicting anatomic sites sampled in the study.
Author information
Contributions
A.R., project conception, algorithm development, software implementation, and data analysis; S.A., A.M., E.L., J.B., D.Y., and A.W., single-nucleus sequencing; M.A.S. and C.B.N., data visualization; J.N.M., surgery, sample acquisition, and tumor banking; A.R., S.A., A.B.-C., and S.P.S., manuscript writing; S.P.S., project oversight and senior responsible author.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Performance comparison using 90 synthetic data sets without doublets.
(a) Example synthetic data used for benchmarking. (b) V-measure metric used to assess clustering performance (higher is better). (c,d) Mean Hamming distance between the predicted genotype of each cell and its true genotype in the two-state (c) and three-state (d) representations (lower is better).
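The two metrics named in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the benchmarking code used in the study: the function names are hypothetical, the V-measure uses the standard equal weighting of homogeneity and completeness, and the Hamming distance is assumed to be an unnormalized count of mismatched loci averaged over cells.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """Conditional entropy H(target | given) in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(true_labels, pred_labels):
    """Harmonic mean of homogeneity and completeness (equal weighting assumed)."""
    h_true = entropy(true_labels)
    h_pred = entropy(pred_labels)
    # By convention each term is 1 when the corresponding entropy is zero.
    homogeneity = 1.0 if h_true == 0 else 1.0 - conditional_entropy(true_labels, pred_labels) / h_true
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(pred_labels, true_labels) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


def mean_hamming(pred_genotypes, true_genotypes):
    """Mean number of mismatched loci per cell (assumption: not normalized by locus count)."""
    total = sum(
        sum(p != t for p, t in zip(pred, true))
        for pred, true in zip(pred_genotypes, true_genotypes)
    )
    return total / len(pred_genotypes)
```

For example, `v_measure([0, 0, 1, 1], [1, 1, 0, 0])` scores a relabelled but otherwise perfect clustering as 1, because the metric is invariant to cluster label names.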
Supplementary Figure 2 Performance comparison using 80 synthetic data sets with doublets.
(a) F-measure of the B-cubed metric used to assess feature allocation performance (higher is better). (b) Clone accuracy assessed by the maximum Hamming distance of a predicted clonal genotype to its nearest true clonal genotype in the three-state representation (lower is better).
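The clone accuracy measure in panel (b) can be read as a worst-case nearest-neighbour distance: each predicted clonal genotype is matched to its closest true clonal genotype, and the largest of those distances is reported. A minimal sketch under that reading follows; the function names are hypothetical, and genotypes are assumed to be equal-length sequences of states coded 0, 1, 2.

```python
def hamming(a, b):
    """Number of loci at which two genotypes differ."""
    if len(a) != len(b):
        raise ValueError("genotypes must have the same number of loci")
    return sum(x != y for x, y in zip(a, b))


def max_distance_to_nearest_clone(pred_clones, true_clones):
    """Largest distance from any predicted clone to its nearest true clone."""
    return max(
        min(hamming(pred, true) for true in true_clones)
        for pred in pred_clones
    )


# Hypothetical three-state genotypes for illustration only.
predicted = [(0, 1, 2), (2, 2, 0)]
truth = [(0, 1, 1), (2, 2, 0), (1, 1, 1)]
worst = max_distance_to_nearest_clone(predicted, truth)
```

A value of 0 would mean every predicted clone exactly matches some true clone. Note that the measure does not penalize true clones that no prediction recovers.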
Supplementary Figure 3 Difference between the number of true clusters and number of clusters predicted by the D-SCG3 model.
Data were simulated from the D-SCG3 model with 100 data points and 10 replicate data sets per parameter setting. We simulated data across a range of doublet probabilities and numbers of clusters.
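To make the simulation setup concrete, the sketch below generates noise-free cells with a given doublet probability and number of clusters. It is not the D-SCG3 generative model: the function names, the uniform sampling of clones and genotype states, and the rule that a doublet is heterozygous (state 1) wherever its two source clones disagree are all assumptions made for illustration, and measurement error and missing values are omitted.

```python
import random


def combine(g1, g2):
    """Assumed doublet rule: keep the shared state, otherwise call the locus heterozygous (1)."""
    return [a if a == b else 1 for a, b in zip(g1, g2)]


def simulate_cells(num_cells, num_clusters, num_loci, doublet_prob, seed=0):
    """Return (clones, cells, sources) for a toy three-state data set."""
    rng = random.Random(seed)
    # Each clone gets an independent random genotype over states 0, 1, 2.
    clones = [
        [rng.choice((0, 1, 2)) for _ in range(num_loci)]
        for _ in range(num_clusters)
    ]
    cells, sources = [], []
    for _ in range(num_cells):
        if num_clusters > 1 and rng.random() < doublet_prob:
            # Doublet: merge the genotypes of two distinct clones.
            i, j = rng.sample(range(num_clusters), 2)
            cells.append(combine(clones[i], clones[j]))
            sources.append((i, j))
        else:
            # Singlet: copy the genotype of one clone.
            i = rng.randrange(num_clusters)
            cells.append(list(clones[i]))
            sources.append((i,))
    return clones, cells, sources


# 100 data points, as in the caption; the other settings are arbitrary.
clones, cells, sources = simulate_cells(100, 4, 20, doublet_prob=0.2)
```

Sweeping `doublet_prob` and `num_clusters` over a grid, with several seeds per setting, would mirror the replicate structure described in the caption.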
Supplementary Figure 4 Copy number profile for the high-grade serous ovarian cancer data set.
Red lines indicate major copy number and blue lines indicate minor copy number. Note that this tumour likely underwent a genome doubling early in its evolutionary history.
Supplementary Figure 5 Missing data in the high-grade serous ovarian cancer data set.
Proportion of missing values per cell for SNV events in the high-grade serous ovarian cancer data set. Cells are grouped by cluster.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–5, Supplementary Notes 1–3, Supplementary Results and Supplementary Discussion. (PDF 1790 kb)
Supplementary Table 1
Parameters used to generate synthetic data sets. (XLS 5 kb)
Supplementary Table 2
P-values from Nemenyi test comparing clustering accuracy using V-measure metric. (XLS 5 kb)
Supplementary Table 3
P-values from Nemenyi test comparing performance of genotype prediction using mean Hamming distance in two-state representation. (XLS 5 kb)
Supplementary Table 4
P-values from Nemenyi test comparing performance of genotype prediction using mean Hamming distance in three-state representation. (XLS 5 kb)
Supplementary Table 5
Clustering performance of methods on synthetic data sets without doublets. (XLS 78 kb)
Supplementary Table 6
Genotyping performance of methods on synthetic data sets without doublets. (XLS 49 kb)
Supplementary Table 7
P-values from Nemenyi test comparing feature allocation accuracy using B-cubed metric. (XLS 5 kb)
Supplementary Table 8
Feature allocation performance of methods on data sets with doublets. (XLS 70 kb)
Supplementary Table 9
P-values from Nemenyi test comparing maximum Hamming distance to nearest clone. (XLS 5 kb)
Supplementary Table 10
Accuracy of predicted clonal genotypes of methods on data sets with doublets. (XLS 21 kb)
Supplementary Table 11
Input data for CMM and SCG models for the childhood leukemia data set. (XLS 45 kb)
Supplementary Table 12
Cluster assignments predicted by CMM3 model for the childhood leukemia data set. (XLS 13 kb)
Supplementary Table 13
Cluster assignments predicted by SCG3 model for the childhood leukemia data set. (XLS 13 kb)
Supplementary Table 14
Cluster assignments predicted by D-SCG3 model for the childhood leukemia data set. (XLS 13 kb)
Supplementary Table 15
Predicted genotypes from CMM3 model of clusters with cells assigned for the childhood leukemia data set. (XLS 5 kb)
Supplementary Table 16
Predicted genotypes from SCG3 model of clusters with cells assigned for the childhood leukemia data set. (XLS 5 kb)
Supplementary Table 17
Predicted genotypes from D-SCG3 model of clusters with cells assigned for the childhood leukemia data set. (XLS 5 kb)
Supplementary Table 18
Input data for D-SCG3 model for the HGSOC data set. (XLS 162 kb)
Supplementary Table 19
Cluster assignments for the HGSOC data set using D-SCG3 model. (XLS 29 kb)
Supplementary Table 20
Predicted genotypes of clusters with cells assigned for the HGSOC data set using D-SCG3 model. (XLS 9 kb)
Supplementary Table 21
Predicted clone prevalences for the HGSOC data set using D-SCG3 model. (XLS 9 kb)
Supplementary Software
Single cell genotyper model and simulation code. (ZIP 67 kb)
About this article
Cite this article
Roth, A., McPherson, A., Laks, E. et al. Clonal genotype and population structure inference from single-cell tumor sequencing. Nat Methods 13, 573–576 (2016). https://doi.org/10.1038/nmeth.3867
DOI: https://doi.org/10.1038/nmeth.3867
This article is cited by
- Conifer: clonal tree inference for tumor heterogeneity with single-cell and bulk sequencing data. BMC Bioinformatics (2021)
- PyClone-VI: scalable inference of clonal population structures using whole genome data. BMC Bioinformatics (2020)
- Machine learning approaches to drug response prediction: challenges and recent progress. npj Precision Oncology (2020)
- Cardelino: computational integration of somatic clonal substructure and single-cell transcriptomes. Nature Methods (2020)
- Learning mutational graphs of individual tumour evolution from single-cell and multi-region sequencing data. BMC Bioinformatics (2019)