PyClone: statistical inference of clonal population structure in cancer

Journal name:
Nature Methods
Volume:
11
Pages:
396–398
Year published:
DOI:
doi:10.1038/nmeth.2883

We introduce PyClone, a statistical model for inference of clonal population structures in cancers. PyClone is a Bayesian clustering method for grouping sets of deeply sequenced somatic mutations into putative clonal clusters while estimating their cellular prevalences and accounting for allelic imbalances introduced by segmental copy-number changes and normal-cell contamination. Single-cell sequencing validation demonstrates PyClone's accuracy.
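The allelic-imbalance adjustment mentioned above can be illustrated with a minimal sketch. It is not the PyClone likelihood: it assumes that normal cells carry no variant copies, that all tumour cells share one total copy number at the locus, and that sequencing error is negligible. The function name and all parameter names are hypothetical.

```python
def expected_vaf(prevalence, tumour_content, total_cn, variant_cn, normal_cn=2):
    """Expected variant allele fraction under a simplified mixture (assumption).

    prevalence     -- fraction of tumour cells carrying the mutation (0..1)
    tumour_content -- fraction of sampled cells that are tumour cells (0..1)
    total_cn       -- total copy number of the locus in tumour cells
    variant_cn     -- mutated copies per mutation-bearing tumour cell
    normal_cn      -- copy number of the locus in normal cells
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie in [0, 1]")
    if not 0.0 <= tumour_content <= 1.0:
        raise ValueError("tumour_content must lie in [0, 1]")
    if not 0 <= variant_cn <= total_cn:
        raise ValueError("variant_cn must lie in [0, total_cn]")

    # Mutated copies come only from tumour cells that carry the mutation.
    variant_copies = tumour_content * prevalence * variant_cn
    # Every cell, normal or tumour, contributes copies of the locus.
    all_copies = (1.0 - tumour_content) * normal_cn + tumour_content * total_cn
    if all_copies == 0:
        raise ValueError("locus has no copies in the sample")
    return variant_copies / all_copies
```

Under these assumptions, a clonal heterozygous mutation in a pure diploid tumour gives an expected fraction of 0.5, whereas the same mutation present in half of the tumour cells of a sample with 80% tumour content gives about 0.2. The same observed fraction can therefore correspond to different cellular prevalences once copy number and contamination vary.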

Figures

    Figure 1: Comparison of clustering performance for the mixture of normal-tissue data sets.

    (a,b) Comparison of methods (10 replicates of 37 mutations) when analyzing a four-mixture experiment separately (a) or jointly (b). IBMM, infinite binomial (Bin) mixture model; IBBMM, infinite beta-binomial (BeBin) mixture model; TCN, with total copy-number priors; PCN, with parental copy-number priors. (c) Expected cellular prevalence of each cluster across the four-mixture experiments. (d,e) Inferred cellular prevalences and clustering using the IBBMM model (d) and PyClone BeBin-PCN model (e). Clusters with homozygous single-nucleotide variants (SNVs) or an equal number of homo- and heterozygous SNVs are indicated by solid lines; clusters with heterozygous SNVs, by dashed lines. (f) Variant allelic prevalence for mutations assigned to cluster 1 by the PyClone BeBin-PCN model. Dashed lines represent heterozygous SNVs; solid lines represent homozygous SNVs. Whiskers (a,b) indicate 1.5× the interquartile range, red bars indicate the median and boxes represent the interquartile range. Error bars (d,e) indicate the mean ± s.d. (using 9,000 post-'burn-in' samples; Online Methods) of Markov chain Monte Carlo–derived cellular prevalence estimates over all mutations in a cluster. (d–f) The number of mutations n in each cluster is shown in the legends in parentheses.

    Figure 2: Joint analysis of multiple samples from high-grade serous ovarian cancer 2.

    (a,b) Left, variant allelic prevalence for each mutation, color coded by predicted cluster, using IBBMM (a) and PyClone with the BeBin-PCN model (b) to jointly analyze the four samples. Right, inferred cellular prevalence for each cluster using IBBMM (a) and BeBin-PCN methods (b). As in Figure 1, the cellular prevalence of the cluster is the mean value of the cellular prevalence of mutations in the cluster. (c) Presence or absence of variant alleles at target loci in single cells from sample B. Loci with fewer than 40 reads covering them are colored gray. Predicted clusters for each method are shown on the left, with white cells indicating nonsomatic control positions. Row labels indicate hg19 chromosome and chromosome coordinate separated by a colon. (d) Presence or absence of IBBMM clusters in single cells from sample B. Clusters were deemed present if any mutation in the cluster was present. gDNA, bulk genomic DNA control. Error bars (a,b) indicate the mean ± s.d. (using 50,000 post-'burn-in' samples) of Markov chain Monte Carlo–derived cellular prevalence estimates over all mutations in a cluster. The number of mutations n in each cluster is shown in the legends in parentheses.
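The presence rule stated in the Figure 2 caption can be sketched as follows. This is an illustration only: it assumes that per-locus presence calls are already available, and that loci below the 40-read coverage threshold are ignored when deciding whether a cluster is present (the caption says only that such loci are colored gray). The function names and the input layout are hypothetical.

```python
MIN_DEPTH = 40  # coverage threshold taken from the Figure 2 caption


def mask_low_coverage(present, depth, min_depth=MIN_DEPTH):
    """Return None for loci with too few reads, else the presence call."""
    return None if depth < min_depth else bool(present)


def cluster_present(loci, min_depth=MIN_DEPTH):
    """Decide whether a cluster is present in one single cell.

    loci -- iterable of (present, depth) pairs, one per mutation in the cluster.
    A cluster counts as present if any sufficiently covered mutation is present
    (ignoring low-coverage loci is an assumption of this sketch).
    """
    calls = (mask_low_coverage(p, d, min_depth) for p, d in loci)
    return any(call is True for call in calls)
```

For example, a cluster whose only positive locus has 10 reads would be reported as absent under this sketch, whereas a single positive locus with 45 reads is enough to report it as present.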


Author information

Affiliations

  1. Bioinformatics Graduate Program, University of British Columbia, Vancouver, British Columbia, Canada.

    • Andrew Roth &
    • Gavin Ha
  2. Department of Molecular Oncology, British Columbia Cancer Research Centre, Vancouver, British Columbia, Canada.

    • Andrew Roth,
    • Jaswinder Khattra,
    • Damian Yap,
    • Adrian Wan,
    • Emma Laks,
    • Justina Biele,
    • Gavin Ha,
    • Samuel Aparicio &
    • Sohrab P Shah
  3. Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, British Columbia, Canada.

    • Samuel Aparicio &
    • Sohrab P Shah
  4. Department of Statistics, University of British Columbia, Vancouver, British Columbia, Canada.

    • Alexandre Bouchard-Côté

Contributions

Project conception and oversight: S.P.S., S.A., A.R.; method development: A.R., A.B.-C., S.P.S.; implementation and benchmarking: A.R.; manuscript writing and editing, study design and execution: A.R., A.B.-C., S.P.S., S.A.; single-cell sequencing: J.K., D.Y., A.W., E.L., J.B.; data analysis and interpretation: G.H.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (5,499 KB)

    Supplementary Figures 1–14, Supplementary Results, Supplementary Discussion and Supplementary Note

Excel files

  1. Supplementary Table 1 (51 KB)

    Allelic counts and IBBMM and PyClone PCN cellular prevalence estimates for mutations in high-grade serous ovarian cancer case 2. Copy-number predictions were inferred using PICNIC as described in the Online Methods. Cellular prevalences were computed by taking the mean of the post-burn-in trace of the cellular prevalences for the respective methods. The standard deviation of the cellular prevalence parameter estimated from the post-burn-in trace is also included. Cluster IDs (last two columns) were predicted from the post-burn-in trace using the MPEAR clustering criterion as described in the Online Methods and Supplementary Note. Mutation IDs list gene name, chromosome and chromosome coordinate. All coordinates are in the hg19 coordinate system.

  2. Supplementary Table 2 (41 KB)

    Allelic counts and IBBMM and PyClone PCN cellular prevalence estimates for mutations in high-grade serous ovarian cancer case 1. Copy-number predictions were inferred using PICNIC as described in the Online Methods. Cellular prevalences were computed by taking the mean of the post-burn-in trace of the cellular prevalences for the respective methods. The standard deviation of the cellular prevalence parameter estimated from the post-burn-in trace is also included. Cluster IDs (last two columns) were predicted from the post-burn-in trace using the MPEAR clustering criterion as described in the Online Methods and Supplementary Note. Mutation IDs list gene name, chromosome and chromosome coordinate. All coordinates are in the hg19 coordinate system.
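The per-mutation summaries described for both tables (mean and standard deviation of the post-burn-in trace) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name is hypothetical, the trace is assumed to be a plain list of sampled cellular prevalences, and the use of the sample (n − 1) standard deviation is an assumption, since the table descriptions do not say which estimator was used.

```python
from statistics import mean, stdev


def summarize_trace(trace, burn_in):
    """Return (mean, s.d.) of the samples kept after discarding burn-in.

    trace   -- sequence of MCMC samples of one mutation's cellular prevalence
    burn_in -- number of initial samples to discard
    """
    if burn_in < 0:
        raise ValueError("burn_in must be non-negative")
    kept = list(trace[burn_in:])
    if len(kept) < 2:
        raise ValueError("need at least two post-burn-in samples")
    # stdev is the sample standard deviation (assumption; see lead-in).
    return mean(kept), stdev(kept)


# Toy trace: the first two samples are treated as burn-in and discarded.
toy_mean, toy_sd = summarize_trace([0.9, 0.1, 0.5, 0.4, 0.6, 0.5], burn_in=2)
```

Discarding the early samples keeps the summary from being pulled toward the arbitrary starting state of the chain.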
