Inferring clonal composition from multiple tumor biopsies

Manica, Matteo; Kim, Hyunjae Ryan; Mathis, Roland; Chouvarine, Philippe; Rutishauser, Dorothea; De Vargas Roditi, Laura; Szalai, Bence; Wagner, Ulrich; Oehl, Kathrin; Saba, Karim; Pati, Arati; Saez-Rodriguez, Julio; Roy, Angshumoy; Parsons, Donald W.; Wild, Peter J.; Martínez, María Rodríguez; Sumazin, Pavel

doi:10.1038/s41540-020-00147-5

Download PDF

Article
Open access
Published: 25 August 2020

Inferring clonal composition from multiple tumor biopsies

Matteo Manica ORCID: orcid.org/0000-0002-8872-0269^1,2,
Hyunjae Ryan Kim³,
Roland Mathis¹,
Philippe Chouvarine³,
Dorothea Rutishauser⁴,
Laura De Vargas Roditi⁴,
Bence Szalai⁵,
Ulrich Wagner⁴,
Kathrin Oehl⁴,
Karim Saba ORCID: orcid.org/0000-0002-1172-8123⁴,
Arati Pati³,
Julio Saez-Rodriguez ORCID: orcid.org/0000-0002-8552-8976^5,6,
Angshumoy Roy³,
Donald W. Parsons³,
Peter J. Wild⁷,
María Rodríguez Martínez ORCID: orcid.org/0000-0003-3766-4233¹ &
…
Pavel Sumazin ORCID: orcid.org/0000-0002-1215-4977³

npj Systems Biology and Applications volume 6, Article number: 27 (2020) Cite this article

2950 Accesses
4 Citations
13 Altmetric
Metrics details

Subjects

Abstract

Knowledge about the clonal evolution of a tumor can help to interpret the function of its genetic alterations by identifying initiating events and events that contribute to the selective advantage of proliferative, metastatic, and drug-resistant subclones. Clonal evolution can be reconstructed from estimates of the relative abundance (frequency) of subclone-specific alterations in tumor biopsies, which, in turn, inform on its composition. However, estimating these frequencies is complicated by the high genetic instability that characterizes many cancers. Models for genetic instability suggest that copy number alterations (CNAs) can influence mutation-frequency estimates and thus impede efforts to reconstruct tumor phylogenies. Our analysis suggested that accurate mutation frequency estimates require accounting for CNAs—a challenging endeavour using the genetic profile of a single tumor biopsy. Instead, we propose an optimization algorithm, Chimæra, to account for the effects of CNAs using profiles of multiple biopsies per tumor. Analyses of simulated data and tumor profiles suggested that Chimæra estimates are consistently more accurate than those of previously proposed methods and resulted in improved phylogeny reconstructions and subclone characterizations. Our analyses inferred recurrent initiating mutations in hepatocellular carcinomas, resolved the clonal composition of Wilms’ tumors, and characterized the acquisition of mutations in drug-resistant prostate cancers.

Quantifying the influence of mutation detection on tumour subclonal reconstruction

Article Open access 07 December 2020

Lydia Y. Liu, Vinayak Bhandari, … Paul C. Boutros

Power and pitfalls of computational methods for inferring clone phylogenies and mutation orders from bulk sequencing data

Article Open access 26 February 2020

Sayaka Miura, Tracy Vu, … Sudhir Kumar

Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data

Article Open access 21 June 2019

Salem Malikic, Katharina Jahn, … Niko Beerenwinkel

Introduction

Pan-cancer tumor profiling has identified recurrent alterations that are associated with tumor etiology at the loci of thousands of genes but the interpretation of genetic alterations remains a major challenge^1,2,3. Knowledge about the clonal evolution of tumors can point to genetic alterations that both contribute to tumorigenesis, indicate prognostically relevant intratumoral variability, and point to refractory tumor subclones^4,5. Specifically, clonal evolution—depicted as a phylogenetic tree in Fig. 1a—can help to identify alterations that play a role in tumor initiation as well as those that confer a selective advantage to altered tumor cells. Moreover, information about its subclone composition is important for predicting cancer’s potential for drug resistance and metastasis, which vary across tumor subclones⁶ and are the key determinants of patient outcome. Consequently, tumor-subclone characterization is essential for designing personalized therapies that target all tumor subclones and may hold the key to predicting tumor progression, metastases, drug sensitivity, and patient outcome.

**Fig. 1: A simulated footprint of the clonal evolution of a tumor, as observed in genetic profiles of multiple biopsies.**

Current methods that rely on DNA-profiling to reconstruct clonal evolution of tumors can be classified into two categories: methods that primarily rely on single-cell profiles^7,8,9,10 and those that computationally resolve mixtures of subclones from molecular profiles of bulk tumor cells, i.e., profiles of pools of cells that originate from a common malignant lesion^11,12,13,14. Single-cell DNA sequencing can produce more definitive estimates of the proportion (frequencies) of tumor cells that contain each genetic alteration and more complete profiles of tumor subclones, including information about the co-occurrence of alterations within each subclone. Its primary disadvantage is operational: the availability of high-quality tumor samples that permit single-cell isolation and profiling as well as the accuracy and cost associated with parallel sequencing DNA from a multitude of cells per tumor. Moreover, improving the accuracy of single-cell mutation profiling remains challenging due to limited material availability in single cells¹⁵; this is not likely to improve as future sequencing technologies focus on profiling formalin-fixed paraffin-embedded (FFPE) tumor samples^16,17. Alternatively, single-cell RNA or protein profiling can help indicate tumor subclones, but these assays may not directly point to key driving genetic alterations.

Focusing on single-nucleotide somatic variants (SNVs; or simply mutations), we sought to reconstruct clonal evolution from DNA profiles of genetically unstable cancers. This entails deconvolving mutation frequencies, mutation-subclone associations, and CNAs from DNA profiles—including both whole-exome sequencing (WES) and panel-based (targeted) sequencing assays—that produce average estimates across cellular ensembles (Fig. 1b–d). One approach to improve the accuracy of these deconvolutions is to profile multiple biopsies from the same tumor across time points¹⁸ or across regions^6,19. This approach relies on two key assertions: (1) that genetic alterations that are specific to the same tumor subclone will co-occur with the same frequency across biopsies, and (2) that the clonal composition of heterogeneous regions varies, i.e., multiple sampling will allow for the aggregation and deconvolution of the frequencies of most mutations with improved power. It is important to note that mutations that underwent convergent evolution²⁰ do not violate these assertions and will not be aggregated with other mutations from the same tumor subclone because of differing frequency estimates across biopsies. It is also important to note that accurate deconvolution must account for tumor purity, and our efforts—including production of simulated data to compare leading methods and analyses of tumor profiles—account for differences tumor purity across samples.

A central challenge for estimating mutation frequencies in tumors with unstable genomes is accounting for the effects of CNAs that can alter mutated-read fractions. These are observed in profiles of tumor biopsies that are composed of tumor cells with wild-type and mutated alleles as well as tumor-adjacent cells (Fig. 1e, f). In turn, inaccurate mutation-frequency estimates can contribute to erroneous associations between mutations and tumor subclones as well as errors in phylogeny reconstructions (Fig. 1g, h). We describe the mutation-frequency inference problem as that of inferring tumor subclone frequencies and associating mutations with subclones. Consequently, we describe the tumor-phylogeny reconstruction problem as that of inferring ancestral relations between tumor subclones. The main challenges for addressing the mutation-frequency inference problem are to aggregate co-occurring mutations across biopsies, estimate the frequency of each aggregate in every biopsy, and identify partial orders across aggregates that are consistent across biopsies. When viewed this way, each tumor subclone could be associated with a frequency vector that describes the proportion of cells containing its mutations in each biopsy. Ancestral order between two subclones could then be established based on (probabilistic) comparisons between their corresponding mutation frequencies. Ancestral order inference requires confident frequency assignment to the majority of mutations based on observed mutated-read fractions, and inference methods can be compared based on the number of mutations with frequency estimates, the accuracy of these estimates, and their accuracy at aggregating sister mutations that initiate the same clones.

We studied the mutation-frequency inference problem as a function of genetic instability and proposed the inference method Chimæra to improve these estimates and subsequent phylogeny reconstruction. Chimæra uses an optimization process to resolve the parameters of a natural model for the effects of CNAs on mutated-read fractions and is unique in its emphasis on the simultaneous inference of mutation frequencies and CNAs. We report on a comparison of Chimæra’s accuracy to that of other mutation-frequency and heterogeneity inference methods on simulated DNA-profiling data of genetically unstable tumors and on biopsy subsets of thirteen tumors, including liver, kidney, and prostate cancers^21,22,23. Each of these tumors was profiled in 4–10 regions, and three of the prostate cancers profiled were surveyed across multiple time points. We showed that Chimæra’s inferences can be used to identify the key mutations that are associated with increased subclone proliferation, drug response, and tumor grade, as well as to infer the ancestral relations between tumor subclones that harbor these mutations.

Results

We describe the results of our efforts to evaluate inference method accuracy on simulated data and to reconstruct phylogenies based on WES and targeted sequencing assays of tumor biopsies. Our analyses highlighted the challenges in inferring cancer mutation frequencies from these assays and the benefits of methods that rely on profiles of multiple biopsies per tumor to characterize tumor subclones and tumor evolution by tracking mutation aggregates.

Simulation of DNA profiling data

We used phylogeny models—with sizes ranging from three to twelve tumor subclones, twenty to fifty somatic mutations per subclone, and varying degrees of genetic instability—to generate simulated DNA profiles. ABSOLUTE²⁴, AncesTree¹², EXPANDS²⁵, PhyloWGS^14,26, SCHISM¹³, and Chimæra were then used to reconstruct phylogenies based on simulated data. Each method inferred ancestral relations between mutation pairs, and errors were estimated as the combined frequencies of false-positive and false-negative predictions. ABSOLUTE infers tumor purity and malignant cell ploidy directly from the analysis of somatic DNA alterations by fitting estimates of copy-ratio of both homologous chromosomes with a Gaussian mixture model, where components were centered at the discrete concentration-ratios implied by an initial frequentist estimation. AncesTree characterizes the clonal evolution of tumors based on the probabilistic model for errors in observed read fractions and infers phylogeny matrices using integer linear programming. EXPANDS clusters mutations based on their cell-frequency probability distributions; clusters are next extended by members with similar distributions and pruned based on statistical confidence by comparing the cluster maxima and peaks observed outside the core region. PhyloWGS reconstructs phylogenies based on a model for simple somatic mutations in addition to a correction for CNAs, all based on a single biopsy per tumor. SCHISM takes as input mutation cellularity estimations and mutation clustering inferred by other methods and uses a generalized likelihood ratio to infer lineage precedence and lineage divergence. A genetic algorithm is then used to build phylogenetic trees.

Accuracy of mutation-frequency estimation based on simulated data

Our initial efforts to compare accuracy between methods based on analyses of phylogenies of size three revealed variable success rates, with some methods showing consistently poor accuracy. EXPANDS and PhyloWGS, which were designed to reconstruct phylogenies using profiles of one biopsy per tumor, and ABSOLUTE, which is best known and most effective for estimating tumor purity, had consistently poor accuracy in our simulations—with results statistically indistinguishable from random inferences. SCHISM and AncesTree had better or comparable performance than these three methods in every simulated instance. For example, the magnitude of frequency inference errors by ABSOLUTE, which processed profiles of multiple biopsies per tumor, were more than double those of SCHISM and analyses required manual parameter optimization. However, ABSOLUTE had good accuracy for inferring tumor purity in our synthetic data. SCHISM and AncesTree do not explicitly account for the full range of observed CNAs in tumors, but they were accurate in 100% of our tested instances with three tumor subclones. Consequently, we focused on accuracy comparisons between inferences by SCHISM, AncesTree, and Chimæra on phylogenies composed of 6–12 tumor subclones. Moreover, our analysis suggested that more than one biopsy per tumor was required to accurately approximate mutation frequencies and CNAs at these mutation loci.

We compared the accuracy of SCHISM, AncesTree, and Chimæra on phylogenies that were adapted from a precompiled library that was generated both manually and using CITUP²⁷; see Fig. 2b and Table S1 for representative phylogenies. Each somatic mutation was associated with a trio of copy numbers—δ^s, $\delta _{\mathrm{w}}^{\mathrm{s}}$, and $\delta _{\mathrm{m}}^{\mathrm{s}}$ (Fig. 2a)—that were taken from truncated normal distributions with means μ ∈ {1, 2, 3}, where μ = 1 corresponds to no copy number changes and standard deviation σ ∈ {0, 1, 2, 3}; σ = 0. was used only when μ = 1. The resulting copy numbers modeled a range of genetic instability conditions that were in line with observed CNAs in TCGA-profiled prostate, hepatocellular, breast carcinomas (HCC and BRCA in Fig. 2d, e); we assumed no linkage between simulated CNAs of any mutations, and we omitted prostate adenocarcinoma (PRAD) curves from Fig. 2 for readability. In addition, we added up to 10% of wildtype reads for all simulated mutations to account for the potential inclusion of nontumor cells in biopsied samples (WT subclone in Fig. 1a). Total coverage for each allele—i.e., the number of reads covering both the wildtype and mutated variants of a specific nucleotide—was taken by sampling mutation coverage values from prostate and Wilms’ tumor biopsies profiled here. Finally, once idealized counts were available for both mutated and wildtype alleles, we simulated duplication or loss of up to 5% of the observations according to a uniform distribution. To simulate multiple regions per tumor, we repeated each biopsy simulation using the same simulation parameters but with distinct cellular composition vectors to produce simulated profiles of 6–12 biopsies per tumor, as depicted in Fig. 2c. The availability of six biopsies per tumor increases the likelihood that mutations can be aggregated and subclone mutation frequencies can be compared to infer ancestral relations and are in line with pre-existing datasets, including those reported here. The selection of six biopsies is a compromise between clinical feasibility and the power needed to infer mutation frequencies and phylogenies.

**Fig. 2: Our model for the effects of copy number alterations on mutated-read fractions in DNA profiles.**

AncesTree accepts no external input when estimating mutation frequencies, but SCHISM can be guided by externally inferred mutation frequencies and clusters. SCHISM’s implementation includes its own selected clustering methods, and these were also used to compare accuracy. Chimæra can also be guided by externally inferred mutation frequencies and clusters, but by default, it uses a clustering approach modeled after hdbscan²⁸. When comparing SCHISM and Chimæra performance on synthetic data, we clustered mutations using their native clustering approaches and with tclust²⁹ optimization subroutines including ElbowSSE, Entropy, GMD, Mclust, and SDIndex^30,31. We compared the accuracy of methods and pipelines on 2000 simulated assays, including both with and without modeled genetic instability by varying mutation copy numbers. The accuracy of SCHISM estimates was better, on average, than that of AncesTree, but it was relatively sensitive to clustering optimization methods, with SDIndex outperforming other methods, including SCHISM’s native implementation. Comparatively, Chimæra estimates were less dependent on clustering methods and significantly outperformed estimates by SCHISM with SDIndex (p < 1E−16 by U-test).

When using its native clustering approach Chimæra exhibited lower accuracy than implementations using tclust (Fig. 3a), but it estimated frequencies for a significantly larger number of mutations (Fig. 3b). The number of mutations with no frequency estimates by Chimæra was 2-fold less than that of the next best method, which allowed for dramatically improved phylogeny reconstruction in both simulated data and profiled cancers. Inference accuracy, for both SCHISM and Chimæra, was anticorrelated with copy number variability across biopsies. We used CNA variability as a surrogate for genetic instability and quantified it using the coefficient of variation of mutation copy numbers across biopsies, which followed truncated normal distributions (Fig. 3c). However, while Chimæra inferences were affected by copy number variability, they were independent of the actual magnitude of CNAs (Fig. 3d). This, in turn, suggested that instability across biopsies is a key challenge for estimating mutation frequencies. We note that both the SCHISM and tclust-based Chimæra pipelines failed to cluster 40% of mutations in our synthetic data. Moreover, while Chimæra assigned frequencies to all clustered mutations and made predictions for each simulate cancer, SCHISM did not successfully estimate mutation frequencies for some simulated genomes and cancer profiles. To account for this, accuracy comparisons in Fig. 3 relied on only those mutations that had assigned frequencies by all methods. In total, our analysis suggests that mutation frequency estimation is more challenging for genomes with high CNA variability (Fig. 3c, d). All data—including supplementary tables and analyses—are available at Chimæra’s GitHub repository.

**Fig. 3: Performance of mutation frequency estimates on simulated data.**

The number of regions profiled per tumor dictates algorithm convergence

Chimæra requires at least two profiled regions per tumor for making tumor subclone predictions, and while SCHISM can predict tumor subclones based on a single region, its accuracy improves with the number of profiled regions per tumor. To test the benefit of profile-multiplicity per tumor, we compared Chimæra and SCHISM analyses of multiregion WES profiles of 13 tumors using only subsets of the available tumor profiles for each analysis. Tumor profiles included profiles of nine hepatocellular carcinomas (HCCs)³², three high-risk Wilms’ tumors, and a castrate-resistant prostate cancer (CRPC). Each tumor was profiled in 5–10 distinct regions, and we compared the number of subclones detected by each method in each multiregion subset as a function of the number of regions, with a minimum of two WES profiles per tumor.

All HCCs were profiled in five regions, which permitted testing predictions in 2-size, 3-size, and 4-size subsets of the assays for each tumor; Wilms’ tumors were profiled in six and eight regions; and the CRPC in ten regions. Our results (Fig. 4) suggested that, for most HCCs and Wilms’ tumors, Chimæra-analysis of four tumor regions resulted in a similar number of subclones as profiling five tumor areas, however, profiling two tumor areas was insufficient for predicting tumor subclones accurately with Chimæra. Predictions for our CRPC, which had the highest genomic instability and mutation burden of the 13 tumors, largely converged with seven profiled regions. Interestingly, because mutations that were clustered together by Chimæra were associated with sufficiently different frequencies by SCHISM, the number of subclones predicted by SCHISM were often dramatically greater, and SCHISM analyses did not produce subclone predictions for some tumors and often required more regions to converge (Fig. 4b). A detailed description of tumor subclone predictions in each tumor context follows. All data and analyses are given in Table S2.

**Fig. 4: The number of predicted subclones depends on the number of profiled regions per tumor.**

Phylogeny inference in HCC

HCCs are high-risk liver tumors that are known to have high genetic instability²³. We used Chimæra to infer mutation frequencies and ancestral relations between HCC subclones based on WES profiles of nine HBV-positive HCCs³². In total, we obtained mutated-read fractions and CNA estimates for 1424 mutation candidates in 9 tumors and 43 tumor samples. Table S3 lists the data input to Chimæra, including mutation-frequency and CNA estimates for each mutation; it also details the outcome of the analyses described below.

Chimæra inferred frequencies estimates for 60% (858/1424) of all mutations, reconstructing phylogenetic trees for each tumor and predicting initiating clones and proliferative subclones; see representative trees in Fig. 5a–c. In contrast, SCHISM inferred mutations frequency for 18% of the identified mutations. Interestingly, 100% (9/9) of the HBV-positive HCCs had predicted initiating mutations in WNT-signaling pathway genes (Fig. 5d). An examination of 102 TCGA-profiled HBV-positive HCCs²³ suggested that 74% (75/102) of samples carried mutations in WNT-signaling pathway genes, and that the majority of these samples (76%) had WNT-signaling pathway mutations with mutated-read fractions above 25%—corresponding to mutations that are potentially present in the majority of cells.

**Fig. 5: Inferred tumor phylogenies for HBV-positive HCCs suggest that WNT-signaling pathway mutations play a key role in tumor initiation.**

To test whether WNT-signaling pathway genes were enriched for mutations—and particularly high-frequency mutations with mutated-read fractions above 25%—we calculated the proportion of tumors with such mutations in each of 186 KEGG pathways in MSigDB³². The most enriched pathways by p-value and mutated-sample fraction are shown in Fig. 5e. p-values were estimated using permutation testing, where for each pathway, random same-size gene sets were generated using KEGG pathway genes and the mutated-sample fraction taken to generate a null distribution. WNT-signaling was the most enriched pathway, and most of the remaining enriched pathways significantly overlapped it (p < 0.01, FET). To correct for this overlap³³—where pathways that overlap another pathway that is mutated in many samples are identified as significant—we recalculated enrichment significance for each pathway using the same test but after excluding WNT-signaling pathway genes. We note that MAPK-signaling and two other top-10 pathways were still enriched (Fig. 5e). Analyses products and input data are given in Table S3.

Phylogeny inference in Wilms’ tumors

To test SCHISM’s and Chimæra’s predictive ability in tumors with a range of genomic instability, we selected three Wilms’ tumors with low-genomic (CG118), intermediate-genomic (CG565), and high-genomic instability (CG163). Multiple regions of these tumors were profiled by WES, including six regions from each of CG118 and CG163 and eight regions from CG565 (Fig. 6a). SCHISM produced stable tumor subclone predictions that agreed with predictions by Chimaera for CG118, but it did not converge on a set of subclones even when profiling was available for seven regions from CG565; it also was not able to predict any subclones for CG163 (Fig. 4b). Chimæra tumor subclone predictions converged with fewer profiled regions, and Chimæra predicted phylogenies for all three profiled tumors.

**Fig. 6: Inferred tumor phylogenies for high-risk Wilms’ tumors identified driving initiating mutations.**

Chimæra analysis of CG118 profiles (Fig. 6b) suggested that the tumor was composed of primarily two types of cells: tumor subclones with a CTNNB1 (S45F) mutation and those with a mutation in WT1 (R445W) mutation; a predicted daughter clone of the CTNNB1 mutation had no previously studied mutations. Both CTNNB1 and WT1 mutations have been previously implicated with Wilms’ tumor genesis^34,35, and both clone types were present in every profiled region. However, the majority of cells—in all regions—were predicted to have CTNNB1 mutations. To compare the effects of these mutations, we compared RNA-expression profiles in regions with the lowest and highest frequencies of WT1-mutated cells (Fig. 6c), 7% and 3%, respectively. Analysis of the CTNNB1-pathways and WT1-pathways^36,37 suggested that they are differentially regulated across these tumor regions (Fig. 6c). These data suggested that S45F activates CTNNB1 and that R445W inhibits WT1, as previously described^38,39.

Chimæra analysis of CG163 profiles suggested that the acquisition of missense mutation in LIN28A (p.R132H) was a key event early in the formation of this tumor. Chimæra identified a second event that produced a less frequent tumor clone with many coding mutations with unknown significance. RNA-expression profiles of regions with the lowest and highest concentration of LIN28A mutations suggested that genes downstream of LIN28 are significantly altered in mutated cells. Analysis of CG565 did not reveal any mutations with known significance in Wilms’ tumors. Data and analyses are provided in Table S4.

Phylogeny inference in prostate tumors

To further test Chimæra’s predictive ability, we studied ultra-deep profiles of multiple regions of a select set of prostate cancers at multiple time points. Ultra-deep targeted profiling allows for improved mutation identification and read-fraction estimation. A single region from each of these cancers was previously profiled and helped identify multiple predicted driver mutations^40,41. However, because mutations detection by WES are not always reliable and often includes both false-positive and false-negative predictions⁴², we selected three cancers and designed a mutation panel that targets mutated genes in these cancers, as well as other known driver genes in prostate cancers⁴³ for ultra-deep sequencing. The identity of targeted genes is given in “Methods” section. This approach helped test Chimæra’s performance in more restrictive assays, which are quickly becoming standard in oncology clinics. Controls and five areas of each cancer were profiled at 2, 3, and 5 time points per cancer using both our targeted sequencing panel and OncoScan arrays to estimate CNAs on genome scales; areas profiled from tumor PC1 at each of three time points are shown in Fig. 7a and given in Supplementary Table S5.

**Fig. 7: Inferred phylogenies of tumors from three prostate cancer patients.**

We recorded changes in treatment, identified mutations that may have been acquired following treatment, and predicted phylogenies for each tumor based on these mutations. We only considered mutations that were observed in multiple regions at the same time point, thus eliminating two-thirds of the candidates. Our results demonstrate the feasibility of phylogeny inference from targeted-panel profiling of multiple tumor regions and support phylogeny prediction efforts by suggesting that predicted subclones that are supported by multiple genetic variants may persist and accumulate additional genetic variants across time. We describe the results of Chimæra analysis below.

The tumor PC1 was diagnosed as Gleason 3 + 4 and treated with an LHRH analog. It was profiled at 3-time points during follow-ups after treatment was started. Timepoint 1 in Fig. 7a was taken over a year after diagnosis and treatment, and Time points 2 and 3 followed 1.17 and 1.4 years after Time Point 1. These suggested increased risk, with Gleason 5 + 5 and the discovery of a mutation in AR that has been associated with increased cancer cell proliferation and poor outcome. Chimæra analysis of profiles of PC1 suggested two predominant tumor subclones that are represented in Fig. 7b by predicted deleterious mutations in EP300 (p.I997V) and AR (p.T878A); Fig. 7b. Tumor subclones with the EP300 mutation were present in all regions profiled at Time Point 1. AR mutations were identified at high frequencies at Time Points 2 and 3 but only in regions that lacked the EP300 mutation and had high Gleason scores. Together, these findings suggest that either the most aggressive tumor sections were not profiled in Time Point 1, or that subclones with the AR mutation have a proliferative advantage and have overtaken subclones with the EP300 mutation. Protein profiling by Hyper Reaction Monitoring of regions with low-frequencies and high-frequencies for the subclones with the AR mutation confirmed that genes that have been shown to be downregulated by AR⁴⁴ are significantly upregulated in AR-mutated regions.

PC2 was diagnosed as Gleason 5 + 4 and biopsied before treatment. The patient was treated for 9 months with LHRH-Analogon, during which the cancer was biopsied two additional times and registered an increase in severity to Gleason 5 + 5 at Time Point 3. Following treatment, the tumor was biopsied a 4th time and showed no change in severity (Gleason 5 + 5). The patient was then treated with combined androgen blockage—LHRH and Casodex—for 2 months, followed by a Gleason 4 + 5 diagnosis at Time Point 5. Time Point 1 profiles revealed the loss of RB1 and a commonly observed stop-gain mutation in PTEN (p.R303X). Mutations in BRCA2 and EP300 were observed in Time Points 4 and 5. Chimæra analysis suggested that the RB1 locus deletion predates the PTEN mutation and that the BRCA2 and EP300 mutations were acquired in tumor subclones that following RB1 loss, lacked the PTEN mutation (Fig. 7d).

The tumor PC3 was diagnosed as Gleason 5 + 4 using post-treatment profiles, which included androgen blockers and orchiectomy 9 years prior to Time Point 1 profiling. A second biopsy, taken 6 months after the first suggested increased severity and Gleason 5 + 5. Tumor profiles identified known deleterious mutations in PTEN (p.Q245) and BRCA1 (p.E1038G), as well as a stop-gain mutation in BRIP1, and nonsynonymous mutations in BRCA2 and PALB2. All mutations were identified at both time points and nearly all regions. Chimæra’s analysis suggested that RB1 loss was followed by the BRCA1 mutation. This was followed by the acquisition of a mutation in TP53 (p.V173G) or PALB2 and BRIP1. These sister clones subsequently acquired mutations in PTEN and BRCA2, respectively (Fig. 7e). Comparing the expression of genes downstream from PTEN and TP53 by Hyper Reaction Monitoring in regions with the lowest-consternation and highest-consternation of subclones with PTEN and TP53 mutations suggested that these mutations disrupt the associated pathway (Fig. 7f). Genes known to be downregulated by PTEN⁴⁵ were upregulated in regions rich with PTEN-mutated subclones as did targets of TP53⁴⁶. All input data—including protein expression profiles—and analyses products are given in Table S5.

Discussion

Bulk tissue DNA profiling by whole-genome and targeted sequencing can identify key DNA alterations that provide insight into the biology of tumors and indicate effective treatment options. Increasingly, these assays are used to help elucidate the clonal composition of heterogeneous tumors and even predict ancestral relationships between tumor subclones. The key to such efforts is the accurate estimation of mutation frequencies from both high-coverage and low-coverage DNA profiles. Our study suggested that such estimation efforts must explicitly account for the copy numbers of both the reference and alternative alleles and that only accurate mutation-frequency estimates can yield accurate tumor phylogenies.

Accordingly, we reported on methodology to improve the accuracy of tumor phylogeny reconstruction by improving mutation-frequency estimates from DNA profiles of multiple same-tumor biopsies. Our analysis suggested that mutation-frequency estimates are particularly challenging in the face of high genetic instability, which is characteristic of high-risk cancers, and that the accuracy of methods that rely on DNA profiles of a single biopsy of such cancers is poor. We also showed that even when profiles of multiple biopsies are available, methods that do not explicitly account for the full range of copy number variability produce inconsistent results and have poor accuracy.

We briefly outlined current methods to reconstruct tumor clonal evolution using DNA-profiling. These include methods that rely on single-cell profiles and methods that resolve subclone mixtures from profiles of bulk tissues. We note that single-cell DNA sequencing is expected to help produce more accurate mutation-frequency estimates, but this technology is yet to mature and does not produce accurate single-cell estimates for FFPE tumor samples. The greatest challenge in cancer genomics—and this is not expected to change—is sample availability for patients with rich or specific clinical annotation, and FFPE is and will remain the primary preservation method for solid tumor resections. Moreover, frozen samples that could be used to generate cell suspensions that may be profiled using single-cell DNA sequencing cannot be easily used to evaluate the heterogeneity of tumors. We expect that whole-genome, whole-exome, and targeted-panel sequencing will produce the vast majority of tumor DNA profiles in the near future.

We expect that in the cases that single-cell DNA and RNA profiling is possible and pursued, whole-genome and whole-exome sequencing will be used to inform on the biology of clones. In these cases—where analyses of single-cell RNA, DNA, or protein expression will be integrated with analyses of bulk tumor profiles—methods that impute mutation frequencies and even tumor phylogeny from bulk tumor DNA profiles will play a key role in this integrated analysis. Most importantly, we argue that given the heterogeneity of solid tumors and the often observable and documented differences in the composition of regions on the same tumor, profiling multiple tumor regions will be required for both research and clinical efforts in the future.

Our analyses also suggested that Chimæra improves on mutation-frequency estimates by harnessing added information from multiple profiles and by directly accounting for the influence of CNAs on observations from DNA profiles. Chimæra’s advantage was clearly observed in simulated data, where its performance was the most consistent and its accuracy the greatest. Interestingly, while Chimæra was able to estimate mutation frequencies with relatively high accuracy even at very high and very low copy numbers, its performance declined for the most unstable genomes where copy numbers for the same mutation varied widely across samples.

Using three tumor types—including Wilm’s tumors, which is expected to have relatively low genetic instability and few mutations—and prostate cancers, where longitudinal biopsies from the same patients were available, we showed that when given a sufficient number of biopsies per tumor, Chimæra is able to address mutation-frequency estimate challenges arising from genetic instability. Our results suggested that Chimæra’s mutation aggregation approach can help resolve issues arising from convergent evolution²⁰ and false-positive mutation calls⁴². We also showed how Chimæra could be used in conjunction with ultra-deep sequencing to improve on mutation calling accuracy. In conclusion, our results suggest that accurate mutation-frequency and cellularity inference are possible using profiles of multiple biopsies per tumor when coupled with analyses that account for the effects of CNAs on observed mutation read fractions.

Methods

We formulated the phylogeny reconstruction problem in set-theoretic terms, which lead to a natural model for the effects of CNAs on mutated-read fractions in sequencing profiles. We describe our methodology for simulating WES tumor profiles, as well as our efforts to deconvolve mutation frequencies from simulated data using ABSOLUTE, AncesTree, EXPANDS, SCHISM, and Chimæra. Note that, to reduce analysis complexity, CNAs and mutation frequency simulations did not include alterations in sex chromosomes. Finally, to demonstrate that Chimæra can be effectively applied to clinical data, we described reconstructed phylogenies from WES profiles of ten same-tumor CRPC biopsies; a set of five same-tumor HCC biopsies from nine patients; two six same-tumor biopsies and an eight same-tumor biopsy of three Wilms’ tumors; and three same-tumor prostate-cancer biopsies from three patients profiled at multiple time points.

Phylogeny reconstruction problem

Let $M = \left\{ {m:m \in {\Bbb N},1 \le m \le n} \right\}$ denote the set of n mutations identified across a set of profiled biopsies S. The mutation burden in any given cell is given as a subset of M, $\gamma \subseteq M$, or as an element of the power set over M, ${\cal{P}}(M)$; i.e., γ ∈ ${\cal{P}}(M)$ is a specific mutation ensemble that characterizes a tumor subclone. We denote the cellularity of γ and its corresponding subclone in biopsy s ∈ S as $\rho _\gamma ^{\mathrm{s}}$, and the frequency of mutation m ∈ γ in biopsy s as $\varphi _{\mathrm{m}}^{\mathrm{s}} = \mathop {\sum}\nolimits_{\left\{ {\gamma :\gamma \in {\cal{P}}(M),m \in \gamma } \right\}} {\rho _\gamma ^{\mathrm{s}}}$. Consequently, $\mathop {\sum}\nolimits_{\gamma \in {\cal{P}}(M)} {\rho _\gamma ^{\mathrm{s}} = 1}$ and the assignment $A = \{ {\rho _\gamma ^{\mathrm{s}}:\gamma \in {\cal{P}}\left( M \right),s \in S} \}$ produces a solution to our clonality reconstruction formulation.

Mutation frequencies

As defined above, for a mutation m in biopsy s ∈ S, $\varphi _{\mathrm{m}}^{\mathrm{s}}$ denotes the frequency of cells in s with mutation m. The total copy number C^s of the allele targeted by the mutation can be estimated from WES data. C^s is composed by: the copy numbers of the allele in cells that lack mutation m, δ^s, the copy number of the wildtype allele in m-mutated cells, $\delta _{\mathrm{w}}^{\mathrm{s}}$ and the copy number of the mutated allele in m-mutated cells, $\delta _{\mathrm{m}}^{\mathrm{s}}$ (Fig. 2). Notice that if no copy number event has occurred at the locus:δ^s = 2, $\delta _{\mathrm{w}}^{\mathrm{s}} = 1$ and $\delta _{\mathrm{m}}^{\mathrm{s}} = 1$. Adopting the infinite-sites assumption, we denote the mutated-read fraction—the fraction of reads reflecting the mutated versus wild-type allele in a WES profile—in sample s as $f_{\mathrm{m}}^{\mathrm{s}}$. Then, we can formulate the following equations (Eqs. 1, 2).

$$C^{\mathrm{s}} = \delta ^{\mathrm{s}}\left( {1 - \varphi _{\mathrm{m}}^{\mathrm{s}}} \right) + (\delta _{\mathrm{w}}^{\mathrm{s}} + \delta _{\mathrm{m}}^{\mathrm{s}})\varphi _{\mathrm{m}}^{\mathrm{s}}.$$

(1)

$$f_{\mathrm{m}}^{\mathrm{s}} = \frac{{\varphi _{\mathrm{m}}^{\mathrm{s}}\delta _{\mathrm{m}}^{\mathrm{s}}}}{{C^{\mathrm{s}}}}.$$

(2)

Equation (1) provides a weighted sum of the copy number contribution from each allele type, and Eq. (2) gives the ratio of the number of reads from the mutated allele and the total number of reads.

Chimæra

Chimæra proceeds in three steps. First, mutation frequencies are approximated from sequencing and CNA data in each biopsy; then, mutations with similar frequency vectors (where each vector component gives the mutation frequency in each biopsy) are clustered together to form subclones; and finally, mutation frequencies and CNAs for these alleles are refined using an optimization process. The optimization assumes that clustered mutations that are associated with the same subclone have the same frequencies in each tumor biopsy and that $\delta _{\mathrm{m}}^{\mathrm{s}}$—the average copy number of m (the mutated allele)—is unchanged across biopsies from the same tumor. This assumption can be relaxed in post-processing.

We first approximate the true frequency of the mutation $\varphi _{\mathrm{m}}^{\mathrm{s}}$ by accounting for tumor purity, i.e., the fraction of tumor cells in biopsies that include non-tumor cells, and assuming that the allele’s average copy number in tumor cells—whether mutated or not—is fixed. Let p^s be the purity of biopsy s, then Eq. (2) can be rewritten as follows:

$$f_{\mathrm{m}}^{\mathrm{s}} = \frac{{\varphi _{\mathrm{m}}^{\mathrm{s}}\delta _{\mathrm{m}}^{\mathrm{s}}p^{\mathrm{s}}}}{{2\left( {1 - p^{\mathrm{s}}} \right) + C^{\mathrm{s}}p^{\mathrm{s}}}}.$$

(3)

The experimentally observed copy number, ${\mathrm{C}}_{{\mathrm{obs}}}^{\mathrm{s}}$, depends on the purity of the sample and the copy number of the sample tumor cells, C^s, as follows:

$$C_{{\mathrm{obs}}}^{\mathrm{s}} = 2\left( {1 - p^{\mathrm{s}}} \right) + C^{\mathrm{s}}p^{\mathrm{s}},$$

(4)

where $C_{{\mathrm{obs}}}^{\mathrm{s}}$ can be estimated using additional biochemical assays, genetic sequencing, or through computational analysis of WES data⁴⁷, and the normal cells are assumed to have been corrected for germline copy number variants associated biases.

The simplifying assumption that the allele’s average copy number—averaged across all profiled cells—of the mutated allele in tumor cells is constant across biopsies, i.e,: $\delta _{\mathrm{m}}^{\mathrm{s}} = \frac{{C^{\mathrm{s}}}}{2}$. Under this approximation, we can use Eqs. (3) and (4) to eliminate C^s and obtain a first approximation of the mutation frequency $\widetilde {\varphi _{\mathrm{m}}^{\mathrm{s}}}$:

$$\widetilde {\varphi _{\mathrm{m}}^{\mathrm{s}}} = {\mathrm{min}}\left( {\frac{{2f_{\mathrm{m}}^{\mathrm{s}}C_{{\mathrm{obs}}}^{\mathrm{s}}}}{{C_{{\mathrm{obs}}}^{\mathrm{s}} - 2\left( {1 - p^{\mathrm{s}}} \right)}},1} \right).$$

(5)

This constraint will be later removed in the optimization process that follows but is necessary at this stage to obtain a first approximation that mutation frequencies that take into account the copy number influence from WES measurements. The minimization is necessary because of the interplay between the copy numbers at these alleles that may produce a first approximation above 1.

The approximate mutation frequency vectors (Eq. 5) are next clustered to identify candidate groups of mutations that form subclones. We considered clustering algorithms with the robust treatment of outliers in order to ensure good clustering stability and quality. Specifically, we used a method modeled after hdbscan, a density-based hierarchical clustering method that aims at maximizing the stability of the obtained clustered against noise and requires minimal parameter selection. The number of clusters is determined automatically based on the minimal number of mutations that have to be considered to constitute a cluster. We also use tclust²⁹, a nonhierarchical robust clustering that trims outliers based on a probabilistic model. The number of clusters is selected by optimizing intracluster entropy or the sum of squared errors (SSE) and using a variety of optimization methods including the Elbow method, Gaussian mixture decomposition (GMD), and SD index^48,49,50. The clustering based on hdbscan showed better performance on the generated synthetic data compared to others, especially when considering the number of mutations that could be assigned frequency estimates. Furthermore, it has the advantage of avoiding imposing a prior distribution on the mutations frequencies. Once the clusters are found, Chimæra assumes that each cluster represents a subclone and uses the mutation assignment to infer subclone frequencies and copy number estimates for each mutated allele in the final optimization step.

Focusing on subclone ∈ ${\cal{P}}(M)$, Eq. (3) describes a relationship between the frequencies and copy numbers of mutations in γ:

$$\varphi _{\mathrm{m}}^{\mathrm{s}}\delta _{\mathrm{m}}^{\mathrm{s}} = f_{\mathrm{m}}^{\mathrm{s}}\frac{{C_{{\mathrm{obs}}}^{\mathrm{s}}}}{{p^{\mathrm{s}}}} \equiv {\cal{B}}_{{\mathrm{ms}}},\forall m \in \gamma ,s \in S,$$

(6)

where ${\cal{B}}_{{\mathrm{ms}}}$ is the entry of a matrix ${\mathbf{{\cal{B}}}} \in {\Bbb R}^{\left| S \right|,\left| \gamma \right|}$ corresponding to mutation m and biopsy s. ${\mathbf{{\cal{B}}}}$ is fully determined from analysis of sequencing assays, including purity, observed copy numbers, and observed mutated-read fractions of each mutation.

Unfortunately, the right-hand side of Eq. (6)—a multiplication of frequencies and copy numbers—can not be analytically decoupled. However, from our problem formulation, mutations from the same subclone are expected to have the same mutation frequencies, i.e., $\varphi _{{\mathrm{m}}_{\mathrm{i}}}^{\mathrm{s}} = \varphi _{{\mathrm{m}}_{\mathrm{j}}}^{\mathrm{s}} \equiv \varphi ^{\mathrm{s}}\,\forall \,m_{\mathrm{i}},m_{\mathrm{j}} \in \gamma$. Further, we assume that the copy number of each mutation m is constant across biopsies, i.e., $\delta _{\mathrm{m}}^{{\mathrm{s}}_{\mathrm{i}}} = \delta _{\mathrm{m}}^{{\mathrm{s}}_{\mathrm{j}}} \equiv \delta _{\mathrm{m}} \in \left[ {0,CN} \right]\forall \,{\mathrm{s}}_i,{\mathrm{s}}_{\mathrm{j}} \in S$, where CN is a fixed upper bound for the copy number; CN = 15 in our simulations. While we expect that this assumption will introduce some errors to the approximation of $\delta _{\mathrm{m}}^{\mathrm{s}}$, it will have limited effects on the selection of optimal mutation frequencies because the variability of copy number averages for the mutated allele across biopsy is expected to be low for most mutated loci. We also note that we have not assumed stable genomes in our simulated data, i.e., the generated data displays variable copy numbers for the same mutated allele across biopsies in order to have an accurate estimate of the committed error.

After these assumptions, the optimization problem for each subclone $\in$${\cal{P}}(M)$, based on Eq. (6), can be formulated as:

$${\min} {\Vert} {\mathop{{\varphi} ^{\mathrm{s}}}\limits^{\rightharpoonup}} \otimes {\mathop{{\delta}_{\mathrm{m}}}\limits^{\rightharpoonup}} - {\cal{B}} {\Vert_2}; 0 \le \delta _{\mathrm{m}} \le {\mathrm{CN}},0 \le \varphi ^{\mathrm{s}} \le 1,\forall {\mathrm{m}} \in \gamma, \forall {\mathrm{s}} \in S,$$

(7)

where $\mathop{{\varphi ^{\mathrm{s}}}}\limits^{\rightharpoonup}$ is the mutation frequency vector across biopsies for all mutations in γ; $\mathop{{\delta _{\mathrm{m}}}}\limits^{\rightharpoonup}$ is the copy-number vector for each mutation in γ; ${\mathbf{{\cal{B}}}}$ is as defined in Eq. (6); and $\mathop{{\varphi ^{\mathrm{s}}}}\limits^{\rightharpoonup} \otimes \mathop{{\delta _{\mathrm{m}}}}\limits^{\rightharpoonup}$ denotes the outer product of vectors $\mathop{{\varphi ^{\mathrm{s}}}}\limits^{\rightharpoonup} \in {\Bbb R}^{\left| S \right|}$ and $\mathop{{\delta _{\mathrm{m}}}}\limits^{\rightharpoonup} \in {\Bbb R}^{\left| \gamma \right|}$. We used the sequential least-squares programming (SLSQP) optimization⁵¹ to find an optimal solution of Eq. (7). To avoid being trapped in local optima, multiple runs with different random initializations for the mutations frequencies and unitary variant allele copy numbers are performed.

Profiling and analysis of Wilms’ tumors

DNA libraries were constructed and sequenced as previously described⁵². Briefly, after QC, high molecular weight double-strand genomic DNA samples are constructed into Illumina PairEnd pre-capture libraries according to the manufacturer’s protocol (Illumina Inc.) with modifications. 300 base-pair fragments were checked using a 2.2% Flash Gel DNA Cassette 5 (Lonza, Cat. No.57023). The Fragmented DNA were End-Repaired, incubated, A-tailed, and ligated. Agencourt® XP® Beads (Beckman Coulter Genomics, Inc.; Cat. No. A63882) were used to purify DNA after each enzymatic reaction. After Beads purification, PCR product quantification and size distributions were determined using the Caliper GX 1K/12K/High Sensitivity Assay Labchip (Hopkinton, MA, Cat. No. 760517). Pre-capture libraries (1 µg) were hybridized in solution to VCRome 2.1 (NimbleGen) targeting 43 Mb of sequence from ~30 K genes, according to the manufacturer’s protocol. Sequencing was performed in paired-end mode with Illumina HiSeq 2000. Cluster formation and primer hybridization were performed on the flow cell with Illumina’s cBot cluster generation system. On average, about 80–100 million successful reads, consisting of 2 × 100 bp, were generated on each lane of a flow cell. CNAs and mutation frequencies in sex chromosomes were estimated using Genome Analysis Toolkit pipelines.

RNA-seq profiles of Wilms’ tumors were aligned using STAR v2.3.0e³² to an index of GRCh37 that included GENCODE v16 gene annotation. Alignment files were processed using Picard tools v1.54, and the final BAM files indexed using SAMtools index v0.1.11³³. RNA-seq run quality was assessed using the RNA-SeQC package³⁴ using the same GENCODE 16.gtf file. Transcript quantification was performed using Cufflinks¹⁸ v2.02 running in quantification mode against the GENCODE v16.gtf file. FPKM values were used for relative abundance estimation.

Profiling and analysis of WES of prostate cancer biopsies

To test our ability to infer mutation frequencies and ancestral relations between subclones based on clinical profiles of four prostate cancers. The Specimen were collected at the Department of Pathology and Molecular Pathology, University Hospital Zurich, Switzerland as previously described⁵³ with the approval of Cantonal scientific ethics committee Zurich, approval number KEK-ZH-No. 2014-0007, and with informed consent by the patient. Tumor regions were selected for heterogeneous histological presentation by an experienced uropathologist (PJW). DNA from peripheral blood and FFPE punches was isolated with the Maxwell 16 LEV Blood DNA kit (Promega, AS1290) and Maxwell 16 FFPE Tissue LEV DNA Purification Kit (Promega AS1130), respectively, according to manufacturer’s recommendations; 300 μl of blood collected in a BD Vacutainer K2 (EDTA 18.0 mg) tube was added to 30 μl of Proteinase K solution (final concentration 2 mg/ml) and subsequently mixed with 300 μl lysis buffer, vortexed, and incubated for 20 min at 56 °C. FFPE cylinders were deparaffinised with xylene, washed twice with ethanol, dried 10 min at 37 °C and re-suspended in 200 μl incubation buffer containing 2 mg/ml Proteinase K. Samples were incubated overnight at 70 °C and mixed with 400 μl lysis buffer. Lysates from both, blood and FFPE tissues, were transferred to well 1 of the supplied cartridge of the corresponding kit and DNA was automatically purified and eluted in 30 μl Tris-buffer, pH 8.0 by the Maxwell instrument.

WES samples were profiled using Agilent SureSelect Whole Exome Enrichment, v6 (58 Mbp) and 2 × 75 bp paired-end reads were used for optimal performance on a HiSeq 4000 (Illumina). Mutation calling was followed by protocols established by TCGA and ExAC^21,54. Reads were aligned to hg19 using BWA⁵⁵, and variants were called with GenomeAnalysisTK, MuTect⁵⁶, Picard MarkDuplicates, and additional post-processing utilities from GATK including BaseRecalibrator. FastQ files were deposited in EBI’s ENA project PRJEB19193. Predicted mutations are given in Table S5; mutations were annotated with estimated read fractions and estimated CNAs by VarScan using default parameters and after setting the maximum amplification to 15×⁴⁷.

Profiling and analysis of targeted-panel sequencing of prostate cancer biopsies

We selected a total of 36 genes whose mutations are enriched in prostate cancers for ultra-deep sequencing^40,41. ImmQuant was used to capture their coding regions and the completeness of the capture was verified against the human reference genome GRCh38. Target genes are given below.

AKT1	CDH1	MED12	PMS2
AR	CDKN1B	MRE11A	PTEN
ATM	CHEK2	MSH2	RAD51C
ATR	EP300	MSH6	RAD52D
AURKA	ERG	MYC	RB1
BARD1	EZH2	MYCN	SPOP
BRACA1	FOXA1	NBN	TMPRSS
BRACA2	GEN1	PALB2	TP53
BRIP1	HOBX13	PIK3CA	ZNF595

The samples were taken from the Prostate Cancer Outcomes Cohort Study from UZH (ProCOC) and Metastatic Prostate Cancer Biobank from UZH (metaProC). Patients and target genes were selected for ultra-deep sequencing based on WES profiling and using our predictive panel. Their profiles were implemented in three batches and sequenced using Illumina HiSeq using 150 bp pair-end reads by Sophia Genetics. Analysis of DNA profiles mirrored the steps described for WES analysis. BAM files are freely available on ENA project PRJEB19193.

Copy numbers were estimated using Affymetrix OncoScan arrays and the OncoScan FFPE Assay Kit for detection of genome-wide copy number changes and loss of heterozygosity in FFPE samples. Oncoscan uses molecular inversion probe technology to query over 220,000 SNPs at carefully selected genomic locations, evenly distributed across the genome and with increased density within approximately 900 cancer or cancer-related genes. All samples underwent array hybridization and analysis and passed gel QC, as established during validation of the assay for clinical genetic analysis. Data were analyzed using Nexus Express.

Protein-expression profiling and analysis of prostate cancer biopsies

For the comprehensive quantitation of proteins in high throughput manner, we employed a data-independent acquisition (DIA) workflow compiled of the commercially available and standardized iST sample preparation kit (PreOmics GmbH), a Q Exactive HF (Thermo Fisher Scientific Inc.) and Spectronaut Pulsar data analysis software⁵⁷. We have applied a combination of published and specifically generated spectral libraries. We have prepared and analyzed a selected five tissue samples from human prostate adenocarcinomas from the retrospective FFPE PC cohort from the University Hospital Zurich as described above⁴¹. Each specimen was analysed once.

First, the FFPE fixed material was deparaffinised and peptides were generated by using the iST 96× kit from PreOmics following the optimized protocol for FFPE punches. This was followed with DIA-MS analysis of 1 µg of total peptide mixture for each patient sample including iRT standard peptides for non-linear calibration, batch correction and large-scale data set merge. Then, we generated and applied spectral libraries through the merge of an in house generated library (Spectronaut Pulsar) and a commercial organ-specific library 311 FFPE prostate library (311_FFPE_prostate). The resulting library contained 8200 protein groups with a size of 65.5 mb. For quality control, the peptide yield was determined after protein extraction, digestion, and peptide clean up. MS signal intensity was monitored (TIC) during entire MS run as well as the performance and stability of the liquid chromatography according to a reference peptide set.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

BAM files are freely available without restrictions on ENA project PRJEB19193. All supplementary tables, including synthetic data, processed data, and analyses used to produce Fig. 3 are also given in the GitHub repository https://github.com/drugilsberg/chimaera.

Code availability

The Chimæra web server is available at https://ibm.biz/chimaera-aas and source code is available at https://github.com/drugilsberg/chimaera.

References

Futreal, P. A. et al. A census of human cancer genes. Nat. Rev. Cancer 4, 177–183 (2004).
CAS PubMed PubMed Central Google Scholar
Higgins, M. E., Claremont, M., Major, J. E., Sander, C. & Lash, A. E. CancerGenes: a gene selection resource for cancer genome projects. Nucleic Acids Res. 35, D721–D726 (2007).
CAS PubMed Google Scholar
Ding, L. et al. Perspective on oncogenic processes at the end of the beginning of cancer genomics. Cell 173, 305–320. e310 (2018).
CAS PubMed PubMed Central Google Scholar
Nowell, P. C. The clonal evolution of tumor cell populations. Science 194, 23–28 (1976).
CAS PubMed Google Scholar
Fidler, I. J. & Hart, I. R. Biological diversity in metastatic neoplasms: origins and implications. Science 217, 998–1003 (1982).
CAS PubMed Google Scholar
Boutros, P. C. et al. Spatial genomic heterogeneity within localized, multifocal prostate cancer. Nat. Genet. 47, 736–745 (2015).
CAS PubMed Google Scholar
Wang, Y. et al. Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature 512, 155–160 (2014).
CAS PubMed PubMed Central Google Scholar
Suzuki, H. et al. Mutational landscape and clonal architecture in grade II and III gliomas. Nat. Genet. 47, 458–468 (2015).
CAS PubMed Google Scholar
Mann, K. M. et al. Analyzing tumor heterogeneity and driver genes in single myeloid leukemia cells with SBCapSeq. Nat. Biotechnol. 34, 962–972 (2016).
Gao, R. et al. Punctuated copy number evolution and clonal stasis in triple-negative breast cancer. Nat. Genet. 48, 1119–1130 (2016).
CAS PubMed PubMed Central Google Scholar
Andor, N. et al. Pan-cancer analysis of the extent and consequences of intratumor heterogeneity. Nat. Med. 22, 105–113 (2016).
CAS PubMed Google Scholar
El-Kebir, M., Oesper, L., Acheson-Field, H. & Raphael, B. J. Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics 31, i62–i70 (2015).
CAS PubMed PubMed Central Google Scholar
Niknafs, N., Beleva-Guthrie, V., Naiman, D. Q. & Karchin, R. SubClonal hierarchy inference from somatic mutations: automatic reconstruction of cancer evolutionary trees from multi-region next generation sequencing. PLoS Comput. Biol. 11, e1004416 (2015).
PubMed PubMed Central Google Scholar
Espiritu, S. M. G. et al. The evolutionary landscape of localized prostate cancers drives clinical aggression. Cell 173, 1003–1013 (2018).
Chu, W. K. et al. Ultraaccurate genome sequencing and haplotyping of single human cells. Proc. Natl Acad. Sci. 114, 201707609 (2017).
Getz, G. & Ardlie, K. in TCGA Second Annual Scientific Symposium (eds Meyerson, M. & Shmulevich, I.) (National Institutes of Health, Crystal City, Virgina, 2012).
Cieslik, M. et al. The use of exome capture RNA-seq for highly degraded RNA with application to clinical cancer sequencing. Genome Res. 25, 1372–1381 (2015).
CAS PubMed PubMed Central Google Scholar
Wang, J. et al. Tumor evolutionary directed graphs and the history of chronic lymphocytic leukemia. eLife 3, e02869 (2014).
Gundem, G. et al. The evolutionary history of lethal metastatic prostate cancer. Nature 520, 353–357 (2015).
CAS PubMed PubMed Central Google Scholar
Kuipers, J., Jahn, K., Raphael, B. J. & Beerenwinkel, N. Single-cell sequencing data reveal widespread recurrence and loss of mutational hits in the life histories of tumors. Genome Res. 27, 1885–1894 (2017).
The Cancer Genome Atlas. The molecular taxonomy of primary prostate cancer. Cell 163, 1011–1025 (2015).
Google Scholar
The Cancer Genome Atlas. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
Google Scholar
TCGA. Comprehensive and integrative genomic characterization of hepatocellular carcinoma. Cell 169, 1327–1341 (2017).
Google Scholar
Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 30, 413–421 (2012).
CAS PubMed PubMed Central Google Scholar
Andor, N., Harness, J. V., Muller, S., Mewes, H. W. & Petritsch, C. EXPANDS: expanding ploidy and allele frequency on nested subpopulations. Bioinformatics 30, 50–60 (2014).
CAS PubMed Google Scholar
Deshwar, A. G. et al. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 16, 35 (2015).
PubMed PubMed Central Google Scholar
Malikic, S., McPherson, A. W., Donmez, N. & Sahinalp, C. S. Clonality inference in multiple tumor samples using phylogeny. Bioinformatics 31, 1349–1356 (2015).
CAS PubMed Google Scholar
McInnes, L., Healy, J. & Astels, S. hdbscan: hierarchical density based clustering. J. Open Source Softw. 2, 205 (2017).
Google Scholar
Fritz, H., Garcıa-Escudero, L. A. & Mayo-Iscar, A. tclust: an r package for a trimming approach to cluster analysis. J. Stat. Softw. 47, 1–26 (2012).
Google Scholar
Fraley, C. & Raftery, A. E. MCLUST Version 3: An R Package for Normal Mixture Modeling and Model-based Clustering (DTIC Document, 2006).
Zhao, X. & Sandelin, A. GMD: measuring the distance between histograms with applications on high-throughput sequencing reads. Bioinformatics 28, 1164–1165 (2012).
CAS PubMed Google Scholar
Lin, D.-C. et al. Genomic and epigenomic heterogeneity of hepatocellular carcinoma. Cancer Res. 77, 2255–2265 (2017).
CAS PubMed PubMed Central Google Scholar
Roy, A. et al. Integration of whole transcriptome sequencing into the genomic analysis of pediatric solid tumors: early experience and challenges. J. Mol. Diagn. 16, 754–755 (2014).
Google Scholar
Li, C.-M. et al. CTNNB1 mutations and overexpression of Wnt/β-catenin target genes in WT1-mutant Wilms’ tumors. Am. J. Pathol. 165, 1943–1953 (2004).
CAS PubMed PubMed Central Google Scholar
Pelletier, J. et al. WT1 mutations contribute to abnormal genital system development and hereditary Wilms’ tumour. Nature 353, 431 (1991).
CAS PubMed Google Scholar
Liu, C. et al. Control of β-catenin phosphorylation/degradation by a dual-kinase mechanism. Cell 108, 837–847 (2002).
CAS PubMed Google Scholar
Kim, H.-S. et al. Identification of novel Wilms’ tumor suppressor gene target genes implicated in kidney development. J. Biol. Chem. 282, 16278–16287 (2007).
CAS PubMed Google Scholar
Rubinfeld, B. et al. Stabilization of β-catenin by genetic defects in melanoma cell lines. Science 275, 1790–1792 (1997).
CAS PubMed Google Scholar
Wang, Y. et al. Mutation spectrum of genes associated with steroid-resistant nephrotic syndrome in Chinese children. Gene 625, 15–20 (2017).
CAS PubMed Google Scholar
Guo, T. et al. Multi-region proteome analysis quantifies spatial heterogeneity of prostate tissue biomarkers. Life Sci. Alliance 1, e201800042 (2018).
PubMed PubMed Central Google Scholar
Zhong, Q. et al. A curated collection of tissue microarray images and clinical outcome data of prostate cancer patients. Sci. Data 4, 170014 (2017).
CAS PubMed PubMed Central Google Scholar
Shi, W. et al. Reliability of whole-exome sequencing for assessing intratumor genetic heterogeneity. Cell Rep. 25, 1446–1457 (2018).
CAS PubMed PubMed Central Google Scholar
Robinson, D. et al. Integrative clinical genomics of advanced prostate cancer. Cell 161, 1215–1228 (2015).
CAS PubMed PubMed Central Google Scholar
Doane, A. et al. An estrogen receptor-negative breast cancer subset characterized by a hormonally regulated transcriptional program and response to androgen. Oncogene 25, 3994 (2006).
CAS PubMed Google Scholar
Iwanaga, K. et al. Pten inactivation accelerates oncogenic K-ras–initiated tumorigenesis in a mouse model of lung cancer. Cancer Res. 68, 1119–1127 (2008).
CAS PubMed PubMed Central Google Scholar
Bruins, W. et al. The absence of Ser389 phosphorylation in p53 affects the basal gene expression level of many p53-dependent genes and alters the biphasic response to UV exposure in mouse embryonic fibroblasts. Mol. Cell. Biol. 28, 1974–1987 (2008).
CAS PubMed PubMed Central Google Scholar
Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).
CAS PubMed PubMed Central Google Scholar
Krzanowski, W. J. & Lai, Y. A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics 44, 23–34 (1988).
Google Scholar
Kovács, F., Legány, C. & Babos, A. in 6th International symposium of hungarian researchers on computational intelligence. (Citeseer, 2005).
Celeux, G. & Govaert, G. Gaussian parsimonious clustering models. Pattern Recognit. 28, 781–793 (1995).
Google Scholar
Sheppard, D., Terrell, R. & Henkelman, G. Optimization methods for finding minimum energy paths. J. Chem. Phys. 128, 134106 (2008).
PubMed Google Scholar
Cancer Genome Atlas, N. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).
Google Scholar
Mortezavi, A. et al. KPNA2 expression is an independent adverse predictor of biochemical recurrence after radical prostatectomy. Clin. Cancer Res. 17, 1111–1121 (2011).
CAS PubMed Google Scholar
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
CAS PubMed PubMed Central Google Scholar
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
PubMed PubMed Central Google Scholar
Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).
CAS PubMed PubMed Central Google Scholar
Bruderer, R. et al. Extending the limits of quantitative proteome profiling with data-independent acquisition and application to acetaminophen-treated three-dimensional liver microtissues. Mol. Cell. Proteom. 14, 1400–1410 (2015).
CAS Google Scholar

Download references

Acknowledgements

This project was funded by the European Union’s Horizon 2020 research and innovation programme under grant agreements 668858 and 826121 to P.J.W., M.R.M., and P.S.; by the Foundation for Research in Science and the Humanities at the University of Zurich to P.J.W., and by NCI R21CA223140, Texas Children’s Cancer Center, and Cookies for Kids’ Cancer Foundation to P.S. and D.W.P.

Author information

Authors and Affiliations

IBM Research—Zurich, 8803, Rüschlikon, Switzerland
Matteo Manica, Roland Mathis & María Rodríguez Martínez
Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
Matteo Manica
Texas Children’s Cancer Center, Baylor College of Medicine, Houston, TX, USA
Hyunjae Ryan Kim, Philippe Chouvarine, Arati Pati, Angshumoy Roy, Donald W. Parsons & Pavel Sumazin
Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
Dorothea Rutishauser, Laura De Vargas Roditi, Ulrich Wagner, Kathrin Oehl & Karim Saba
RWTH Aachen University, Faculty of Medicine, Joint Research Centre for Computational Biomedicine, Aachen, Germany
Bence Szalai & Julio Saez-Rodriguez
Institute for Computational Biomedicine, Heidelberg University Hospital, Heidelberg, Germany
Julio Saez-Rodriguez
Senckenberg Institute of Pathology, University Hospital Frankfurt, Frankfurt am Main, Germany
Peter J. Wild

Authors

Matteo Manica
View author publications
You can also search for this author in PubMed Google Scholar
Hyunjae Ryan Kim
View author publications
You can also search for this author in PubMed Google Scholar
Roland Mathis
View author publications
You can also search for this author in PubMed Google Scholar
Philippe Chouvarine
View author publications
You can also search for this author in PubMed Google Scholar
Dorothea Rutishauser
View author publications
You can also search for this author in PubMed Google Scholar
Laura De Vargas Roditi
View author publications
You can also search for this author in PubMed Google Scholar
Bence Szalai
View author publications
You can also search for this author in PubMed Google Scholar
Ulrich Wagner
View author publications
You can also search for this author in PubMed Google Scholar
Kathrin Oehl
View author publications
You can also search for this author in PubMed Google Scholar
Karim Saba
View author publications
You can also search for this author in PubMed Google Scholar
Arati Pati
View author publications
You can also search for this author in PubMed Google Scholar
Julio Saez-Rodriguez
View author publications
You can also search for this author in PubMed Google Scholar
Angshumoy Roy
View author publications
You can also search for this author in PubMed Google Scholar
Donald W. Parsons
View author publications
You can also search for this author in PubMed Google Scholar
Peter J. Wild
View author publications
You can also search for this author in PubMed Google Scholar
María Rodríguez Martínez
View author publications
You can also search for this author in PubMed Google Scholar
Pavel Sumazin
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

M.M., D.W.P., P.J.W., M.R.M., and P.S. concieved and designed the project. M.M., M.R.M., and P.S. lead the project’s implementation. M.M., H.R.M., R.M., P.C., A.P., and P.S. performed data analyses and designed methods. D.R., L.D.V.R., U.W., K.O., K.S., and P.J.W. lead efforts of collecting, profiling, and analyzing prostate cancer samples. A.R. and D.W.P. lead efforts of collecting, profiling and analyzing Wilms’ tumor samples, and M.M., B.S., J.S.Z., and P.S. analyzed regulatory networks and altered pathways.

Corresponding authors

Correspondence to Peter J. Wild, María Rodríguez Martínez or Pavel Sumazin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Material

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Manica, M., Kim, H.R., Mathis, R. et al. Inferring clonal composition from multiple tumor biopsies. npj Syst Biol Appl 6, 27 (2020). https://doi.org/10.1038/s41540-020-00147-5

Download citation

Received: 17 January 2020
Accepted: 15 July 2020
Published: 25 August 2020
DOI: https://doi.org/10.1038/s41540-020-00147-5

This article is cited by

PhyliCS: a Python library to explore scCNA data and quantify spatial tumor heterogeneity
- Marilisa Montemurro
- Elena Grassi
- Gianvito Urgese
BMC Bioinformatics (2021)