Abstract
Copy-number aberrations (CNAs) and whole-genome duplications (WGDs) are frequent somatic mutations in cancer but their quantification from DNA sequencing of bulk tumor samples is challenging. Standard methods for CNA inference analyze tumor samples individually; however, DNA sequencing of multiple samples from a cancer patient has recently become more common. We introduce HATCHet (Holistic Allele-specific Tumor Copy-number Heterogeneity), an algorithm that infers allele- and clone-specific CNAs and WGDs jointly across multiple tumor samples from the same patient. We show that HATCHet outperforms current state-of-the-art methods on multi-sample DNA sequencing data that we simulate using MASCoTE (Multiple Allele-specific Simulation of Copy-number Tumor Evolution). Applying HATCHet to 84 tumor samples from 14 prostate and pancreas cancer patients, we identify subclonal CNAs and WGDs that are more plausible than previously published analyses and more consistent with somatic single-nucleotide variants (SNVs) and small indels in the same samples.
Introduction
Cancer results from the accumulation of somatic mutations in cells, yielding a heterogeneous tumor composed of distinct subpopulations of cells, or clones, with different complements of mutations^{1}. Quantifying this intra-tumor heterogeneity and inferring past tumor evolution have been shown to be crucial in cancer treatment and prognosis^{2,3,4}. CNAs are frequent somatic mutations in cancer that amplify or delete one or both alleles of genomic segments, chromosome arms, or even entire chromosomes^{5}. In addition, WGD, a doubling of all chromosomes, is a frequent event in cancer with an estimated frequency higher than 30% in recent pan-cancer studies^{5,6,7,8}. Accurate inference of CNAs and WGDs is crucial for quantifying intra-tumor heterogeneity and reconstructing tumor evolution, even when analyzing only SNVs^{9,10,11,12,13}.
In principle, CNAs can be detected in DNA sequencing data by examining two signals: (1) the difference between the observed and expected counts of sequencing reads that align to a locus, quantified by the read-depth ratio (RDR), and (2) the proportion of reads belonging to the two distinct alleles of the locus, quantified by the B-allele frequency (BAF) of heterozygous germline single-nucleotide polymorphisms (SNPs). In practice, the inference of CNAs and WGDs from DNA sequencing data is challenging, particularly for bulk tumor samples that are mixtures of thousands to millions of cells. In such mixtures the signal from the observed reads is a superposition of the signals from normal cells and distinct tumor clones, which share the same clonal CNAs but are distinguished by different subclonal CNAs. One thus needs to deconvolve, or separate, this mixed signal into the individual components arising from each of these clones. This deconvolution is complicated as both the CNAs and the proportion of cells originating from each clone in the mixture are unknown; in general the deconvolution problem is underdetermined with multiple equivalent solutions. In the past few years, over a dozen methods have been developed to solve different versions of this copy-number deconvolution problem^{6,9,14,15,16,17,18,19,20,21,22,23,24,25,26,27}. These methods rely on various simplifying assumptions, such as: only one tumor clone is present in the sample, no WGDs, etc. While these assumptions remove ambiguity in copy-number deconvolution, it is not clear that the resulting solutions are accurate, particularly in cases of highly aneuploid tumors.
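As a rough illustration of the two signals described above, the following sketch computes an RDR and a BAF from raw read counts. The function names, the normalization by each sample's total read count, and the mirroring of the BAF to the less-covered allele are assumptions made for this example; they are not a description of any particular tool's implementation.

```python
def read_depth_ratio(tumor_reads, normal_reads, tumor_total, normal_total):
    """RDR of one bin: observed tumor depth over the depth expected from a matched normal.

    Assumption: both counts are normalized by the total reads of their sample.
    """
    observed = tumor_reads / tumor_total
    expected = normal_reads / normal_total
    return observed / expected


def b_allele_frequency(ref_reads, alt_reads):
    """BAF at one heterozygous germline SNP.

    Assumption: without phasing we cannot tell which allele is "B", so the
    frequency is mirrored to the less-covered allele and lies in [0, 0.5].
    """
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this SNP")
    return min(ref_reads, alt_reads) / total


# Hypothetical counts for a single bin and a single SNP.
rdr = read_depth_ratio(150, 100, 1_000_000, 1_000_000)
baf = b_allele_frequency(70, 30)
print(round(rdr, 3), round(baf, 3))
```

An RDR above 1 suggests a gain relative to the average copy number of the sample, and a BAF below 0.5 suggests allelic imbalance, but in a mixed sample neither value maps directly to an integer copy number.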
Single-cell DNA sequencing^{28} obviates the need for copy-number deconvolution, but remains a specialized technique with various technical and financial challenges, and thus is not yet widely used in sequencing of cancer patients, particularly in clinical settings. A valuable intermediate between DNA sequencing of single cells and DNA sequencing of a single bulk tumor sample is DNA sequencing of multiple bulk tumor samples from the same patient; these samples may be obtained from multiple regions of a primary tumor, matched primary and metastases, or longitudinal samples^{11,12,26,29,30,31}. A number of approaches have demonstrated that simultaneous analysis of SNVs from multiple tumor samples helps to resolve uncertainties in clustering SNVs into clones^{32,33} and to reduce ambiguities in inferring phylogenetic trees^{11,29,34,35,36}. However, available methods for inferring CNAs analyze individual samples, losing the important information that multiple samples from the same patient share many CNAs that occurred during tumor evolution.
To slice through the thicket of ambiguity in copy-number deconvolution, we introduce HATCHet, an algorithm to infer allele- and clone-specific CNAs as well as the proportions of distinct tumor clones jointly across one or more samples from the same patient. HATCHet provides a fresh perspective on CNA inference and includes two main algorithmic innovations that address limitations of existing methods. First, HATCHet jointly analyzes multiple samples by globally clustering RDRs and BAFs along the entire genome and across all samples, and by solving a matrix factorization problem to infer allele- and clone-specific copy numbers from all samples. In contrast, existing methods^{6,9,14,15,16,17,18,19,20,21,22,25,26,27} infer allele-specific copy numbers on each sample independently (with one exception^{25}) and locally cluster RDRs and BAFs for neighboring regions in each sample separately. Second, HATCHet separates two distinct sources of ambiguity in the copy-number deconvolution problem, the presence of subclonal CNAs and the occurrence of WGDs, and uses a model-selection criterion to distinguish these sources. In contrast, existing methods attempt to fit a unique value for the variables tumor ploidy and purity (or equivalent variables) to the observed RDRs and BAFs, conflating different sources of ambiguity in the data.
In this paper, we evaluate HATCHet on both simulated and cancer data. We show that HATCHet outperforms six current state-of-the-art methods^{9,15,17,21,22,25,27,37} on 256 samples from 64 patients simulated by MASCoTE, a framework that we develop to generate DNA sequencing data from multiple mixed samples with appropriate corrections for the differences in genome lengths between normal and tumor clones. Next, we apply HATCHet to whole-genome multi-sample DNA sequencing data from 49 samples from 10 metastatic prostate cancer patients^{11} and 35 samples from four metastatic pancreas cancer patients^{30}. We show that HATCHet’s inferred subclonal CNAs and WGDs are more plausible than those reported in published analyses and more consistent with somatic SNVs and small indels measured in the same samples, resulting in alternative reconstructions of tumor evolution and metastatic seeding patterns.
Results
HATCHet algorithm
We introduce HATCHet, an algorithm to infer allele- and clone-specific copy numbers and clone proportions for several tumor clones jointly across multiple bulk tumor samples from the same patient (Fig. 1a). The inputs to HATCHet are RDRs and BAFs for short genomic bins across k tumor samples from the same patient. We assume that each sample is a mixture of at most n clones, including the normal diploid clone and one or more tumor clones with different CNAs. The goal is to infer these clones and their proportions in each sample, where we model the effects of CNAs as m segments, or clusters of genomic bins. The two outputs of HATCHet are: (1) copy-number states (a_{s,i}, b_{s,i}) that indicate the allele-specific copy numbers a_{s,i} and b_{s,i} for each segment s in each clone i, and that form two m × n matrices A = [a_{s,i}] and B = [b_{s,i}]; and (2) clone proportions u_{i,p} that indicate the fraction of cells in sample p belonging to clone i, and that form an n × k matrix U = [u_{i,p}].
HATCHet separates the inference of A, B, and U into two modules. The first module infers the allele-specific fractional copy numbers \({f}_{s,p}^{A}={\sum }_{i}{a}_{s,i}{u}_{i,p}\) and \({f}_{s,p}^{B}={\sum }_{i}{b}_{s,i}{u}_{i,p}\) of each segment s in each sample p, which form two m × k matrices F^{A} and F^{B}. Specifically, this module has three steps. First, HATCHet computes RDRs and BAFs in short genomic bins (with a user-adjustable size set to 50 kb in our analysis) along the genome (Fig. 1b). Second, HATCHet clusters RDRs and BAFs globally along the entire genome and jointly across all samples using a Bayesian non-parametric clustering algorithm^{38} (Fig. 1c). This clustering leverages the fact that samples from the same patient have a shared evolutionary history. Finally, HATCHet aims to infer the fractional copy numbers F^{A} and F^{B}. Importantly, F^{A} and F^{B} are not measured directly and must be inferred from the observed RDRs and BAFs. However, F^{A} and F^{B} are not identifiable from DNA sequencing data of bulk tumor samples and typically have multiple equally plausible values (Supplementary Fig. 1). We show in Methods that if one knows whether or not a WGD has occurred, then F^{A} and F^{B} are determined under a few reasonable assumptions. Thus, in the third step HATCHet estimates two values of F^{A} and F^{B}, assuming the presence or absence of a WGD, and defers the selection between these alternatives until after the inference of A, B, and U, i.e., the clonal composition (Fig. 1d).
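To make the definition of the fractional copy numbers concrete, the sketch below evaluates the sums above for a small hypothetical patient with two segments, two clones (normal and one tumor clone), and two samples. All values are invented for illustration. The last step assumes that the expected BAF of a segment equals the B-allele share of its total fractional copy number, which is the usual mixture relationship rather than something stated in this section.

```python
def fractional_copy_numbers(C, U):
    """Return F with F[s][p] = sum_i C[s][i] * U[i][p].

    C is an m x n matrix of integer copy numbers (segments x clones);
    U is an n x k matrix of clone proportions (clones x samples).
    """
    m, n, k = len(C), len(U), len(U[0])
    return [[sum(C[s][i] * U[i][p] for i in range(n)) for p in range(k)]
            for s in range(m)]


# Hypothetical inputs: column 0 is the normal clone, column 1 a tumor clone.
A = [[1, 2],   # segment 0: A allele gained in the tumor clone
     [1, 1]]   # segment 1: unchanged
B = [[1, 0],   # segment 0: B allele lost in the tumor clone
     [1, 1]]
U = [[0.4, 0.8],   # normal proportion in samples 0 and 1
     [0.6, 0.2]]   # tumor proportion in samples 0 and 1

FA = fractional_copy_numbers(A, U)
FB = fractional_copy_numbers(B, U)

# Assumed relationship: expected BAF = f^B / (f^A + f^B).
expected_baf = [[FB[s][p] / (FA[s][p] + FB[s][p]) for p in range(len(U[0]))]
                for s in range(len(A))]
print(FA, FB, expected_baf)
```

The same tumor clone produces different fractional copy numbers in the two samples only because its proportion differs, which is the shared structure that a joint analysis can exploit.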
The second module of HATCHet computes A, B, and U from the inferred values of F^{A} and F^{B} by solving a matrix factorization problem. Since F^{A} = AU and F^{B} = BU, the copy-number deconvolution problem corresponds to the problem of simultaneously factoring F^{A} into the factors A, U and F^{B} into the factors B, U. In general, multiple factorizations may exist and thus HATCHet enforces additional constraints on the allowed factorizations, including a maximum copy number (\({a}_{s,i}+{b}_{s,i}\le {c}_{\max }\)), a minimum clone proportion (either \({u}_{i,p}\ge {u}_{\min }\) or u_{i,p} = 0), and evolutionary relationships among the tumor clones. HATCHet solves the resulting optimization problem using a coordinate-descent algorithm (Fig. 1e). Finally, HATCHet uses a model-selection criterion to select the number n of clones and the occurrence of WGD, as these values are unknown a priori and must be selected carefully to avoid overfitting the data. Specifically, HATCHet infers A, B, and U for every value of n and for the two values of F^{A} and F^{B} estimated in the first module. Then, HATCHet considers the trade-off between the inference of subclonal CNAs (resulting in higher n and more clones present in a sample) and WGD to select the solution (Fig. 1f, g).
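The factorization constraints can be illustrated with a small checker that scores a candidate solution (A, B, U) against given fractional copy numbers. This is only a sketch: it assumes an L1 distance as the measure of fit and that the proportions in each sample sum to 1, it omits the evolutionary constraints, and it does not perform the coordinate-descent search itself. The function names and example values are hypothetical.

```python
def matmul(C, U):
    """Plain matrix product of an m x n and an n x k list-of-lists."""
    return [[sum(C[s][i] * U[i][p] for i in range(len(U)))
             for p in range(len(U[0]))] for s in range(len(C))]


def l1_distance(F, G):
    """Sum of absolute entry-wise differences between two matrices."""
    return sum(abs(f - g) for rf, rg in zip(F, G) for f, g in zip(rf, rg))


def objective(FA, FB, A, B, U):
    """Assumed fit measure: ||F^A - AU||_1 + ||F^B - BU||_1."""
    return l1_distance(FA, matmul(A, U)) + l1_distance(FB, matmul(B, U))


def is_feasible(A, B, U, c_max, u_min, tol=1e-9):
    """Check the maximum copy number and minimum clone proportion constraints."""
    copies_ok = all(a + b <= c_max
                    for ra, rb in zip(A, B) for a, b in zip(ra, rb))
    props_ok = all(u == 0 or u >= u_min for row in U for u in row)
    # Assumption: clone proportions in every sample (column of U) sum to 1.
    sums_ok = all(abs(sum(U[i][p] for i in range(len(U))) - 1.0) < tol
                  for p in range(len(U[0])))
    return copies_ok and props_ok and sums_ok


# Hypothetical candidate: clone 0 is normal, clone 1 is a tumor clone.
A = [[1, 2], [1, 1]]
B = [[1, 0], [1, 1]]
U = [[0.4, 0.8], [0.6, 0.2]]
FA, FB = matmul(A, U), matmul(B, U)   # fractional copy numbers it implies

print(objective(FA, FB, A, B, U), is_feasible(A, B, U, c_max=8, u_min=0.1))
```

A search procedure would alternate between improving A and B for fixed U and improving U for fixed A and B, keeping only candidates for which `is_feasible` holds.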
HATCHet differs from existing methods for copy-number deconvolution in a number of ways, which are summarized in Supplementary Table 1 and further detailed in Methods. Importantly, HATCHet addresses the challenges of non-identifiability and model selection using a different strategy than existing methods. Recognizing that the estimation of F^{A}, F^{B} and their deconvolution into A, B, and U are two different sources of ambiguity in the data, HATCHet defers the selection of F^{A} and F^{B} until after the deconvolution. This allows HATCHet to consider the trade-off between solutions with many subclonal CNAs vs. solutions with WGD (Supplementary Fig. 1). In contrast, existing methods^{6,9,14,15,16,17,18,19,20,23,24,25,26,27} attempt to fit values for the variables tumor ploidy and purity (or equivalent variables) that best model the observed RDRs and BAFs. However, tumor ploidy and purity are composite variables that sum the contributions of the unknown copy numbers and proportions of multiple clones. Thus, tumor ploidy and purity are not ideal coordinates to evaluate tumor mixtures as many different clonal compositions may be equally plausible in these coordinates, particularly when more than one tumor clone is present or a WGD occurs (Supplementary Figs. 2 and 3).
HATCHet outperforms existing methods for copy-number deconvolution
We compared HATCHet with six current state-of-the-art methods for copy-number deconvolution, i.e., Battenberg^{9}, TITAN^{17}, THetA^{21,22}, cloneHD^{25}, Canopy^{37} (with fractional copy numbers from FALCON^{15}), and ReMixT^{27}, on simulated data. Most current studies that simulate DNA sequencing data from mixed samples containing CNAs do not account for the different genome lengths of distinct clones^{15,16,17,25,39,40,41,42,43,44}; this oversight leads to incorrect simulation of read counts (Supplementary Figs. 4 and 5). Therefore, we introduce MASCoTE, a simulation framework to correctly generate DNA sequencing reads from multiple bulk tumor samples, with each sample containing one or more clones that share the same evolutionary history during which CNAs and/or WGD occur (Supplementary Fig. 6 and further details in Methods). We simulated DNA sequencing reads from 256 tumor samples (1–3 tumor clones) for 64 patients (3–5 samples per patient), half with a WGD and half without a WGD (Supplementary Fig. 7).
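The genome-length issue can be seen in a toy calculation. The sketch below assumes that, under uniform sequencing, the share of reads contributed by a clone is proportional to its cell proportion multiplied by its genome length; the function names and numbers are hypothetical and do not reproduce MASCoTE's procedure.

```python
def clone_genome_length(total_copy_numbers, segment_lengths):
    """Genome length of a clone: each segment counted once per copy."""
    return sum(c * length for c, length in zip(total_copy_numbers, segment_lengths))


def read_proportions(genome_lengths, cell_proportions):
    """Share of sequencing reads expected from each clone.

    Assumption: reads are sampled uniformly from all DNA in the sample, so a
    clone contributes in proportion to (cell proportion) x (genome length).
    """
    weights = [u * g for u, g in zip(cell_proportions, genome_lengths)]
    total = sum(weights)
    return [w / total for w in weights]


segment_lengths = [100, 100]              # arbitrary units
normal = clone_genome_length([2, 2], segment_lengths)
tumor = clone_genome_length([4, 4], segment_lengths)   # genome-doubled clone

cell_props = [0.5, 0.5]                   # equal numbers of normal and tumor cells
read_props = read_proportions([normal, tumor], cell_props)
print(read_props)
```

With equal numbers of cells, the doubled clone contributes about two thirds of the reads, so a simulator that mixes reads according to cell proportions alone would under-represent it.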
We separated the comparison of methods into two parts in order to assess both the inference of CNAs and proportions, as well as the prediction of WGDs. First, we provided the true values of the main parameters inferred by each method to assess the ability to retrieve the correct solution without the difficulty of model selection. Second, we ran each method in default mode. In both cases, we applied HATCHet jointly on all samples as well as separately on each sample (single-sample HATCHet) to quantify the contribution of the global clustering and the factorization model which capture the dependency across samples. Further details are in Supplementary Notes 1 and 2.
We first ran all methods on the 128 samples from 32 patients without a WGD, providing the true value of the main parameters required for each method (e.g., tumor ploidy and number of clones). We found that HATCHet outperformed all other methods (Fig. 2a and Supplementary Figs. 8–12), demonstrating the advantages of HATCHet’s joint analysis of multiple samples. While single-sample HATCHet has slightly worse performance, it also outperformed all other methods, suggesting that the additional features of HATCHet, such as the clustering of RDRs and BAFs along the entire genome, play an important role. Further discussions of these results are in Supplementary Note 3.
To assess the simultaneous prediction of WGD and inference of CNAs and proportions, we next ran the methods on all 256 samples from all 64 patients, requiring that each method infers all relevant parameters, including tumor ploidy and number of clones. Note that we excluded THetA from this comparison as it does not automatically infer presence/absence of WGDs. Not surprisingly, in this more challenging setting, all methods have lower performance, but HATCHet and single-sample HATCHet continue to outperform the other methods (Fig. 2b and Supplementary Figs. 13–17), even when assessing the prediction of amplified/deleted segments independently from the presence of a WGD (Supplementary Fig. 18). HATCHet is the only method with high (>75%) precision and recall in the inference of both presence and absence of WGD, while other methods tend to be biased towards presence or absence (Fig. 2c and Supplementary Fig. 19). We observed the same bias even when taking the consensus of the other methods, a procedure used in the recent PCAWG analysis of >2500 whole-cancer genomes^{7} (Fig. 2c). The higher performance of HATCHet illustrates the advantage of performing model selection using the natural variables of the problem, i.e., the copy numbers A, B and the clone proportions U, rather than selecting a unique solution based on tumor ploidy and purity as done by existing methods (Supplementary Fig. 20). Further discussions of these results are in Supplementary Note 4.
Finally, we further assessed the performance of HATCHet by comparing the copy-number profiles derived by HATCHet on whole-exome sequencing of bulk tumor samples with the copy-number profiles from DOP-PCR single-cell DNA sequencing from the same tumors. On 21 bulk tumor samples from 8 breast cancer patients^{45,46}, we observed a reasonable consistency between HATCHet’s profiles and those from single cells (Supplementary Figs. 21 and 22). Additional details of this analysis are reported in Supplementary Note 5.
HATCHet identifies well-supported subclonal CNAs
We used HATCHet to analyze two whole-genome DNA sequencing datasets with multiple tumor samples from individual patients (Supplementary Note 6): 49 primary and metastatic tumor samples from 10 prostate cancer patients^{11} (Supplementary Fig. 23) and 35 primary and metastatic tumor samples from four pancreas cancer patients^{30} (Supplementary Fig. 24). While both datasets contain multiple tumor samples from individual patients, the previously published analyses inferred CNAs in each sample independently. Moreover, these studies reached opposite conclusions regarding the landscape of CNAs in these tumors: Gundem et al.^{11} reported subclonal CNAs in all prostate samples, while Makohon-Moore et al.^{30} reported no subclonal CNAs in the pancreas samples. An important question is whether this difference is due to cancer-type-specific or patient-specific differences in CNA evolution of these tumors, or a consequence of differences in the bioinformatic analyses. We investigated whether HATCHet’s analysis would confirm or refute the discordance between these studies.
On the prostate cancer dataset, HATCHet identified subclonal CNAs in 29/49 samples. In contrast, the published analysis^{11} of these samples used Battenberg for CNA inference and identified subclonal CNAs in all 49 samples (Fig. 3a). On the 29 samples where both methods reported subclonal CNAs, we found that the two methods identified a similar fraction of the genome with subclonal CNAs (Fig. 3b). Moreover, on these samples, there are clear sample-subclonal clusters of genomic bins (i.e., with different copy-number states in the same sample, cf. Fig. 1g and Supplementary Fig. 25) with RDRs and BAFs that are clearly distinct and intermediate between those of sample-clonal clusters (Fig. 3c). These sample-subclonal clusters correspond to subclonal CNAs affecting large genomic regions (Fig. 3d). In contrast, on the 20 samples where only Battenberg reported subclonal CNAs, the sample-subclonal clusters only identified by Battenberg do not have RDRs and BAFs that are clearly distinguishable from the sample-clonal clusters (Fig. 3e, f and Supplementary Figs. 26 and 27).
While it is possible that Battenberg has higher sensitivity in detecting subclonal CNAs than HATCHet, the extensive subclonal CNAs reported by Battenberg in all samples are concerning. This is because the inference of subclonal CNAs will always produce a better fit to the observed RDRs and BAFs, but at the cost of increasing the number of parameters required to describe the copy-number states (model complexity). Battenberg models the clonal composition of each segment independently (Supplementary Fig. 28), and thus has 6× more parameters than HATCHet on this dataset (Supplementary Fig. 29). To avoid overfitting, it is important to evaluate the trade-off between model fit and model complexity. Battenberg does not include a model-selection criterion to evaluate this trade-off, and it consequently infers a high fraction of subclonal CNAs in every sample (Supplementary Note 7) without fitting the observed RDRs and BAFs better than HATCHet (Supplementary Note 8). In contrast, HATCHet uses a model-selection criterion to identify the number of clones; consequently in 20/49 samples HATCHet infers that all the subclonal CNAs identified by Battenberg are instead clonal (Supplementary Fig. 30). Since HATCHet fits the observed RDRs and BAFs as well as Battenberg (Supplementary Fig. 31) but without subclonal CNAs, the extensive subclonal CNAs reported by Battenberg in these samples are equally well-explained as clonal CNAs.
Finally, we found that ReMixT’s inference of subclonal CNAs from the same dataset was more similar to HATCHet’s than to Battenberg’s (Supplementary Fig. 32). Since both HATCHet and ReMixT outperformed Battenberg on the simulated data, the similarity between HATCHet and ReMixT on this dataset suggests that Battenberg’s results are less accurate. Further details of this analysis are in Supplementary Note 9.
On the pancreas cancer dataset, HATCHet identified subclonal CNAs in 15/35 samples (Fig. 4a). In contrast, the published analysis^{30} of these samples used Control-FREEC for CNA inference, which assumes that all CNAs are clonal and contained in all tumor cells in a sample (Supplementary Fig. 33). Overall, HATCHet reported a greater fraction of the genome with CNAs (Supplementary Fig. 34) and better fit the observed RDRs and BAFs (Supplementary Fig. 35) using less than 1/3 of the parameters used by Control-FREEC (Supplementary Fig. 36). The identification of subclonal CNAs is supported by the presence of sample-subclonal clusters which have RDRs and BAFs that are clearly distinct from those of sample-clonal clusters (Fig. 4b–d); moreover, many of these clusters correspond to large subclonal CNAs spanning chromosomal arms (Fig. 4e) and have different copy-number states across different samples (i.e., tumor-subclonal clusters, cf. Fig. 1g and Supplementary Fig. 25). HATCHet’s joint analysis across multiple samples also aids in the CNA inference in low-purity samples: the liver metastasis sample Pam01_LiM1 has an inferred tumor purity of 28%, causing clusters of genomic bins with distinct copy-number states to have similar values of RDR and BAF (Fig. 4d). The distinct clusters are identified by leveraging the signal from a higher-purity sample (Pam01_LiM2 in Fig. 4c).
HATCHet’s joint analysis of multiple tumor samples from the same patient enables the direct identification of clones that are shared across multiple samples. Overall, HATCHet infers that 14/49 samples from the prostate cancer patients and 13/35 samples from the pancreas cancer patients have evidence of subclones shared between multiple samples, compared to 46/49 and 0/35 in the published copy-number analyses, respectively (Supplementary Note 10 and Supplementary Figs. 37–39). For example, HATCHet reports that the lymph node sample of pancreas cancer patient Pam01 (Fig. 4b) is a mixture of two clones with each of these clones present in exactly one of the two distinct liver metastases from the same patient (Fig. 4c, d). We found that the shared subclones identified by HATCHet are more consistent with previous SNV analyses^{11,30}. Moreover, we found that the resulting metastatic seeding patterns agree with previous reports of limited heterogeneity across metastases^{11,30} (Supplementary Note 11 and Supplementary Figs. 40–42) and provide evidence for polyclonal migrations in pancreas cancer patients (Supplementary Fig. 43), consistent with the reports of polyclonal migrations in mouse models of pancreatic tumors^{47}. Finally, additional analyses of RDRs and BAFs further support the subclonal CNAs identified by HATCHet (Supplementary Note 12 and Supplementary Figs. 44 and 45).
HATCHet reliably identifies WGDs
We next examined the prediction of WGDs on the prostate and pancreas cancer datasets. The previously published analyses of these datasets reached opposite conclusions regarding the landscape of WGDs in these tumors: Gundem et al.^{11} reported WGDs in 12 samples of 4 prostate cancer patients (A12, A29, A31, and A32), while Makohon-Moore et al.^{30} did not evaluate the presence of WGDs in the pancreas cancer samples, despite reports of high prevalence of WGD in pancreas cancer^{48}. We investigated whether HATCHet’s analysis would confirm or refute the different prevalence of WGDs reported in the previous studies.
On the prostate cancer dataset, there is strong agreement between WGD predictions from Battenberg and HATCHet, with discordance on only 2/49 samples (Supplementary Fig. 46a, b). Note that Battenberg does not explicitly state whether a WGD is present in a sample, and thus we used the criterion from previous pan-cancer analyses^{5,6,7,8,12} that a tumor sample with ploidy >3 corresponds to WGD. Since Battenberg’s solutions were manually chosen from different alternatives in the published analysis, the strong agreement between these predictions is a positive indicator for HATCHet’s automated model selection. The two discordant samples, A12C and A29C, are single samples from patients A12 and A29, respectively. Battenberg predicted a WGD only in A12C and no WGD in the other samples from A12. Conversely, Battenberg predicted no WGD in A29C but a WGD in the other sample from A29. However, these divergent predictions of WGD are likely due to Battenberg’s independent analysis of each sample and are not well supported by the data. In contrast, HATCHet jointly analyzes multiple samples and predicts the absence/presence of a WGD consistently across all samples from the same patient (no WGD in A12 and a WGD in A29), providing simpler solutions with an equally good fit of the observed data (Supplementary Note 13 and Supplementary Figs. 46–47).
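The ploidy-based criterion used above for methods that do not report WGD explicitly can be written down in a few lines. In this sketch, tumor ploidy is assumed to be the length-weighted average of the total copy number over segments of a single tumor clone; the segment tuples and function names are hypothetical.

```python
def tumor_ploidy(segments):
    """Length-weighted mean total copy number.

    Each segment is a tuple (length, a, b), where a and b are the
    allele-specific copy numbers of the tumor clone in that segment.
    """
    total_length = sum(length for length, _, _ in segments)
    weighted = sum(length * (a + b) for length, a, b in segments)
    return weighted / total_length


def has_wgd_by_ploidy(segments, threshold=3.0):
    """Call a WGD when tumor ploidy exceeds the threshold (ploidy > 3)."""
    return tumor_ploidy(segments) > threshold


# Hypothetical profiles: (length, a, b) per segment.
doubled = [(100, 2, 2), (50, 2, 1), (50, 1, 1)]
near_diploid = [(100, 1, 1), (100, 2, 1)]

print(tumor_ploidy(doubled), has_wgd_by_ploidy(doubled))
print(tumor_ploidy(near_diploid), has_wgd_by_ploidy(near_diploid))
```

Because the criterion is applied to each sample's ploidy on its own, two samples from the same patient can receive different calls, as happened for A12C and A29C.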
On the pancreas dataset, the published analysis excluded the possibility of WGDs and assumed that tumor ploidy is always equal to 2. Instead, HATCHet predicted a WGD in all 31 samples from three of the four patients (Fig. 5a). These results are consistent with recent reports of the high frequency of WGD (~45%) and massive rearrangements in pancreatic cancer^{26,48}, and also supported by additional analyses (Supplementary Fig. 48). All 31 samples from the three patients with a WGD display several large clusters of genomic regions with clearly distinct values of RDR and BAF. When jointly considering all samples from the same patient, these clusters are clearly better explained by the occurrence of a WGD (Fig. 5b) than by the presence of many subclonal CNAs, as the latter would result in the unlikely presence of distinct tumor clones with the same proportions in all samples (Supplementary Fig. 49). By directly evaluating the trade-off between subclonal CNAs and WGDs in the model selection, HATCHet makes more reasonable predictions of the occurrence of WGDs.
HATCHet’s CNAs better explain somatic SNVs and small indels
We evaluated how well the copy numbers and proportions inferred by each method explain the observed read counts of somatic SNVs and small indels, two classes of mutations that were not used in the identification of CNAs. Specifically, we compared the observed variant-allele frequency (VAF) of each mutation with the best predicted VAF obtainable from the inferred copy-number states and proportions at the genomic locus (see details in Supplementary Note 14). We classified a mutation as explained when the predicted VAF is within a 95% confidence interval (CI) (according to a binomial model with beta prior^{34,35}) of the observed VAF (Fig. 6a). When counting the number of explained mutations, we excluded mutations that have low frequency (VAF < 0.2) as well as mutations that are not explained by the copy numbers and proportions inferred by any of the methods. These excluded mutations are more likely to have occurred after CNAs and to be present in smaller subpopulations of cells.
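The classification of a mutation as explained can be sketched as an interval check. The example below assumes a uniform Beta(1, 1) prior on the true VAF, so that observing `alt` variant reads and `ref` reference reads gives a Beta(alt + 1, ref + 1) posterior, and it approximates the equal-tailed 95% interval by numerical integration using only the standard library. The prior, the integration scheme, and the function names are assumptions for illustration; the predicted VAF is taken as an input rather than derived from copy numbers.

```python
import math


def beta_interval(alt, ref, level=0.95, grid=20000):
    """Equal-tailed interval of a Beta(alt + 1, ref + 1) posterior.

    Assumption: uniform prior; quantiles are located on a midpoint grid.
    """
    a, b = alt + 1, ref + 1
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    step = 1.0 / grid
    xs = [(j + 0.5) * step for j in range(grid)]
    mass = [math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)) * step
            for x in xs]
    total = sum(mass)
    lo_target = (1 - level) / 2
    hi_target = 1 - lo_target
    lower, upper, cum = None, xs[-1], 0.0
    for x, w in zip(xs, mass):
        cum += w / total
        if lower is None and cum >= lo_target:
            lower = x          # first grid point past the lower tail
        if cum >= hi_target:
            upper = x          # first grid point past the upper tail
            break
    return lower, upper


def is_explained(predicted_vaf, alt, ref, level=0.95):
    """True when the predicted VAF falls inside the interval around the observed VAF."""
    lower, upper = beta_interval(alt, ref, level)
    return lower <= predicted_vaf <= upper


# Hypothetical mutation with 30 variant reads out of 100.
print(beta_interval(30, 70))
print(is_explained(0.30, 30, 70), is_explained(0.50, 30, 70))
```

In practice the check would be repeated for every VAF that the inferred copy-number states and clone proportions can produce at the locus, and the mutation counts as explained if any of them falls inside the interval.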
We identified ≈10,600 mutations per prostate cancer sample and ≈9,000 mutations per pancreas cancer sample (Supplementary Fig. 50). We found that for 13/14 patients the copy numbers and proportions inferred by HATCHet yield substantially fewer unexplained mutations (Fig. 6b, c) and lower errors (Supplementary Figs. 51 and 52) than the copy numbers inferred by Battenberg and Control-FREEC, respectively, with the difference on the remaining patient being small. On the prostate cancer dataset, HATCHet explains most of the mutations with high VAF, while the unexplained mutations mostly have lower VAFs (Supplementary Figs. 53 and 54), suggesting that these mutations occurred after the CNAs at the locus, as reported in the published analysis^{11}. On the pancreas cancer dataset, we observed that nearly all mutations have low VAFs (Supplementary Fig. 55), consistent with low tumor purity as well as the presence of WGDs and/or higher ploidy in these samples. Indeed, SNVs/indels that occur after WGDs alter only one copy of the locus, and thus have low VAF. As lower VAFs are also observed in samples with higher purity (e.g., Pam01_LiM2, Pam01_NoM1, and Pam02_PT18), WGDs and high ploidy are the more likely explanation for the low VAFs, consistent with HATCHet’s prediction of WGD in 3/4 patients (Supplementary Fig. 56).
Finally, for each mutation in the prostate cancer patients, we computed the cancer cell fraction (CCF), or fraction of tumor cells that harbor a copy of the mutation, using the method described in Dentro et al.^{49} and the copy numbers and proportions inferred by either Battenberg or HATCHet. Across all patients, we found that ≈11% (i.e., ≈200 mutations per patient) of the unexplained mutations that were classified as subclonal (i.e., CCF ≪ 1 and present in a subset of cells) in the published results using Battenberg’s copy numbers^{11} were explained and classified as clonal (i.e., CCF ≈ 1 and present in all tumor cells) using HATCHet’s copy numbers (Fig. 7). For example, in sample A10E of patient A10 and sample A17F of patient A17, HATCHet infers clonal CNAs on chromosomes 1p and 8q, respectively, that explain all SNVs at these loci, while Battenberg inferred subclonal CNAs at these loci that result in unexplained SNVs (Fig. 7a, b).
We found a particularly interesting case in two samples A22J and A22H of patient A22, where HATCHet explains a large cluster of mutations on chromosome 8p and classifies them as clonal (CCF ≈ 1) while Battenberg does not explain these mutations and classifies them as subclonal (CCF ≈ 0.4 and CCF ≈ 0.6 in the two samples) (Fig. 7c). This difference is due in part to different copy numbers inferred by the two methods: HATCHet assigned copy-number state (2, 0) to chromosome 8p in both samples A22H and A22J, while Battenberg inferred (2, 0) in sample A22H and (1, 0) in sample A22J. This demonstrates the advantage of HATCHet’s leveraging of information across samples from the same patient. Notably, this cluster of mutations on chromosome 8p was highlighted as the main evidence of polyclonal migration between samples A22J and A22H (corresponding to the purple cluster in Figure 1 of Gundem et al.^{11}), since the unexplained mutations are classified as subclonal in both samples based on Battenberg’s results. Based on HATCHet’s inferred copy numbers, these mutations are classified as clonal and are not evidence of polyclonal migration.
Discussion
The increasing availability of DNA sequencing data from multiple tumor samples from the same cancer patient provides the opportunity to improve the copy-number deconvolution of bulk samples into normal and tumor clones. Joint analysis of multiple tumor samples has proved to be of substantial benefit in the analysis of SNVs^{11,29,32,33,34,35,36}. However, the advantages of joint analysis have not been exploited in the analysis of CNAs, with all analyses of the prominent multi-sample sequencing datasets^{11,12,29,30} relying on CNA methods that analyze individual samples, and in some cases assuming that copy numbers are the same in all tumor cells in a sample.
In this paper, we introduced HATCHet, an algorithm to infer allele-specific CNAs and clone proportions jointly across multiple tumor samples from the same patient. HATCHet includes two major enhancements that improve performance over existing methods for copy-number deconvolution. First, we showed that with multiple samples, global clustering of read counts along the genome and across samples becomes an effective strategy, analogous to the clustering of SNVs across samples^{32,33,36} but different from the current focus in CNA inference on local segmentation of read counts along the genome. Second, we showed the advantage of separating the two sources of ambiguity in copy-number deconvolution: ambiguity in fractional copy numbers vs. ambiguity in the factorization of fractional copy numbers into integer-valued copy-number states. HATCHet defers the selection of fractional copy numbers, performing model selection in the natural coordinates of copy-number states and clone proportions. We also introduced MASCoTE, a simulator for multi-sample tumor sequencing data that correctly accounts for different genome lengths of tumor clones and WGD. Finally, we showed that HATCHet outperforms existing methods for CNA inference on simulated bulk tumor samples and produces more plausible inferences of subclonal CNAs and WGDs on two cancer datasets.
There are several areas for future improvements. First, while we have shown that HATCHet accurately recovers the major tumor clones distinguished by larger CNAs, HATCHet may miss small CNAs or CNAs at low proportions. One interesting future direction is to perform a second stage of CNA inference using a local segmentation algorithm (e.g., an HMM^{17}) informed by the clonal composition inferred by HATCHet. Second, HATCHet’s model of RDR and BAF could be improved by modeling additional sources of variation in the data, including replication timing^{50} or variable coverage across samples, by considering different generative models for RDR and BAF, and by incorporating additional signals in DNA sequencing reads, such as phasing of germline SNPs^{9,27}. Third, HATCHet’s modeling of WGD could be further generalized. While recent pan-cancer studies^{5,6,7,8,12} show that the current assumptions used in HATCHet (namely that a WGD occurs at most once as a clonal event and that additional clonal CNAs also occur) are reasonable for most tumors, HATCHet’s model could be extended to allow for multiple WGDs (e.g., hexaploid or higher ploidy), subclonal WGDs, or WGDs occurring without any other clonal CNAs. Fourth, HATCHet’s model-selection criterion could be further improved by including additional information such as a more refined model of copy-number evolution^{23,24,51,52,53}, and temporal^{54} or spatial^{13} relationships between clones. Fifth, further improvements integrating CNAs and SNVs are an important future direction. For example, phasing somatic mutations to nearby germline SNPs might provide additional information to identify explained mutations, although in the present study, only a small fraction of the mutations (<0.2% in the prostate and <0.17% in pancreas cancer patients) are on the same sequencing read as a heterozygous germline SNP.
Finally, some of the algorithmic advances in HATCHet can be leveraged in the design of better methods for inferring CNAs and WGDs in single-cell sequencing data.
The increasing availability of DNA sequencing data from multiple bulk tumor samples from the same patient provides the substrate for deeper analyses of tumor evolution over time, across space, and in response to treatment. Algorithms that maximally leverage this data to quantify the genomic aberrations and their differences across samples will be essential in translating this data into actionable insights for cancer patients.
Methods
HATCHet algorithm
We introduce HATCHet, an algorithm to infer allele- and clone-specific CNAs and clone proportions for several tumor clones jointly across multiple bulk tumor samples. We represent the accumulation of all CNAs in all clones by partitioning the L genomic positions of the reference genome into m segments, or clusters, with each segment s consisting of ℓ_{s} genomic positions with the same copy numbers in every clone. Thus, a clone i is represented by a pair of integer vectors a_{i} and b_{i} whose entries indicate the number of copies of each of the two alleles for each segment. Specifically, we define the copy-number state (a_{s,i}, b_{s,i}) of segment s in clone i as the pair of the two integer allele-specific copy numbers a_{s,i} and b_{s,i}, whose sum determines the total copy number c_{s,i} = a_{s,i} + b_{s,i}. In addition, we define clone 1 to be the normal (non-cancerous) diploid clone, and thus (a_{s,1}, b_{s,1}) = (1, 1) and c_{s,1} = 2 for every segment s of the normal clone. We represent the allele-specific copy numbers of all clones as two m × n matrices A = [a_{s,i}] and B = [b_{s,i}]. Similarly, we represent the total copy numbers of all clones as the m × n matrix C = [c_{s,i}] = A + B. Due to the effects of CNAs, the genome length \({L}_{i}=\mathop{\sum }\nolimits_{s = 1}^{m}{c}_{s,i}{\ell }_{s}\) of every tumor clone i is generally different from the genome length L_{1} = 2L of the normal clone.
We obtain DNA sequencing data from k samples of a cancer patient and we assume that each tumor sample p is a mixture of at most n clones, with clone proportion u_{i,p} indicating the fraction of cells in p that belong to clone i. Note that 0 ≤ u_{i,p} ≤ 1 and the sum of clone proportions is equal to 1 in every sample p. We say that i is present in p if u_{i,p} > 0. The tumor purity \({\mu }_{p}=\mathop{\sum }\nolimits_{i = 2}^{n}{u}_{i,p}\) of sample p is the sum of the proportions of all tumor clones present in p. We represent the clone proportions as the n × k matrix U = [u_{i,p}].
HATCHet starts from the DNA sequencing data obtained from the k samples (Fig. 1a) and infers allele- and clone-specific CNAs in two separate modules. The first module of HATCHet infers the allele-specific fractional copy numbers \({f}_{s,p}^{A}={\sum }_{i}{a}_{s,i}{u}_{i,p}\) and \({f}_{s,p}^{B}={\sum }_{i}{b}_{s,i}{u}_{i,p}\) whose sum defines the fractional copy number \({f}_{s,p}={f}_{s,p}^{A}+{f}_{s,p}^{B}\) (Fig. 1b–d). We represent the allele-specific fractional copy numbers using two m × k matrices \({F}^{A}=[{f}_{s,p}^{A}]\) and \({F}^{B}=[{f}_{s,p}^{B}]\). The second module of HATCHet infers allele- and clone-specific copy numbers A, B and clone proportions U by simultaneously factoring F^{A} = AU and F^{B} = BU (Fig. 1e, f). Importantly, HATCHet infers two values of F^{A}, F^{B} according to the absence/presence of a WGD, and uses a model-selection criterion to simultaneously choose the number n of clones and the presence/absence of a WGD while performing the copy-number deconvolution (Fig. 1g). We describe the details of these two modules in the next two sections.
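To make the notation concrete, the following minimal sketch computes F^{A} = AU, F^{B} = BU and F = F^{A} + F^{B} for a toy instance. The function name and the numerical values are illustrative assumptions and are not part of HATCHet; the sketch only shows the forward (mixing) direction of the model, not the inference.

```python
def fractional_copy_numbers(A, B, U):
    """Return (F_A, F_B, F) with F_A = A U, F_B = B U and F = F_A + F_B.

    A, B: m x n nested lists of integer allele-specific copy numbers
          (segments x clones); column 0 is the normal clone (1, 1).
    U:    n x k nested list of clone proportions (clones x samples);
          every column sums to 1.
    """
    m, n, k = len(A), len(U), len(U[0])

    def mix(X):
        # Plain matrix product X U, written out to stay dependency-free.
        return [[sum(X[s][i] * U[i][p] for i in range(n)) for p in range(k)]
                for s in range(m)]

    F_A, F_B = mix(A), mix(B)
    F = [[F_A[s][p] + F_B[s][p] for p in range(k)] for s in range(m)]
    return F_A, F_B, F


# Hypothetical toy instance: m = 3 segments, n = 3 clones, k = 2 samples.
A = [[1, 1, 1], [1, 2, 2], [1, 1, 2]]
B = [[1, 1, 1], [1, 1, 1], [1, 0, 0]]
U = [[0.5, 0.2],   # normal clone
     [0.5, 0.3],   # tumor clone 2
     [0.0, 0.5]]   # tumor clone 3 (absent from the first sample)
F_A, F_B, F = fractional_copy_numbers(A, B, U)
```

In this toy instance the first segment is diploid in every clone, so its fractional copy number is 2 in both samples regardless of U.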
Inference of allele-specific fractional copy numbers
The first module of HATCHet aims to infer the allele-specific fractional copy numbers F^{A} and F^{B} from the DNA sequencing data of k samples. This module has three steps.
The first step of the first module is the computation of RDRs and BAFs (Fig. 1b), which are derived from the DNA sequencing data for every genomic region in each sample. The RDR r_{s,p} of a segment s in sample p is directly proportional to the fractional copy number f_{s,p}. The BAF β_{s,p} measures the proportion of the two allele-specific fractional copy numbers \({f}_{s,p}^{A},{f}_{s,p}^{B}\) in F^{A} and F^{B}, respectively. HATCHet computes RDRs and BAFs by partitioning the reference genome into short genomic bins (50 kb in this work) and using the same approach as existing methods^{6,9,14,15,16,17,18,19,20,21,22,23,24,25,26,27} to compute appropriate normalizations of sequencing read counts with a matched-normal sample, accounting for GC bias and other biases. Further details are in Supplementary Methods 1 and 2.
The second step of the first module is the inference of the genomic segments that have undergone CNAs directly from the measured RDRs and BAFs. The standard approach to derive such segments is to assume that neighboring genomic loci with similar values of RDR and BAF are likely to have the same copy-number state in a sample. All current methods for CNA identification rely on such local information, and use segmentation approaches, such as Hidden Markov Models (HMMs) or change-point detection, to cluster RDRs and BAFs for neighboring genomic regions^{9,14,15,17,18,19,55,56,57}. With multiple sequenced samples from the same patient, one can instead take a different approach of identifying segments with the same copy-number state by globally clustering RDRs and BAFs along the entire genome and simultaneously across multiple samples (Supplementary Fig. 57). Two previous methods, FACETS^{18} and CELLULOID^{26}, clustered segments obtained from a local segmentation algorithm into a small number of distinct copy-number states. HATCHet introduces a global clustering which extends this previous approach in two ways: first, HATCHet jointly analyzes multiple samples from the same patient and, second, HATCHet does not rely on local segmentation.
HATCHet uses a nonparametric Bayesian clustering algorithm^{38} to globally cluster the RDRs and BAFs of all genomic bins jointly across all samples (Fig. 1c). Further details are in Supplementary Method 2. Each cluster corresponds to a collection of segments with the same copy-number state in each tumor clone. These clusters are used to define the entries of F^{A} and F^{B}, playing the role of the segments described above. Although we do not require that clusters contain neighboring genomic loci, we find in practice that our clusters exhibit such locality (see results on cancer datasets). By clustering globally we preserve local information, but the converse does not necessarily hold. The joint clustering across multiple samples is particularly useful in the analysis of samples with low tumor purity. While variations in the values of RDR and BAF cannot be easily distinguished from noise in a single sample with low tumor purity, jointly clustering across samples leverages information from higher-purity samples to assist in clustering of lower-purity samples (see results on the pancreas cancer dataset).
The last step of the first module is the explicit inference of the allele-specific fractional copy numbers F^{A} and F^{B} from the RDRs and BAFs of the previously inferred segments (Fig. 1d). Existing methods^{6,9,14,15,16,17,18,19,20,21,22,25,26,27}, including widely used methods such as ABSOLUTE^{6}, ASCAT^{14}, Battenberg^{9}, TITAN^{17}, and cloneHD^{25}, do not attempt to directly infer fractional copy numbers, but rather attempt to fit other variables, specifically the tumor ploidy and tumor purity (or equivalent variables such as the haploid coverage, Supplementary Method 1). However, the values of these variables are difficult to infer^{21,22,25,27} and often require manual selection^{6,7,12,27}. Further details regarding tumor purity and tumor ploidy are reported below in the comparison of HATCHet and existing methods.
We introduce an approach to estimate F^{A} and F^{B} with rigorous and clearly stated assumptions. First, in the case without a WGD, we assume there is a reasonable number of genomic positions in segments whose total copy number is 2 in all clones; this is generally true if a reasonable proportion of the genome is not affected by CNAs and hence diploid. Second, in the case where a WGD occurs, we assume there are two groups of segments whose total copy numbers are distinct and the same in all clones; this is also reasonable if some segments are affected only by WGD and tumor clones accumulate clonal CNAs during tumor evolution. More specifically, we scale the RDR r_{s,p} of each segment s in every sample p into the fractional copy number f_{s,p} and separate f_{s,p} into the allele-specific fractional copy numbers \({f}_{s,p}^{A},{f}_{s,p}^{B}\) using the BAF β_{s,p}. The following theorem states that the assumptions above are sufficient for scaling RDRs to fractional copy numbers.
Theorem 1: The fractional copy number f_{s,p} of each segment s in each sample p can be derived uniquely from the RDR r_{s,p} and either (1) a diploid clonal segment \({s}^{\prime}\) with total copy number \({c}_{{s}^{\prime},i}=2\) in every clone i or (2) two clonal segments \({s}^{\prime}\) and \({z}^{\prime}\) with total copy numbers \({c}_{{s}^{\prime},i}={\omega }_{{s}^{\prime}}\) and \({c}_{{z}^{\prime},i}={\omega }_{{z}^{\prime}}\) for all tumor clones i, and such that \({r}_{{s}^{\prime},p}({\omega }_{{z}^{\prime}}-2)\ne {r}_{{z}^{\prime},p}({\omega }_{{s}^{\prime}}-2)\) for all samples p.
Notably, this theorem states that the scaling is independent of other copy numbers in A, B, and C as well as the clone proportions in U.
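The scaling in Theorem 1 can be illustrated with a short sketch. It assumes the proportional model stated above, r_{s,p} = γ_{p} f_{s,p} for an unknown per-sample factor γ_{p}, and, in case (2), that a clonal segment with copy number ω in all tumor clones has fractional copy number 2(1 − μ_{p}) + ω μ_{p}. The closed form for γ_{p} below is our own derivation from these two assumptions, not a transcription of Supplementary Method 3. The BAF convention β = f^{B}/f and all function names are likewise assumptions made for illustration.

```python
def scale_no_wgd(rdr, rdr_diploid):
    """Case (1): anchor on a clonal segment with total copy number 2.

    Since f = 2 for the anchor, gamma = rdr_diploid / 2 and f = rdr / gamma.
    """
    return 2.0 * rdr / rdr_diploid


def scale_wgd(rdr, r_s, omega_s, r_z, omega_z):
    """Case (2): anchor on two clonal segments with tumor copy numbers
    omega_s != omega_z and observed RDRs r_s, r_z in the same sample.

    Solving r = gamma * (2 + mu * (omega - 2)) for both anchors gives
    gamma = [r_s (omega_z - 2) - r_z (omega_s - 2)] / [2 (omega_z - omega_s)].
    """
    numerator = r_s * (omega_z - 2) - r_z * (omega_s - 2)
    if omega_s == omega_z or numerator == 0:
        raise ValueError("anchor segments do not determine the scale")
    gamma = numerator / (2.0 * (omega_z - omega_s))
    return rdr / gamma


def split_by_baf(f, baf):
    """Split f into (f_A, f_B), assuming baf = f_B / f (illustrative)."""
    return (1.0 - baf) * f, baf * f


# Hypothetical sample: scale gamma = 0.4 and purity mu = 0.6 are hidden
# from the scaling functions and used only to generate the toy RDRs.
gamma_true, mu_true = 0.4, 0.6
r_s = gamma_true * (2 * (1 - mu_true) + 4 * mu_true)  # anchor with state (2, 2)
r_z = gamma_true * (2 * (1 - mu_true) + 3 * mu_true)  # anchor with state (2, 1)
f_wgd = scale_wgd(gamma_true * 3.5, r_s, 4, r_z, 3)   # recovers f = 3.5
f_diploid_anchor = scale_no_wgd(1.0, 0.8)             # recovers f = 2.5
f_a, f_b = split_by_baf(2.6, 1.0 / 2.6)
```

Note that the condition in case (2) of the theorem is exactly the requirement that the numerator above is non-zero, so that γ_{p} is well defined.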
To apply this theorem, HATCHet employs a heuristic to identify the required segments and their total copy numbers; this heuristic leverages the RDRs and BAFs jointly across all samples. First, in the case of no WGD, we aim to identify diploid segments with a copy-number state (1, 1). These segments are straightforward to identify: first, diploid segments will have β_{s,p} ≈ 0.5 in all samples p; and second, we assume that a reasonable proportion of the genome in all samples will be unaffected by CNAs and thus have state (1, 1). As such, we identify the largest cluster of segments with β_{s,p} ≈ 0.5 in all samples p, and apply Theorem 1. Second, in the case of a WGD, we assume that at most one WGD occurs and that the WGD affects all tumor clones. These assumptions are consistent with previous pan-cancer studies of WGDs^{5,6,7,8,12}. Under these assumptions, the segments s with β_{s,p} ≈ 0.5 have copy-number state (2, 2), as a WGD doubles all copy numbers. Thus, we use the second condition of Theorem 1 and aim to find another group of segments with the same state in all tumor clones. More specifically, HATCHet finds segments whose RDRs and BAFs in all samples indicate copy-number states that result from single-copy amplifications or deletions occurring before or after a WGD^{5}; for example, copy-number state (2, 0) is associated with a deletion occurring before a WGD while copy-number state (2, 1) is associated with a deletion occurring after a WGD. Moreover, we select only those groups of segments whose RDRs and BAFs relative to other segments are preserved in all samples; such preservation indicates that the copy-number state is fixed in all tumor clones (Fig. 1f). Further descriptions of Theorem 1 and this heuristic are in Supplementary Method 3.
Inferring allele- and clone-specific copy numbers and clone proportions
The second module of HATCHet aims to derive allele- and clone-specific copy numbers A, B and clone proportions U from the two values of allele-specific fractional copy numbers F^{A} and F^{B} that were estimated in the first module. The second module has two steps.
The first step of the second module is the inference of A, B, and U from each estimated value of F^{A} and F^{B} (Fig. 1e). Since the samples from the same patient are related by the same evolutionary process, we model the fractional copy numbers jointly across the k samples such that F^{A} = AU and F^{B} = BU. As such, the problem that we face is to simultaneously factorize F^{A} and F^{B} into the corresponding allele-specific copy numbers A, B and clone proportions U for some number n of clones. Formally, we have the following problem.
Problem 1: (Allele-specific Copy-number Factorization (ACF) problem) Given the allele-specific fractional copy numbers F^{A} and F^{B} and the number n of clones, find allele-specific copy numbers A = [a_{s,i}], B = [b_{s,i}] and clone proportions U = [u_{i,p}] such that F^{A} = AU and F^{B} = BU.
While the ACF problem is a mathematically elegant description of the copy-number deconvolution problem, there are two main practical issues: first, measurement errors in F^{A} and F^{B} may result in the ACF problem having no solution, and second, the ACF problem is an underdetermined problem and multiple factorizations of a given F^{A} and F^{B} may exist. To address the first issue, we do not solve the simultaneous factorization F^{A} = AU and F^{B} = BU exactly, but rather minimize the distance between the estimated fractional copy numbers F^{A} and F^{B} and the factorizations AU and BU, respectively, weighted by the corresponding size of the clusters. In particular, we define the distance \(\parallel {F}^{A}-AU\parallel =\mathop{\sum }\nolimits_{s = 1}^{m}\mathop{\sum }\nolimits_{p = 1}^{k}{\ell }_{s}\left|{f}_{s,p}^{A}-{\sum }_{1\le i\le n}{a}_{s,i}{u}_{i,p}\right|\), where ℓ_{s} is the genomic length of the cluster s. We also define the corresponding distance for F^{B}, B, and U.
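As a small illustration of this objective, the sketch below evaluates the length-weighted distance between an observed matrix of fractional copy numbers and a candidate factorization. The function name and the toy values are assumptions chosen for illustration.

```python
def weighted_l1_distance(F, X, U, lengths):
    """Length-weighted L1 distance between F and the product X U.

    F:       m x k observed fractional copy numbers (one allele).
    X:       m x n candidate integer copy numbers for the same allele.
    U:       n x k candidate clone proportions.
    lengths: genomic length of each of the m clusters.
    """
    n = len(U)
    total = 0.0
    for s, row in enumerate(F):
        for p, observed in enumerate(row):
            mixed = sum(X[s][i] * U[i][p] for i in range(n))
            total += lengths[s] * abs(observed - mixed)
    return total


# Hypothetical instance: 2 clusters, 2 clones, 2 samples.
A = [[1, 1], [1, 3]]
U = [[0.5, 0.2], [0.5, 0.8]]
lengths = [100, 50]
F_A_observed = [[1.0, 1.0], [2.0, 2.5]]  # last entry is off by 0.1 from A U
distance = weighted_l1_distance(F_A_observed, A, U, lengths)
```

Only the second cluster in the second sample deviates from AU (2.5 observed vs. 2.6 expected), so the distance is approximately 50 × 0.1 = 5. Weighting by ℓ_{s} means that an error of the same magnitude on a longer cluster is penalized more.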
To address the second issue of an underdetermined system, we impose three additional and reasonable constraints. All of these constraints are optional and user-selectable. First, since we do not expect copy numbers to be arbitrarily high (especially for large genomic regions), we assume that the total copy numbers are at most a value \({c}_{\max }\). Second, to avoid overfitting errors in fractional copy numbers by clones with low proportions, we require a minimum clone proportion \({u}_{\min }\) for every tumor clone present in any sample. Third, we impose an evolutionary relationship between the tumor clones requiring that each allele of every segment s cannot be simultaneously amplified and deleted in distinct clones; i.e., either a_{s,i} ≥ θ or a_{s,i} ≤ θ for all clones i, where θ = 1 when there is no WGD and θ = 2 when there is a WGD. The same constraint also holds for b_{s,i}. These constraints improve the solutions to the copy-number deconvolution problem^{23,24} and are less restrictive than the ones usually applied in current methods which, for example, assume that: tumor clones have at most two copy-number states (a_{s,i}, b_{s,i}), (a_{s,j}, b_{s,j}) per segment and the difference between allele-specific copy numbers is at most 1^{9,19}, i.e., ∣a_{s,i} − a_{s,j}∣ ≤ 1 and ∣b_{s,i} − b_{s,j}∣ ≤ 1; or all clones have either a diploid copy-number state (1, 1) or a unique aberrant state (a, b) ≠ (1, 1) in every cluster s^{17,18}; or every tumor clone i has either c_{s,i} ≥ 2 or c_{s,i} ≤ 2 for every cluster s^{21,22}; or there always exist segments with total copy number equal to 2^{18,25}. We thus have the following problem.
Problem 2: (Distance-based Constrained Allele-specific Copy-number Factorization (D-CACF) problem) Given the allele-specific fractional copy numbers F^{A} and F^{B}, a number n of clones, a maximum total copy number \({c}_{\max }\), a minimum clone proportion \({u}_{\min }\), and a constant value θ ∈ {1, 2}, find allele-specific copy numbers A = [a_{s,i}], B = [b_{s,i}] and clone proportions U = [u_{i,p}] such that: the distance D = ∥F^{A} − AU∥ + ∥F^{B} − BU∥ is minimum; \({a}_{s,i}+{b}_{s,i}\le {c}_{\max }\) for every cluster s and clone i; either \({u}_{i,p}\ge {u}_{\min }\) or u_{i,p} = 0 for every clone i and sample p; for every cluster s, either a_{s,i} ≥ θ or a_{s,i} ≤ θ for all clones i; for every cluster s, either b_{s,i} ≥ θ or b_{s,i} ≤ θ for all clones i.
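The feasibility conditions of Problem 2 can be checked with a few lines of code. The sketch below is illustrative only: the function name is hypothetical, and it makes the interpretive assumption that the minimum-proportion and direction constraints are applied to the tumor clones (columns and rows with index ≥ 1), since the normal clone is fixed at (1, 1). It does not solve the optimization problem; it only tests whether a candidate (A, B, U) is admissible.

```python
def satisfies_constraints(A, B, U, c_max, u_min, theta):
    """Check a candidate (A, B, U) against the constraints of Problem 2.

    A, B: m x n integer copy numbers; column 0 is the normal clone.
    U:    n x k clone proportions; row 0 is the normal clone.
    Assumption: the u_min and theta constraints are applied to tumor
    clones only (index >= 1).
    """
    for a_row, b_row in zip(A, B):
        # Maximum total copy number for every cluster and clone.
        if any(a + b > c_max for a, b in zip(a_row, b_row)):
            return False
        # Each allele is either never below theta or never above theta
        # across the tumor clones of the cluster.
        for row in (a_row, b_row):
            tumor = row[1:]
            if not (all(x >= theta for x in tumor)
                    or all(x <= theta for x in tumor)):
                return False
    # A tumor clone is either absent (0) or present with at least u_min.
    for proportions in U[1:]:
        if any(0 < u < u_min for u in proportions):
            return False
    return True


# Hypothetical candidate without WGD (theta = 1).
A = [[1, 1, 1], [1, 2, 3]]
B = [[1, 1, 1], [1, 1, 0]]
U = [[0.4, 0.1], [0.6, 0.0], [0.0, 0.9]]
feasible = satisfies_constraints(A, B, U, c_max=12, u_min=0.03, theta=1)

# Allele A of the second cluster is deleted in one clone and amplified in another.
A_mixed = [[1, 1, 1], [1, 0, 2]]
mixed_direction = satisfies_constraints(A_mixed, B, U, 12, 0.03, 1)
```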
We design a coordinate-descent algorithm^{23,24} to solve this problem by separating the inference of A, B from the inference of U and iterating these two steps until convergence for multiple random restarts. We also derive an ILP formulation that gives exact solutions for small instances. HATCHet uses one of these two algorithms to infer A, B, U from F^{A}, F^{B}. Further details of this problem and methods are in Supplementary Method 4.
Finally, the last step of the second module uses a model-selection criterion to jointly select the number n of clones and the occurrence of a WGD (Fig. 1f). Model selection is essential because variations in the fractional copy numbers F^{A} and F^{B} can be fit by increasing the total number n of clones, increasing the number of clones present in a sample, or introducing additional copy-number states in a sample by inferring subclonal CNAs or WGD. There is a tradeoff between these options. For example, a collection of clusters that exhibit many different copy-number states may be explained in different ways: e.g., one could increase n and mark some clusters as subclonal, or one could infer the presence of a WGD which will increase the number of clonal copy-number states (Supplementary Fig. 1). Existing methods either: do not perform model selection and assume that the number n of clones is known^{21,22,26,27}; consider segments independently^{6,9,17,18,19} (Supplementary Fig. 28), perhaps increasing the sensitivity to detect small subclonal CNAs, but with a danger of overfitting the data and overestimating n and the presence of subclonal CNAs; or ignore the tradeoff between subclonal CNAs (related to a higher number of clones) and WGD (related to a higher value of tumor ploidy) by not evaluating the presence or absence of WGD in the model selection. Following the factorizations of the two values of F^{A} and F^{B} (corresponding to the cases of WGD and no WGD), HATCHet chooses the simplest solution that minimizes the total number n of clones across all samples. Further details on the model-selection procedure are in Supplementary Method 5.
Comparison of HATCHet and existing methods for copy-number deconvolution
We summarize some of the main differences between HATCHet and existing methods for copy-number deconvolution (see also Supplementary Table 1). First, HATCHet models allele-specific copy numbers, while many methods do not^{20,21,22,23,24}. Second, HATCHet models dependencies between segments as clones, while most of the widely used methods^{6,9,14,15,16,17,18,19} analyze each segment independently and discard the global dependency between segments (Supplementary Fig. 28). Third, HATCHet models dependencies between samples and uses a global clustering approach to infer segments jointly across samples. In contrast, existing methods^{6,9,14,15,16,17,18,19,20,21,22,26,27} analyze samples independently and do not preserve clonal structure across samples; the one exception is cloneHD^{25}, but cloneHD infers segments from each sample independently, and also assumes that every sample comprises the same set of few (2–3) clones and is thus not suitable to analyze samples comprising distinct clones^{25}.
Finally, HATCHet introduces an explicit model-selection criterion to select among different allele- and clone-specific copy numbers and clone proportions that explain the observed DNA sequencing data. There are often multiple possible mixtures of allele-specific copy numbers that explain the measured RDRs and BAFs: for example, segments with distinct values of RDR and BAF could be explained as either subclonal CNAs or clonal CNAs with high copy numbers, e.g., due to a WGD. It is difficult to distinguish these cases because the total length of the genome of each tumor clone is unknown. HATCHet introduces a model-selection criterion which separates two distinct sources of this ambiguity: (1) the inference of allele-specific fractional copy numbers F^{A} and F^{B}, which are not uniquely determined by the measured RDRs and BAFs; (2) the inference of the allele- and clone-specific copy numbers A, B and the clone proportions U. Importantly, HATCHet evaluates two possible values of F^{A}, F^{B} (corresponding to the occurrence of WGD or not) and defers the selection of a solution until after the copy-number deconvolution. Thus, HATCHet performs model selection in the natural coordinates of the problem, i.e., A, B, and U, and evaluates the tradeoff between inferring subclonal CNAs (and thus more clones present in a sample) or a WGD (Supplementary Fig. 1), when modeling a large number of distinct copy-number states. Supplementary Table 2 lists the parameters used in HATCHet’s model-selection criterion and Supplementary Table 3 provides the default values of these parameters.
In contrast, existing methods^{6,9,14,15,16,17,18,19,20,21,22,25,26,27} for copy-number deconvolution do not distinguish different solutions using the variables A, B, and U, but rather use the variables tumor purity μ_{p} = 1 − u_{1,p} and tumor ploidy \({\rho }_{p}=\frac{1}{{\mu }_{p}}\frac{\mathop{\sum }\limits_{i=2}^{n}{u}_{i,p}{L}_{i}}{L}\) (or equivalent variables such as the haploid coverage, Supplementary Method 1). However, tumor purity μ_{p} and tumor ploidy ρ_{p} are composite variables that sum the contributions of the unknown integer copy numbers A, B and the proportions U of multiple clones in a sample. Because of their composite nature, tumor purity and tumor ploidy are both difficult to infer^{21,22,25,27} and not ideal coordinates to evaluate tumor mixtures. This is because multiple values of tumor purity μ_{p} and tumor ploidy ρ_{p} may be equally plausible for the same values of RDR and BAF, particularly when more than one tumor clone is present or when a WGD occurs (Supplementary Figs. 2 and 3). Not surprisingly, existing methods that rely on tumor purity and ploidy typically require manual inspection of the output to evaluate the presence of WGD^{6,7,12,27}; the few methods that automate the prediction of WGD are based on biased criteria or unstated, restrictive assumptions^{9,17,25}.
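The composite nature of these two variables is easy to see when they are computed from C and U. The following sketch evaluates μ_{p} and ρ_{p} for a single sample directly from the definitions above; the function names and the toy values are illustrative assumptions.

```python
def genome_lengths(C, lengths):
    """Genome length L_i = sum_s c_{s,i} * l_s of every clone i."""
    n = len(C[0])
    return [sum(C[s][i] * lengths[s] for s in range(len(C)))
            for i in range(n)]


def purity_and_ploidy(C, lengths, u):
    """Tumor purity and tumor ploidy of one sample.

    C:       m x n total copy numbers; column 0 is the normal clone.
    lengths: genomic length of each of the m segments.
    u:       clone proportions of the sample (u[0] is the normal clone);
             the sample is assumed to contain tumor cells (purity > 0).
    """
    L = sum(lengths)                      # haploid reference length
    L_i = genome_lengths(C, lengths)
    purity = 1.0 - u[0]
    tumor_content = sum(u[i] * L_i[i] for i in range(1, len(u)))
    ploidy = tumor_content / (purity * L)
    return purity, ploidy


# Hypothetical sample: normal clone, a near-diploid clone, a tetraploid clone.
C = [[2, 2, 4],
     [2, 3, 4]]
lengths = [60, 40]
u = [0.2, 0.3, 0.5]
purity, ploidy = purity_and_ploidy(C, lengths, u)
```

Here the purity is 0.8 and the ploidy is (0.3 × 240 + 0.5 × 400) / (0.8 × 100) = 3.4, a value that does not correspond to the ploidy of either tumor clone (2.4 and 4).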
We note that HATCHet does not directly reconstruct a tumor phylogenetic tree. However, the copy numbers inferred by HATCHet can be used as input to methods for phylogenetic reconstruction. For example, the integer copy numbers inferred by HATCHet can be input to MEDICC^{51} or CNT^{52,53}, and the fractional copy numbers can be input to CNTMD^{23,24} or Canopy^{37}.
Simulating bulk tumor sequencing data with MASCoTE
We introduce MASCoTE, a method to simulate DNA sequencing data from multiple bulk tumor samples that correctly accounts for tumor clones with varying genome lengths. The simulation of DNA sequencing data from bulk tumor samples that contain large-scale CNAs is not straightforward, and subtle mistakes are common in previous studies. Suppose R sequencing reads are obtained from a sample consisting of n clones with clone proportions u_{1}, …, u_{n}. Assuming that reads are uniformly sequenced along the genome and across all cells, what is the expected proportion v_{i} of reads that originated from clone i? Most current studies^{15,16,17,25,39,40,41,42,43,44} that simulate sequencing reads from mixed samples compute v_{i} as a function of u_{i} without taking into account the corresponding genome length L_{i}. For example, Ha et al.^{17} and Adalsteinsson et al.^{39} artificially form a mixed sample of two clones by mixing reads from two other given samples in proportions \({v}_{i}=\frac{{u}_{i}}{{\tilde{u}}_{i}}\) where \({\tilde{u}}_{i}\) is the clone proportion of the single tumor clone i uniquely present in a given sample. Similarly, Salcedo et al.^{42} simulate the reads for each segment s separately by setting \({v}_{s,i}={\ell }_{s}\frac{{c}_{s,i}{u}_{i}}{M}\) for every clone i where \(M=\mathop{\max }\limits_{s}{f}_{s}\) is the maximum fractional copy number. However, such values of v_{i} are the correct proportions only when the genome lengths of all clones are equal, e.g., L_{i} = 2L for every clone i. Using an incorrect proportion v_{i} leads to incorrect simulations of read counts, particularly in samples containing WGDs or multiple large-scale CNAs in different clones (Supplementary Figs. 4 and 5).
In fact, read counts depend on the genome lengths of all clones in the sample^{58} and the correct proportion \({v}_{i}=\frac{{u}_{i}{L}_{i}}{\mathop{\sum }\limits_{j=1}^{n}{u}_{j}{L}_{j}}\) is equal to the fraction of genome content in a sample belonging to the cells of clone i. Moreover, the expected proportion v_{s,i} of reads in segment s that originate from clone i is equal to \({v}_{s,i}={\ell }_{s}\frac{{c}_{s,i}{u}_{i}}{\mathop{\sum }\limits_{j=1}^{n}{u}_{j}{L}_{j}}\), the fraction of the genome content from segment s belonging to the cells of clone i (Supplementary Method 1).
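A minimal numerical sketch of these two formulas follows. The function names and toy values are illustrative assumptions and do not reproduce the MASCoTE implementation.

```python
def read_proportions(u, genome_len):
    """Expected fraction v_i of reads from each clone: u_i L_i / sum_j u_j L_j."""
    total = sum(u_j * L_j for u_j, L_j in zip(u, genome_len))
    return [u_i * L_i / total for u_i, L_i in zip(u, genome_len)]


def segment_read_proportions(C, lengths, u):
    """Expected fraction v_{s,i} of reads from segment s of clone i.

    C:       m x n total copy numbers (segments x clones).
    lengths: genomic length l_s of each segment.
    u:       clone proportions of the sample.
    """
    n = len(u)
    genome_len = [sum(C[s][i] * lengths[s] for s in range(len(C)))
                  for i in range(n)]
    total = sum(u[j] * genome_len[j] for j in range(n))
    return [[lengths[s] * C[s][i] * u[i] / total for i in range(n)]
            for s in range(len(C))]


# Hypothetical sample: 50% normal cells and 50% cells of a clone with a WGD.
C = [[2, 4],
     [2, 4]]
lengths = [60, 40]
u = [0.5, 0.5]
v = read_proportions(u, [200, 400])
v_seg = segment_read_proportions(C, lengths, u)
```

Although the two clones have equal cell proportions, the doubled clone contributes two thirds of the reads, so using v_{i} = u_{i} = 0.5 would under-sample it. Summing v_{s,i} over all segments s recovers v_{i}.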
To address these issues, we develop MASCoTE to correctly simulate DNA sequencing reads of multiple mixed samples obtained from the same patient (Supplementary Fig. 6). MASCoTE simulates the genomes of a normal clone and n − 1 tumor clones, which accumulate CNAs and WGDs during tumor evolution; these clones are related via a phylogenetic tree. As such, every sample comprises a subset of these clones and the corresponding sequencing reads are simulated according to the genome lengths and proportions of the clones. More specifically, MASCoTE is composed of four steps: (1) MASCoTE simulates a diploid haplotype-specific germline genome (Supplementary Fig. 6a); (2) MASCoTE simulates the genomes of n − 1 tumor clones that acquire different kinds of CNAs and WGDs (according to the distributions in size and quantity reported in previous pan-cancer studies^{5}) in random order through a random phylogenetic tree (Supplementary Fig. 6b); (3) MASCoTE simulates the sequencing reads from the genome of each clone through standard methods^{59} (Supplementary Fig. 6c); (4) MASCoTE simulates each sample p by considering an arbitrary subset of the clones (always containing the normal clone) with random clone proportions and by mixing the corresponding reads using the read proportion \({v}_{i,p}=\frac{{u}_{i,p}{L}_{i}}{\sum _{1\le j\le n}{u}_{j,p}{L}_{j}}\) (Supplementary Fig. 6d). Further details about this procedure are in Supplementary Method 6.
Bioinformatics analysis
We applied MASCoTE with default values of all parameters to simulate DNA sequencing reads for 64 patients, half with a WGD and half without a WGD. For each patient, we simulated 3–5 bulk tumor samples, with a total of 256 samples across all 64 patients. For each patient, we simulated 2–4 clones (including the normal clone) with CNAs of varying size using the relative frequencies reported in pan-cancer analysis^{5} and including: focal CNAs < 1 Mb, small CNAs between 3 and 5 Mb, medium CNAs between 10 and 20 Mb, and chromosome-arm and whole-chromosome CNAs. We provided the human reference genome hg19 and the database dbSNP of known SNPs^{60} to MASCoTE for generating a haplotype-specific genome for each normal clone. We ran every method on the simulated samples by using the default available pipelines. Details about the experimental setting of every method are described in Supplementary Note 1.
We applied HATCHet to analyze 49 samples from 10 prostate cancer patients in Gundem et al.^{11} and 35 samples from four pancreas cancer patients in Makohon et al.^{30} using the published BAM files. In addition to one or more BAM files from the same patient, HATCHet requires two other sources of information: a matched-normal sample and the reference genome used to align the sequencing reads. We used the available matched-normal sample for every patient and the reference genome corresponding to the alignments in the BAM files, i.e., GRCh37 for the prostate cancer patients and hg19 for the pancreas cancer patients. HATCHet used BCFtools (v1.7)^{61} to identify germline heterozygous SNPs with the provided matched-normal sample and reference genome. For each patient, we applied HATCHet on all the corresponding samples using the default values of all parameters: genomic bin size of 50 kb, maximum total copy number \({c}_{\max }=12\), and minimum clone proportion \({u}_{\min }=0.03\) (for patients A22, A21, Pam03, and Pam04, we used \({u}_{\min }=0.15\) since these patients exhibited high variance in RDRs and BAFs). Further details are in Supplementary Note 6. Lastly, we used VarScan 2 (v2.3.9)^{62} with default parameters and filters to identify somatic SNVs and small indels.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
Whole-genome DNA sequencing data for the prostate and pancreas cancer datasets analyzed in this study are available from the European Genome-phenome Archive (EGA) under accession numbers EGAS00001000262 and EGAS00001002186, respectively. Whole-exome DNA sequencing data for breast cancer patients in Kim et al.^{45} and Casasent et al.^{46} are available from the NCBI Sequence Read Archive (SRA) under accession numbers SRP114962 and SRP116771. All the processed simulated data, the results of all methods on simulated data, and the results of HATCHet on the prostate and pancreas cancer datasets are available on GitHub from https://github.com/raphael-group/hatchet-paper and on Zenodo from https://doi.org/10.5281/zenodo.3830088.
Code availability
HATCHet is available on GitHub at https://github.com/raphael-group/hatchet. MASCoTE is available on GitHub at https://github.com/raphael-group/mascote.
References
Nowell, P. C. The clonal evolution of tumor cell populations. Science 194, 23–28 (1976).
Ciriello, G. et al. Emerging landscape of oncogenic signatures across human cancers. Nat. Genet. 45, 1127 (2013).
Burrell, R. A., McGranahan, N., Bartek, J. & Swanton, C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature 501, 338 (2013).
McGranahan, N. & Swanton, C. Biological and therapeutic impact of intratumor heterogeneity in cancer evolution. Cancer Cell 27, 15–26 (2015).
Zack, T. I. et al. Pan-cancer patterns of somatic copy number alteration. Nat. Genet. 45, 1134 (2013).
Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 30, 413 (2012).
Dentro, S. C. et al. Characterizing genetic intratumor heterogeneity across 2,658 human cancer genomes. Preprint at https://doi.org/10.1101/312041 (2018).
Bielski, C. M. et al. Genome doubling shapes the evolution and prognosis of advanced cancers. Nat. Genet. 50, 1189 (2018).
Nik-Zainal, S. et al. The life history of 21 breast cancers. Cell 149, 994–1007 (2012).
Bolli, N. et al. Heterogeneity of genomic evolution and mutational profiles in multiple myeloma. Nat. Commun. 5, 2997 (2014).
Gundem, G. et al. The evolutionary history of lethal metastatic prostate cancer. Nature 520, 353 (2015).
Jamal-Hanjani, M. et al. Tracking the evolution of non–small-cell lung cancer. N. Engl. J. Med. 376, 2109–2121 (2017).
El-Kebir, M., Satas, G. & Raphael, B. J. Inferring parsimonious migration histories for metastatic cancers. Nat. Genet. 50, 718–726 (2018).
Van Loo, P. et al. Allele-specific copy number analysis of tumors. Proc. Natl Acad. Sci. USA 107, 16910–16915 (2010).
Chen, H., Bell, J. M., Zavala, N. A., Ji, H. P. & Zhang, N. R. Allele-specific copy number profiling by next-generation DNA sequencing. Nucleic Acids Res. 43, e23–e23 (2014).
Favero, F. et al. Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data. Ann. Oncol. 26, 64–70 (2014).
Ha, G. et al. TITAN: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data. Genome Res. 24, 1881–1893 (2014).
Shen, R. & Seshan, V. E. FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing. Nucleic Acids Res. 44, e131–e131 (2016).
Cun, Y., Yang, T.-P., Achter, V., Lang, U. & Peifer, M. Copy-number analysis and inference of subclonal populations in cancer genomes using Sclust. Nat. Protoc. 13, 1488 (2018).
Boeva, V. et al. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics 28, 423–425 (2011).
Oesper, L., Mahmoody, A. & Raphael, B. J. THetA: inferring intratumor heterogeneity from high-throughput DNA sequencing data. Genome Biol. 14, R80 (2013).
Oesper, L., Satas, G. & Raphael, B. J. Quantifying tumor heterogeneity in whole-genome and whole-exome sequencing data. Bioinformatics 30, 3532–3540 (2014).
Zaccaria, S., El-Kebir, M., Klau, G. W. & Raphael, B. J. in International Conference on Research in Computational Molecular Biology, pp. 318–335 (Springer, 2017).
Zaccaria, S., El-Kebir, M., Klau, G. W. & Raphael, B. J. Phylogenetic copy-number factorization of multiple tumor samples. J. Comput. Biol. 25, 689–708 (2018).
Fischer, A., Vázquez-García, I., Illingworth, C. J. & Mustonen, V. High-definition reconstruction of clonal composition in cancer. Cell Rep. 7, 1740–1752 (2014).
Notta, F. et al. A renewed model of pancreatic cancer evolution based on genomic rearrangement patterns. Nature 538, 378 (2016).
McPherson, A. W. et al. ReMixT: clone-specific genomic structure estimation in cancer. Genome Biol. 18, 140 (2017).
Gawad, C., Koh, W. & Quake, S. R. Single-cell genome sequencing: current state of the science. Nat. Rev. Genet. 17, 175 (2016).
Gerlinger, M. et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N. Engl. J. Med. 366, 883–892 (2012).
Makohon-Moore, A. P. et al. Limited heterogeneity of known driver gene mutations among the metastases of individual patients with pancreatic cancer. Nat. Genet. 49, 358 (2017).
Schuh, A. et al. Monitoring chronic lymphocytic leukemia progression by whole genome sequencing reveals heterogeneous clonal evolution patterns. Blood J. Am. Soc. Hematol. 120, 4191–4196 (2012).
Roth, A. et al. PyClone: statistical inference of clonal population structure in cancer. Nat. Methods 11, 396 (2014).
Miller, C. A. et al. SciClone: inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution. PLoS Comput. Biol. 10, e1003665 (2014).
El-Kebir, M., Oesper, L., Acheson-Field, H. & Raphael, B. J. Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics 31, i62–i70 (2015).
El-Kebir, M., Satas, G., Oesper, L. & Raphael, B. J. Inferring the mutational history of a tumor using multi-state perfect phylogeny mixtures. Cell Syst. 3, 43–53 (2016).
Deshwar, A. G. et al. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 16, 35 (2015).
Jiang, Y., Qiu, Y., Minn, A. J. & Zhang, N. R. Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing. Proc. Natl Acad. Sci. USA 113, E5528–E5537 (2016).
Hughes, M. C. & Sudderth, E. Memoized Online Variational Inference for Dirichlet Process Mixture Models. In: Advances in Neural Information Processing Systems, 1133–1141 https://papers.nips.cc/paper/4969memoizedonlinevariationalinferencefordirichletprocessmixturemodels (2013).
Adalsteinsson, V. A. et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nat. Commun. 8, 1324 (2017).
Ivakhno, S. et al. tHapMix: simulating tumour samples through haplotype mixtures. Bioinformatics 33, 280–282 (2017).
Ewing, A. D. et al. Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat. Methods 12, 623 (2015).
Salcedo, A. et al. A community effort to create standards for evaluating tumor subclonal reconstruction. Nat. Biotechnol. 38, 97–107 (2020).
Yu, Z., Liu, Y., Shen, Y., Wang, M. & Li, A. CLImAT: accurate detection of copy number alteration and loss of heterozygosity in impure and aneuploid tumor samples using whole-genome sequencing data. Bioinformatics 30, 2576–2583 (2014).
Pitea, A. et al. Copy number aberrations from Affymetrix SNP 6.0 genotyping data—how accurate are commonly used prediction approaches? Briefings in Bioinformatics, 21, 272–281 (2020).
Kim, C. et al. Chemoresistance evolution in triple-negative breast cancer delineated by single-cell sequencing. Cell 173, 879–893 (2018).
Casasent, A. K. et al. Multiclonal invasion in breast tumors identified by topographic single cell sequencing. Cell 172, 205–217 (2018).
Maddipati, R. & Stanger, B. Z. Pancreatic cancer metastases harbor evidence of polyclonality. Cancer Discov. 5, 1086–1097 (2015).
Raphael, B. J. et al. Integrated genomic characterization of pancreatic ductal adenocarcinoma. Cancer Cell 32, 185–203 (2017).
Dentro, S. C., Wedge, D. C. & Van Loo, P. Principles of reconstructing the subclonal architecture of cancers. Cold Spring Harb. Perspect. Med. 7, a026625 (2017).
Kleinheinz, K. et al. ACEseq – allele specific copy number estimation from whole genome sequencing. Preprint at https://doi.org/10.1101/210807 (2017).
Schwarz, R. F. et al. Phylogenetic quantification of intratumour heterogeneity. PLoS Comput. Biol. 10, e1003535 (2014).
El-Kebir, M. et al. in International Workshop on Algorithms in Bioinformatics, pp. 137–149 (Springer, 2016).
El-Kebir, M. et al. Complexity and algorithms for copy-number evolution problems. Algorithms Mol. Biol. 12, 13 (2017).
Myers, M. A., Satas, G. & Raphael, B. J. Calder: inferring phylogenetic trees from longitudinal tumor samples. Cell Syst. 8, 514–522 (2019).
Chiang, D. Y. et al. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat. Methods 6, 99 (2008).
Xi, R. et al. Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion. Proc. Natl Acad. Sci. USA 108, E1128–E1136 (2011).
Carter, S., Meyerson, M. & Getz, G. Accurate estimation of homologue-specific DNA concentration-ratios in cancer samples allows long-range haplotyping. Preprint at https://doi.org/10.1038/npre.2011.6494.1 (2011).
Gusnanto, A., Wood, H. M., Pawitan, Y., Rabbitts, P. & Berri, S. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics 28, 40–47 (2011).
Huang, W., Li, L., Myers, J. R. & Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).
Acknowledgements
We thank Christine Iacobuzio-Donahue and Alvin Makohon-Moore for assistance in obtaining the copy-number data from their publication^{30}. We thank Stefan Dentro, Peter Van Loo, and David Wedge for assistance in running Battenberg on our simulated data. We thank Gavin Ha for assistance in running TITAN on our simulated data. This work was supported by US National Institutes of Health (NIH) grants R01HG007069 and U24CA211000 and a US National Science Foundation (NSF) CAREER Award (CCF1053753) to B.J.R.
Author information
Contributions
S.Z. and B.J.R. conceived the project, developed the theory and algorithms, and wrote the paper; S.Z. implemented the algorithms and performed the analyses.
Ethics declarations
Competing interests
B.J.R. is a co-founder of, and consultant to, Medley Genomics. S.Z. declares no competing interests.
Additional information
Peer review information Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zaccaria, S., Raphael, B.J. Accurate quantification of copy-number aberrations and whole-genome duplications in multi-sample tumor sequencing data. Nat Commun 11, 4301 (2020). https://doi.org/10.1038/s4146702017967y
DOI: https://doi.org/10.1038/s4146702017967y