Introduction

Underlying the maintained genomic diversity within a patient tumor is the uncontrolled proliferation, a core hallmark of cancer1, coupled with somatic alterations occurring over time2. To prevent the disease, uncovering the somatic aberrations responsible for the malignant growth is the primary goal of precision oncology. At the genomic level, somatic alterations exist on a spectrum, ranging from small changes such as somatic single nucleotide variants (SSNV)3 to large somatic copy number alterations (SCNA). Frequent chromosomal mis-segregation (chromosomal instability or CIN) leads to abnormal chromosome numbers (aneuploidy)4 and unbalanced structural variations (SV) cause segmental SCNAs5. These two genomic errors are intertwined in many solid tumors, leading to extensive SCNAs, especially in advanced diseases4 with poor clinical outcomes6.

The inextricable relation of SCNAs to cancer initiation7,8 and progression9 has become widely recognized in cancer genomics. It remains little known, however, to what extent a specific SCNA accounts for the malignant growth and how it affects the intra-tumor-heterogeneity (ITH)4. Indeed, chaotic karyotype and widespread high copy number (CN) states in aneuploid tumors10 pose a significant challenge in identifying oncogenic SCNAs, limiting the precision of using SCNA patterns for diagnostic and treatment purposes. For example, the treatment strategy for osteosarcoma, the most common bone tumor affecting teenagers with one of the most chaotic aneuploid genomes, has stagnated for decades11. From an evolutionary perspective, discovering the tempo of SCNA during somatic evolution is key to gaining knowledge of SCNA drivers12. Here, we hypothesize that the timing of SCNAs can be systematically measured from whole genome sequencing (WGS) data of patient tumors, and the temporal axis contains tangible information in isolating the effect of specific SCNAs on tumorigenesis.

We should pause to clarify how bulk sequencing data capture somatic evolution timeline. The tumor founder cell arose from the succession of clonal expansions in the pre-cancerous context where beneficial alterations endow progenitor cells with the ability to crowd out less advantageous populations13 (Fig. 1A). The growth of the primary lesion gives rise to genomically diverging lineages14, some of which, after acquiring a more malignant potential, can initiate the re-growth of a secondary tumor, such as metastasis15. Bulk sequencing data provide us with the opportunity to anchor the roots of expansion (the most recent common ancestor, or MRCA). Clonal variants in a single sample refer to the root of the observed sample. In multi-region sequencing, truncal variants from multiple samples could trace back asymptotically to the founder of the tumor16. For longitudinal sampling, e.g., of paired primary and metastatic tumors, truncal variants could point to the MRCA of the branched tumor progression17. Multi-samples reflect the population expansion at a broader scale, i.e., they coalesce to an earlier progenitor cell than a single sample does. Collectively, truncal variants revealed by a particular sampling strategy must map to the somatic evolution timeline prior to the corresponding sampling-relevant expansion (SRE).

Fig. 1: Measuring the arrival and initiation time of SCNAs.
figure 1

A drawing at the top marks the concept of cancer somatic evolution, which leads to the birth of the most recent common ancestor (MRCA) of the primary tumor, as well as the roots of secondary tumors. Tx: treatment. A For an SCNA present within the MRCA of a sampling-relevant tumor expansion, we aim to characterize the time when the last gain and first gain appeared (referred to as arrival and initiation time, respectively) for the corresponding genomic region. B We further aim to address the question if the late truncal gains are neutral or beneficial to tumor evolution. C The solid black line shows that the burden of SSNVs of a given genomic locus correlates with its CN state in tumors sequenced by ICGC (n: the number of independent samples with whole genome doubling). Two dashed lines assume the two extreme scenarios of SCNA arrival time. D The proportion of SSNVs at different allele states depends on the SCNA history matrix and the relative time span of each CN stage. Two possible history matrices are shown with the SCNA at 5:1. A 2D density plot in grayscale shows the burden of single-copy SSNVs against the lead time (tK) simulated using the two matrices. Source data are provided as a Source Data file.

The timing of a truncal SCNA on the evolution toward the MRCA could shed light on the impact of this variant on promoting the SRE. However, our knowledge about the SCNA timing remains fragmentary as the existing methods are restricted to simple (single or double) gains16,18,19. Single-sample based pan-cancer analysis from ICGC19 have revealed the molecular time of these simple gains. Relative to aneuploid tumors, these low CN gains may not be sufficient to induce the final tumor expansion. For example, single chromosome gains have shown limited capability in driving proliferation in vitro20. As such, it would be crucial to know when a genomic segment further evolves beyond the simple gains, which can often culminate at a state greater than four copies10. This requires a timing method applicable to complex gains with high CN states.

In SCNA timing analysis, the following assumption is made: the site frequency spectrum (SFS) of SSNVs in a genomic region affected by SCNA depends on the trajectory (the order) and time span on each CN state that the segment had ever rested on21. For a single gain, the ratio between early (duplicated) and late (non-duplicated) SSNVs can be used to estimate the relative timing of the gain12. For a clonal SCNA involving multiple gains resulting in high CN (>4), one can divide its evolution timeline between germline to the founder of SRE into three fractions (Fig. 1A). The first time fraction (t0) is the initiation time when the first gain occurs. The third fraction (tK) is the lead time which measures the delay from the last gain to the onset of population expansion. We then define 1 − tK as the arrival time. While the detailed trajectory is not identifiable from the SFS18, we can still learn the upper bounds of these time fractions from the SSNV data. In particular, once a segment arrives at the final CN state, it can only accumulate single-copy SSNVs. The longer the segment persists in the observed CN state, the more overwhelming the single-copy SSNVs (Fig. 1D).

A significant question is how the timing of an SCNA reflects its impact on fitness (Fig. 1B). Whereas early gains could initiate and increase the risk of disease, propelling the initial proliferation of cancer cells, we suggest that late-appearing SCNAs close to MRCA could promote the population expansion more directly. If a clonal lineage persists over many generations and accumulates a significant number of mutations following the acquisition of an SCNA, it is improbable that the SCNA alone can drive the tumor’s ultimate growth. Conversely, if an SCNA instigates the final expansion of a tumor, it is conceivable that the progenitor cell undergoes robust proliferation immediately upon acquiring the specific SCNA, resulting in few subsequent alterations attaining clonality. Punctuated acquisition of polyploidy (e.g., through genome doubling or GD) is prevalent in aneuploid tumors22 but it remains unclear how close the occurrence of GD is to tumor transformation. Evidence exists that GD itself doesn’t confer a strong fitness advantage23; instead, it can enhance the plasticity of the genome that permits further CN evolution, such that aneuploid cells adapt to overcome possible fitness penalties incurred by GD24. Therefore, SCNAs that arrive after GD could contain driver events. Moreover, depending on the precise location of biopsied tissue, single-sample analyses may differ in the corresponding time scale; subsequently, it is particularly essential to focus on the timeline toward the malignant growth - the somatic evolution in collecting truncal SSNVs of multiple samples of a tumor (e.g., multi-region samples or paired primary and metastatic samples).

In this study, aiming to broaden the “timeable” genomic regions for SCNAs, we develop Butte (BoUnds of Time Till Expansion), a computational framework to estimate the upper bounds of SCNA arrival and initiation time from WGS data. By applying Butte onto multi-sampled WGS data of five cancer types with widespread SCNAs, including osteosarcoma, we systematically chart the temporal patterns of CN evolution in vivo. To see if late-appearing SCNAs may confer fitness benefits, we construct mathematical models to examine the evolutionary mechanisms that give rise to these late truncal events. Furthermore, we also interrogate potential ways the late culminating SCNAs could add to the fitness and reveal its impact on mutational diversification during tumor expansion. The terminology employed in this manuscript is detailed in Supplementary Table S1.

Results

A computational framework to estimate the arrival time of SCNA gains

From the WGS data, one can characterize with high certainty the dominant SCNAs, inferring the integer allelic CN of a genomic region and the cellular prevalence of the corresponding SCNA, i.e., the percentage of cells sharing the dominant SCNA state25. We refer to a unique version of a genomic region (or segment) as an “allele”. We term the total CN as Nt and the CN for the minor allele as Nb ("b” stands for b-allele determined by germline SNPs) for a dominant SCNA. The “allele” state (a) of an SSNV is the amount out of the total Nt copies of the region that carry the corresponding variant. We found that in the aneuploid tumors sequenced by ICGC (International Cancer Genome Consortium)26, the SSNV burden increases with the dominant SCNA states of the corresponding genomic region (Fig. 1C). The pattern can be attributed to an inherent correlation between SCNAs and SSNVs: a genomic segment resting on a CN state accumulates SSNVs at a rate proportional to the corresponding CN. Thereby, the burden and multiplicity of SSNVs are actively shaped by SCNAs.

The observed SCNA of a genomic segment (with a configuration Nt: Nb different from 2: 1) is the result of a series of CN events. For an SCNA involving at least K gain events, the total time of somatic evolution can be divided into K + 1 stages. The segment begins with the 2: 1 setting in the first stage and keeps “climbing up” by duplicating one of its existing copies in each subsequent stage, respectively, until it arrives at the observed SCNA state in the last stage (Fig. 1D). Accordingly, each stage is associated with certain time proportion (tk ≥ 0) and \(\mathop{\sum }\nolimits_{0}^{K}{t}_{k}=1\), the total time for the somatic evolution. SSNVs occurring at stage k on a segment copy that experiences duplication(s) in later stages will remain present on the duplicated copies with the allele state a ≥ 2/Nt. By contrast, SSNVs acquired at stage k on a copy without further duplications remain at the single allele state (a = 1/Nt). One can define a history matrix A with entry Ajk representing the number of segment copies in stage k that produce the final allele state (i.e., frequency) aj = j/Nt18,21. It can be seen that the abundance of SSNVs at allele state aj depends on ∑Ajktk. From the site frequency spectrum (SFS) of SSNVs in a region affected by SCNA, one can estimate the relative abundance of SSNVs at each allele state, and in turn, solve for each tk. There has been much effort to infer t0, i.e., the timing of the first copy number event18,19. These efforts focused on single gain (2: 0 and 3: 1) and at most double gains (3: 0 and 4: 1), where the history matrix A is invertible. By contrast, for other SCNA states, multiple possible trajectories can exist and the underlying linear system is underdetermined, i.e., there are more time stages (unknown variables) than the possible allele states (equations). We note that, however, regardless of the underlying history, multi-allele SSNVs (≥ 2/Nt) can only occur before the last stage (K) of CN evolution; once the genomic region arrives at the observed clonal SCNA state, all the copies (Nt) would accumulate SSNVs at single allele state (1/Nt). Therefore, the longer the last stage of CN evolution (from the emergence of the SCNA to the onset of population expansion), the more overwhelmingly the single allele SSNVs dominate the SFS (Fig. 1D). The monotonicity property enables the examination of the proportion of time of stage K.

To investigate how various SCNAs unfold during somatic evolution, we developed Butte (BoUnds of Time Till Expansion), which adopts linear programming to infer the upper bounds of lead (tK) and initiation time (t0) of SCNAs (Fig. 2). Butte extends the full maximum-likelihood estimation procedure implemented in cancerTiming18. Notably, Butte does not restrict the analysis to single and double gains, but in addition allows the calculation of the upper bounds of tK and t0 for SCNAs up to seven total copies, broadening the “timeable” SCNA regions. The estimated timing systematically correlate with the actual ones of SCNA initiation and culmination (Supplementary Fig. S1). To see the effect of SCNA history, read depth (D), number of SSNVs (M) and tumor purity (P) on the timing inference, we simulated different histories for SCNAs at 6:1 and 6:2, respectively and obtained virtual SSNV data (see Methods, Supplementary Figs. S2, S3, S4, S5). M have a higher impact on timing accuracy than D and P. While the overestimation of tK and underestimation of t0 is prominent when M is less than 20, the timing results are robust for M above 50, especially for SCNA 6:1. As SCNA 6:2 involves gains for both alleles, the resulting SSNVs have a smaller set of possible allele states than SCNA 6:1. For 6:2, Butte tends to overestimate tK when it is small (≤0.3). Conversely, t0 can be underestimated if the intermediate stages are brief (e.g., t0 ≥ 0.5 and tK = 0.3), owing to the penalty for infeasible models implemented in Butte.

Fig. 2: An example workflow of Butte.
figure 2

Tumor ESCA_R_8 has a clonal SCNA (at 6:1) on chr13, as indicated on the top left. LogR is the Log2 copy number ratio between the tumor and matched normal sample and the B-Allele Frequency quantifies the allelic imbalances. Butte takes the read counts (total depth and the depth of the mutant allele) for SSNVs on chr13, the SCNA state and tumor purity as the input. Butte then works out the allele state distribution by using Expectation Maximization. By adopting linear programming with all possible history matrices, Butte returns the upper bounds of initiation and lead time of the SCNA, respectively (CI: confidence interval). The variant allele frequency distribution of SSNVs is shown on the bottom left to illuminate the relationship between allele states and SCNA timing. Note that the VAF is affected by tumor purity.

To test the performance of Butte on real tumors, we first evaluated the timing predictions by analyzing multi-region WGS data of colorectal adenocarcinomas (COAD)16,27. Similar to the simulation results, a higher number of SSNVs leads to increased precision in timing estimation (Supplementary Fig. S6). With 50 SSNVs, the median confidence interval falls below 0.2. Butte successfully identified early CN gain of chromosome (chr) 5q (Supplementary Fig. S7), corresponding to the SCNA state of 2: 0 (copy neutral loss of heterozygosity), a known early step in COAD initiation involving gene APC28. Additionally, Butte’s event timing on the ICGC BRCA dataset29 aligns with a method employing graph theory to predict genome rearrangement history using both SSNVs and SVs (Supplementary Fig. S8). The timing of low copy number gains estimated by Butte corresponds with predictions made by the method emphasizing joint likelihood of copy number timing16 (Supplementary Fig. S9). As a reference for late-appearing events, private (sample-specific) SCNAs should contain events that arise in the descendent lineage of the MRCA of multi-samples. Using all the multi-sampled WGS data listed in Table 1, Butte predicted their arrival time to be later than the public SCNAs on the timeline leading to the MRCA. This underscores its ability to uncover SCNAs that occur at a later stage. (Supplementary Fig. S10).

Table 1 WGS data included in this study

To guarantee accuracy in identifying late and early gains using a predefined time threshold, we evaluated the precision, false positive, and true positive rates for predicting t0 and tK, respectively (see Methods for details). Overall, these evaluations showed an increase in performance metrics with the threshold values (Supplementary Figs. S11S12). To strike a balance between precision and false positive rate, we defined late gains as those occurring in the last 20% of the truncal time measured by clonal SSNVs, and early gains as those occurring in the first 20%.

Evolving SCNA gains define the tumor transformation leading to the most recent clonal expansion

To evaluate the tempo of SCNAs in solid tumors, we applied Butte on five tumor types by analyzing eight published WGS datasets: osteosarcoma (OS)30,31, breast invasive carcinoma (BRCA)32,33, colorectal adenocarcinomas (COAD)16,27, esophageal carcinoma (ESCA)34, and prostate adenocarcinoma (PRAD)35, six of which comprise multi-sampling of patient tumors (Fig. 3, Table 1). 70% of the analyzed genomes (corresponding to 87% of the patients) were near triploid, with the median fraction (IQR) of the high amplitude CN regions (≥4) being 0.37 (0.23 to 0.49). Loss of heterozygosity (LOH) is prevalent but mostly is at copy neutral or amplified states in the triploid tumors. High amplitude gains can be recurrent across cancer types (e.g., chr 8q) or within a specific tumor type (e.g., chr 1q for BRCA, chr 17p for OS, and chr 7 for COAD). These recurrent gains presumably contain driver events36, yet their tempo in somatic evolution remains uncharted. Notably, karyotypes largely remain stable across different samples of the same tumor, despite the presence of continued subclonal CN diversification in a relatively minor fraction of the genome.

Fig. 3: CN profiles across five tumor types from the re-analysis of published WGS data.
figure 3

Vertical bars represent the segmental CN states along the autosomal chromosomes characterized by the WGS data of tumor samples. For each sample, a color-coded thick bar shows the total CN state of each genomic locus and a thin gray bar to the right indicates that region has loss of heterozygosity (LOH). Samples belonging to the same patient are boxed. The top panel highlights the fraction of high CN states in each sample’s genome. The lower panel exhibits the sample phylogenetic trees constructed from SSNVs. Sample IDs, the reference where the WGS data was published, and tumor types are tabulated at the bottom. Presence of GD are indicated. In this manuscript, a tumor sample is named after the concatenation of the tumor type, the first character of the author’s surname and the patient ID. Source data are provided as a Source Data file.

We note that 74 out 75 patient tumors acquired late-appearing gains close to MRCA regardless of the overall ploidy or tumor type (Fig. 4A, B), with the only exception of COAD_C_4, which shows high microsatellite instability. Punctuated copy number bursts were observed in the triploid samples, reflecting the ability of the genome to leapfrog over intermediate states to reach moderately high CN states through whole or partial genome doubling (GD)22,37. Most of them could derive from whole genome doubling, but we could not exclude the possibility that individual tumors had duplications of multiple chromosomes instead of the whole genome. Here we refer to these synchronized gains as GD. Whereas GD occurs late (close to the MRCA) in some adult cancers (18 out of 34 patients), it appears to be an earlier event in many other tumors. This is particularly evident in OS where 28 of 30 patients had GD at the mid-stage of somatic evolution toward SRE (Fig. 4B). The contrasting tempo of GD suggests that it probably has a context-dependent function. In tumors with early GD, Butte can characterize the post-GD CN evolution, whereby progenitor cells continue to sample the aneuploid fitness landscape24. Notably, the rate of gains is higher post-GD than the rate pre-GD (paired Wilcoxon test, two-sided, V value = 630, sample size = 35, p = 5.8e-11, effect size Cohen’s d = 1.13, 95% Confidence Intervals of the effect size ranges from 0.62 to 1.65, Supplementary Fig. S13). Such an SCNA evolution involves stochastic chromosomal or structural abnormalities; however, certain genomic regions preferentially exhibit late gains across different patients in a particular tumor type, which, includes those recurrent high amplitude gains, such as chr 8q in OS (Fig. 4C) and chr 7 in COAD (Supplementary Fig. S7). On the other hand, recurrent SCNAs appear to initiate early, e.g., chr 1q in BRCA, chr 8q and chr 17p (TP53) in OS and chr 5q (APC) in COAD (Fig. 4C and Supplementary Fig. S7). Recent experiments have highlighted a significant fitness advantage linked to chr1q gains38. It is noteworthy that while these recurrent SCNAs tended to emerge either earlier or establish later in comparison to less common SCNAs, the correlation between copy number and timing was not consistent across all genomic regions (Supplementary Fig. S14). These additional gains pre- and/or post-GD could result from either high evolvability of the corresponding region, or persistent selection upon driver genes within.

Fig. 4: The timing patterns of SCNAs across five tumor types.
figure 4

A SCNA timing of three exemplified tumors. CN states along the genome are shown on the left of each panel. The right panel visualizes the time fraction of somatic evolution from germline to the MRCA of the patient tumor. For each SCNA segment, the inferred time points for its initiation and arrival are shown as either rectangles (exactly solved timing) or arrows (upper bounds of timing when the solutions are not unique) with the same color-coding as its CN. Confidence interval of the inferred timing is drawn by lines. The top panel shows the cumulative distribution (CDF) of SCNA arrival time. B The CDF curve of SCNA arrival time is shown for each patient categorized by the tumor type. C The figure displays normalized rank sums of timing across patients for each genomic bin, representing initiation time for BRCA and arrival time for OS (see Methods). Color-highlighted bins indicate recurrent early-initiating gains for BRCA and recurrent gains established late for OS (with 90% confidence level). Source data are provided as a Source Data file.

The earlier the timing of GD, the more post-GD CN gains (Fig. 5A, where OS and other adult tumor types exhibit R2 values of 0.31 and 0.27, F statistics of 11 and 5.2, degrees of freedom of 25 and 14, and p values of 0.002 and 0.039, respectively). The late evolving gains are shorter in segment length than those associated with GD (Supplementary Fig. S15). This suggests that the post-GD CN evolution is driven by SVs, which occur at a higher rate than chromosomal mis-segregation. Indeed, the breakpoints of structural variants almost locate at the boundary of SCNA segments (Supplementary Fig. S16). As SVs continued to occur, regions containing driver genes could become focally amplified due to selective advantages. These genes would thus appear more often in the late gains, e.g., MYC39 and RUNX240 in OS (Fig. 4C). Given the same total copy number, amplified LOH (Nb = 0 and Nt ≥ 3) tend to culminate later than other types of amplifications, such as allele specific gains (Nb = 1 and Nt ≥ 3). For example, at Nt = 5, amplified LOH (53 events) exhibits a longer arrival time than allele specific gains (106 events). The one-sided Wilcoxon test yielded a W statistic of 3656, a p-value of 0.001, and an effect size Cohen’s d of 0.57. The 95% Confidence Intervals for the effect size range from 0.23 to 0.9. This contrast cannot be explained by the overestimation of Butte (Supplementary Figs. S17 and S1). Whereas truncal LOH were supposedly acquired before GD15 causing the complete loss of tumor-suppressor activity, the late appearing gains of the only remaining allele may indicate that these regions potentially acquire dosage-dependent gain-of-functions41.

Fig. 5: Mathematical modeling suggests that final-expansion driver gains should occur late.
figure 5

A Scatter plots colored by density illustrate the number of post GD gains (in log2 scale) against the time fraction of post-GD evolution towards MRCA for OS and other adult cancer types, respectively. The green center line represents the fitted regression line (for tumors with post-GD gains) and the the green error bands represent the 95% confidence interval limits (with R2 and p values as noted). B The schema shows the setup of the two contrasting mathematical models: (1) GD is followed by neutral evolution where additional gains do not confer a fitness advantage and (2) post-GD gains increase the growth rate. C 2D density plots of the two metrics as in panel 5A characterized by the selection model. We studied the effects of the growth rate of GD (with a fixed growth rate of the MRCA, the left panel) and the rate of beneficial post-GD driver gains (the right panel), respectively. We simulated 10,000 cases for each parameter setting. The density is produced from smoothed data points with each point referring to an average of 50 cases. To convert the post-GD evolution time in the simulation into fractions as in (A), we assume that GD occurs at 120 time units (roughly corresponds to a GD rate of 7 × 10−5 82 during pre-GD evolution with a birth and death rate at 1.1 and 1, respectively). Note that the modeling is not intended to infer the parameters from patient data. Left panel: b0 = 1, a1 = 1.5, b1 = 1, u0 = 0.245, and u1 = 0.00001; right panel: a0 = 1.05, b0 = 1, a1 = 1.5, b1 = 1, and u0 = 0.2. Source data are provided as a Source Data file.

Mathematical modeling suggests the role of late truncal gains in promoting final expansion

While early genomic alterations garner significant attention for their functional implications, our understanding of the significance of late truncal alterations that emerge in proximity to the MRCA remains limited. Employing mathematical models, we constructed a rationale mentioned in the Introduction: a truncal alteration leading to the final expansion might be situated near the MRCA if the progenitor cell undergoes rapid proliferation upon acquiring this specific alteration. Using GD as an example, as it can occur early or late, we reason that early GD and prior alterations alone are insufficient to drive the final tumor expansion. However, if GD occurs closer to the MRCA, it, along with other late aberrations, can propel the tumor’s ultimate growth.

We utilized a multi-type branching process model42 to simulate tumor somatic evolution, focusing on two simplified models for key insights applicable to complex contexts. In the base model (neutral model), initiated with a single tumor-initiating cell acquiring the first driver mutation (GD, Fig. 5B), cells reproduce at a rate of a0 and die at a rate of b0, resulting in net growth rate λ0 = a0 − b0 > 0. Post-GD, daughter cells acquire passenger mutations at a rate of u0 without altering the net growth rate. In this context, tumor expansion is solely driven by GD, and all post-GD gains are passengers. Cells lacking beneficial post-GD gains are type 0 cells. In the selection model, cells with GD can acquire beneficial post-GD gains at a rate of u1 < u0, enhancing fitness (λ1 = a1 − b1 > λ0, detailed in Methods). Our goal is to characterize post-GD gains reaching fixation or dominance in tumors under two scenarios: without and with beneficial post-GD gains. Notably, in both models, post-GD gains are proportional to the mutation rate and time between GD and MRCA, validated partially in (Fig. 5A).

We first analyze the base model. Conditioned on the non-extinction of the population, we can obtain that the number of post-GD gains reaching fixation follows a geometric distribution with parameter \(\frac{{\lambda }_{0}}{{\lambda }_{0}+{u}_{0}}\) and mean \(\frac{{u}_{0}}{{\lambda }_{0}}\) (see Methods). The mode of this distribution is at zero, similar to the cases where GD appears late and post-GD CN gains are rare. To tolerate the inclusion of subclonal but dominant SCNAs as the clonal variants, we further evaluated the dominant post-GD gains shared by the majority (≥90%) of cancer cells. Building on the results of43, we derived the expected number of dominant post-GD gains in a tumor with size N as

$$\tilde{S}=\frac{N}{\lceil 0.9N\rceil }\cdot \frac{{u}_{0}}{{\lambda }_{0}}\approx 1.11\frac{{u}_{0}}{{\lambda }_{0}},$$
(1)

which is only slightly larger than the clonal ones. Assuming that u0 and λ0 are comparable (based on experimentally measured u0 for SCNA around 0.2 and the cancer cell death rate not significantly approaching the birth rate44,45), \(\tilde{S}\) would be no more than just a few. Moreover, numerical simulations show that the number of dominant post-GD gains continues to follow a geometrical-like distribution with the mode at zero (Supplementary Fig. S18). Thus, if post-GD gains do not provide growth benefits, GD would be one of the last events before the MRCA as few of post-GD gains can become dominant in the observed tumor.

However, if cells with GD can acquire an additional beneficial post-GD gain (meaning GD alone is not enough to drive the final expansion), the situation drastically changes. To emphasize how this happens, let us denote cells with the beneficial post-GD gain by type 1 cells and consider the first type 1 cell that grows into an infinite number of descendants. We assume that the descendants of the first type 1 cell dominate the cell population when the sampling is performed (see Methods for details). The expected number of passenger post-GD gains (\(\bar{S}\)) carried by a type 1 cell at the moment of its introduction is proportional to the time of occurrence of the type 1 cell (the birth time of the tumor-initiating cell is set to be 0). In Methods we show that the distribution of the birth time of the first non-extinct type 1 cell, \({\mathbb{P}}\left({\sigma }_{1} \, > \, t| {\Omega }_{\infty }\right)\), where σ1 represents the birth time and Ω represents the event that the population does not go extinct, can be characterized as a function of the rate of beneficial gains u1 and growth parameters of type 1 and type 0 cells, respectively (Lemma 1). \(\bar{S}\) is thus

$$\bar{S}=\int\nolimits_{0}^{\infty }{\mathbb{P}}\left({\sigma }_{1} \, > \, t| {\Omega }_{\infty }\right){u}_{0}dt,$$
(2)

where we utilized the formula (see Section V.6 of46) to calculate the expected birth time of the first non-extinct type 1 cell. This calculation involved utilizing the tail probability (\({\mathbb{P}}\left({\sigma }_{1} \, > \, t| {\Omega }_{\infty }\right)\)) and multiplying the expected birth time by the passenger mutation rate u0.

We explored various choices of growth parameters that capture the fitness difference between type 0 (without beneficial post-GD gain) and type 1 cells (with beneficial post-GD gain). As compared to the base model, the selection model results in a much higher abundance of post-GD gains across a large parameter space (Supplementary Fig. S19). Notably, lowering the fitness level of type 0 cells delays the birth of the type 1 cell (Fig. 5C), conditioned on a fixed net growth rate of the type 1 cell. The prolonged period of post-GD evolution (accompanied by a higher abundance of post-GD gains) could also be attributable to a lower rate of beneficial post-GD gains (Fig. 5C).

If the identified SCNAs in patient data represent the dominant clone of the entire tumor (a scenario more likely in multi-sampling than single-sampling), our model implies that late clonal gains may harbor drivers for final expansion. The presence of early GD in many patient tumors suggests that post-GD gains might offer added advantages for promoting the final expansion. It’s worth noting that our mathematical model doesn’t rule out the possibility of late somatic alterations of other types, beyond copy number gains, driving the final expansion. Thus, investigating late alterations in various forms of somatic changes is an area of interest.

In the context of multiple drivers, previous research28 indicates that the driver conferring the highest fitness advantage is likely to emerge early in random occurrences. Our simulations, shown in Supplementary Fig. S20, support this idea. When two drivers occur at equal rates, the one with the greater fitness increase is more likely to emerge first in the initial cell acquiring both drivers. Additionally, the probability of this early appearance increases with the magnitude of the fitness advantage provided by the more beneficial driver. These findings suggest that early gains may involve drivers with significant fitness effects, while late truncal gains near the MRCA are particularly relevant for the fitness increase driving the final expansion.

Ways evolving CN gains contribute to fitness increase and mutational diversification

As SCNAs have a global impact on gene expression in cancer47, the evolving CN gains potentially affect dosage-sensitive genes whose gains have a functional impact. In the OS and BRCA tumors, as the CN evolves, we can see an enrichment of putative dosage-sensitive genes that are in pathogenic CNV peak regions derived from dbVar48,49 (Fig. 6A). Moreover, we observed a similar enrichment for genes involved in sustaining proliferative signaling: one of the most fundamental capabilities of cancer cells1. No such enrichment is observed when utilizing copy number (CN) data alone (Supplementary Fig. S21), underscoring the valuable additional insights provided by incorporating timing information. MYC, EGFR and KIT are among such genes with late gains in both OS and BRCA, emphasizing their ability in stimulating cell multiplication in multiple tumor types. In addition, post-GD late gains tend to affect genes whose inactivation (upon CRISPR knockout) alters cell proliferation dynamics50 according to the DepMap database (Supplementary Fig. S22).

Fig. 6: Ways the late CN gains contribute to the fitness of cancer progenitor cell.
figure 6

A The gene set enrichment analysis (GSEA) was performed on the gene list ranked by the averaged CN arrival time for BRCA and OS tumors, respectively. The scatter plot on the left shows the normalized enrichment scores (NES) for each set of cancer census genes belonging to the predefined cancer hallmarks by COSMIC database. Red-colored highlighted pathways have a false discovery rate less than 0.15. The vertical bars on the right panels visualize the timing-ranks of genes that belong to the highlighted gene sets. The height of each bar corresponds to the arrival time. B A cartoon illustrates the multiplicity increase of an early sequence variant due to the inclusion of that variant by a late CN gain, with annotations indicating the type of variants (symbol shapes), level of multiplicity (color hues) and the variants' association with a late gain (right arrow) or an early gain (vertical bar). BP: breakpoint. C The SCNA timing plot of an example OS tumor similarly arranged as in Fig. 4, with additional links and symbols highlighting the SV breakpoints in known cancer genes that are amplified by late gains. D The matrix plot demonstrates genes with recurrent somatic variants and their multiplicity across the five tumor types. Note that a high multiplicity indicates that an early somatic variants gained more copies of the mutant. Names for known cancer genes are in bold. Genes with variants showing higher multiplicity levels than gene TTN are also included. Symbol annotations are the same as in (B). Source data are provided as a Source Data file.

The evolving gains could amplify the impact of early functional variants by increasing their multiplicity (Fig. 6B). Such a mechanism potentially affects SV breakpoints in known oncogenes (e.g., MAP3K13, MECOM and PREX2), breakpoints in genes known to be involved in oncogenic fusions (e.g., AFF3, LPP and ERG), and simple mutations in oncogenes (e.g., SSNVs in SMARCA4 and CACNA1A), see Fig. 6C, D. MAP3K13 had been shown to promote tumor growth in high MYC-expressing cells51,52, a similar context as in the OS39.

We note that highly mutated tumor suppressor genes (TSG), such as TP53, RB1 and APC, also have their early mutants duplicated or amplified (Fig. 6D). Whereas these are presumably inactivation variants, the retaining of multiple copies of the variants could suggest different roles that remain unclear, such as a potential gain-of-function of APC mutants in COAD53. The fact that SRE requires the amplifications of some early variants, rather than starting immediately upon acquiring a single copy of these variants, suggests that late-appearing gains could cooperate with the early variants to promote tumor expansion. On the other hand, late SV breakpoints (at single copy state) coupling evolving gains are prominent in genes located in common fragile sites, e.g., FHIT and MACROD2. Late alterations of these genome “caretakers” could facilitate further genome evolution and expedite clonal expansion54,55.

Lastly, the quantitative relation between SCNA evolution and SSNV accumulation, the rationale behind our timing method, implies that SCNA gains bolster mutational diversification between sub-populations during tumor growth. In principle, the higher the truncal CN state of a genomic segment, the higher the mutational divergence between subclones for the corresponding locus. As tumor expands, genomic regions at distinct SCNA states would accumulate SSNVs at different rates, leading to the heterogeneity of the SSNV burden along the genome. For example, when comparing two samples of a tumor, the sample-specific SSNVs are more abundant for regions with higher CN states (Fig. 7A, B). Notably, the overall CN state affects the structure of phylogenetic trees, i.e., it explains more than 50% of the variance of the relative branching distance measured by SSNVs in COAD and PRAD patients, where extensive multi-region sampling is available (Fig. 7C). Furthermore, continued evolution of SCNAs between subpopulations would also alter the SSNV divergence. For example, the SSNVs divergence is particularly enlarged for regions showing different CN states between the two samples (Fig. 7B). As such, increased SSNV diversity in regions with CN gains provides more materials for further selection within the expanding cell populations.

Fig. 7: The effect of SCNA on SSNV diversification during tumor expansion.
figure 7

A The rate of sample-private SSNVs when comparing two samples of a COAD patient tumor. The segmental CN states (total and minor CN) along the autosomal chromosomes for the two tumor samples are shown as gray rectangles above and below the x axis. The rate of sample private SSNVs (per million base pairs, blue line) fluctuates with the CN states, supporting the assumption that point mutations accumulate at a rate proportional to their CN. Genomic regions with different CN states between the two samples are in light red background. B The box plots on the left panel illustrate the rate of private SSNVs in sample P1 detected in regions at a given total CN state. The box represents the interquartile range, covering the central 50% of the data. The line inside the box indicates the median. Whiskers extend to the minimum and maximum values within a specified range, excluding outliers. The half-violin plots on the right panel demonstrate such a rate for regions showing stable or diverging CN states between the two samples with p value of Wilcoxon rank sum test (two-sided, W value = 136214 and 95% confidence interval ranges from 0.7 to 1) and effect size indicated. C The branching distance relative to truncal distance in a tumor’s phylogenetic tree was calculated for each of the COAD, ESCA and PRAD tumors to evaluate the correlation with the averaged CN of the corresponding tumor samples. Annotations show the percentage of variance explained by a linear regression model (with p values of the model fitting shown). Source data are provided as a Source Data file.

Discussion

Despite the well-established link between a chaotic genome in tumors with the CIN phenotype and poor clinical outcomes6, the mechanisms by which specific aberrations contribute to tumor growth remains poorly understood38. In this study, we have created a computational framework for measuring the arrival and initiation time of SCNAs during the somatic evolution of the MRCA of tumor sample(s), including complex gains with high CN states. By applying this method on multi-sampled WGS data of patient tumors, we have found that late truncal CN gains close to the most recent clonal expansion leading to the tumor sample(s) are common across multiple tumor types. Mathematical modeling predicts that these late evolving gains could contain final-expansion driver events, promoting the tumor growth. As CN gains increase the gene dosage and early somatic variants, we further demonstrated that incorporating the SCNA timing into an integrated genomic analysis has a strong potential for isolating the functional effect of specific aberrations.

Early genomic changes are presumably beneficial for tumor initiation28, but it is unclear the effect of late truncal events. Here we have provided evidence that gains occurring late in the somatic evolution, i.e., close to MRCA, can also be beneficial. The simplified two-event cancer development model posits that the cancer-initiating event is followed by the promoting event56. We reason that the evolving CN gains might render the progenitor cell capable of “self-promoting,” as they act similarly as a tumor “promoter” by (a) increasing the dosage of genes causing sustained proliferative signaling, (b) amplifying the mutant allele with early initiating driver variants, and (c) accelerating the accumulation of further genomic alterations. As both the early and late CN alterations could confer fitness advantages, chromosomal regions with SCNAs initiated early and arrived late, i.e., showing repetitive gains accompanying the entire course of the somatic evolution, could function as copy number “addictions.” These underscore the value of Butte in identifying complex SCNAs with early initiation and late arrival.

GD, a landmark event in CN evolution, has a context-dependent fitness effect. The punctuated CN gains successfully induced the SRE in tumors that underwent a late GD. By contrast, for many other tumors, especially osteosarcoma, GD was followed by additional CN gains that produces the MRCA. GD could tolerate the occurrence of deleterious passengers57. However, simply escaping purifying selection was not sufficient to drive the ultimate outgrowth, at least not in the tumors with post-GD gains, where some chromosomal regions can reach higher CN states. Alternatively, GD may create an inflated genome space, accelerating the accumulation of driver alterations (Supplementary Fig. S13). Prolonged evolution following GD or arm-level aneuploidy before the MRCA has recently been associated with a poor prognosis in neuroblastoma58. As GD itself affects many genes, regions with pre- and/or post-GD gains could serve as a reduced search space for CN drivers.

Our method is applicable to a wide range of SCNAs, yet it is still challenging to analyze exceptionally high CN states (i.e., above eight). We note that regions with such a high CN likely evolve over time, such as the unequal segregation of extra-chromosomal oncogene amplifications59,60. As such, late changes are expected for these amplified regions. Some focal high-level gains could involve small segments where the number of SSNVs is inadequate for calculation. This problem can be mitigated by borrowing information from nearby segments with the identical CN state. This strategy is applicable to synchronized SCNAs, such as chromothripsis61,62. In addition, our analysis may have missed some late-appearing SCNAs due to overestimation of tK (Supplementary Figs. S1, S3). Based on benchmarking simulations and the CIs observed in real data, we recommend utilizing > 20 SSNVs for a SCNA segment, coupled with a read coverage depth > 50 and tumor purity > 0.5, to achieve a robust timing inference. Furthermore, deletion was not modeled as it is unidentifiable18; by comparing CN profiles between subpopulations, however, it is possible to study deletions during tumor expansion. Lastly, our framework relies on a constant SSNV rate per base per unit time within a region under SCNA evolution. We focused on the timeline of clonal SSNVs only as mutational signatures can differ in activity between clonal and subclonal lineages63. Specialized pre-selection of SSNVs that faithfully possess a clock-like behavior19 is necessary for tumors whose mutagenic processes varies drastically within the studied timeline.

In our initial endeavor to understand the evolving gains in tumor development, our mathematical modeling focused on the timeline from germline to the tumor founder (or at least the founder of the dominant clone of the tumor). Our results suggest that if the MRCA forms a dominant clone of the tumor, the presence of abundant post-GD gains suggests that late gains may contain final-expansion drivers. Readers should note that even though the timing method works for any samples, the claim that late gains may contain final-expansion drivers relies on the condition that the MRCA of the sample is the founder of the dominant clone. In the analysis of individual samples, certain SCNAs may be associated with a subclone of considerable size, causing them to be identified as clonal. Consequently, some late gains could originate from subclonal events, although their impact on fitness has not been thoroughly investigated yet. Thus, we advocate the use of multi-region WGS data for analyzing late evolving gains.

Our findings also illustrate the existence of a fundamental connection between CN evolution and SSNV diversity, which can explain the positive correlation between aneuploidy and mutational burden when excluding hypermutated tumors64,65. Such a connection also indicates that we need to account for the dynamic nature of ongoing SCNAs when measuring subclonal evolution, which remains a challenge66. Finally, our results suggest that much can be gained by including the SCNA arrival time in studying tumor evolution, thereby shifting focus on exclusively early drivers to the evolving genomic events that affect the rate of tumor progression.

Methods

Somatic variant calling from WGS data

Raw WGS data in bam or fastq formats were downloaded from public databases provided by the original publications (Table 1) through the utilization of specific tools: pyega3 (version 3.4.0) for the European Genome-phenome Archive, sftp (version 3) for the Japanese Genotype-phenotype Archive, and sratoolkit (version 2.10.8) for the Database of Genotypes and Phenotypes. The cumulative read depth distribution along the human genome (hg38) and the tumor purity and ploidy for each sample are illustrated in Supplementary Figs. S23, S24, and S25. We have extended our existing pipeline, which had achieved a balance in sensitivity and specificity in detecting SSNVs by borrowing information across multiple samples67,68, to allow the detection of clonal SCNAs and the breakpoints of structural variants.

SSNVs and INDELs: Analysis-ready read alignment bam files (against hg38) were generated according to the best practices, including indel realignment, base recalibration and flagging of duplicated reads. Raw candidate variants were produced by MuTect (v1.1.7)3. To reduce the false-positive rate due to misalignments or other technical artifacts and to salvage the variants that may be missed due to uneven read coverage between samples, the alignment features surrounding each candidate variant were collected for each sample. The heuristic-based criterion for the read alignment patterns was adopted to refine and variant calls68. Small insertions and deletions were called by using Strelka (v1.0.15)69.

SCNAs: Copy number and tumor purity were estimated by using TitanCNA (v1.26.0)25. Germline heterozygous SNVs used as input to TitanCNA were identified using Samtools (v1.5)70 and subject to the same filtering strategy as was applied to SSNVs. The one-clone solution reported by TitanCNA (i.e., the sample is dominated by a clone with an SCNA profile along the genome) globally fit the data of the read coverage and allelic imbalance well, with a few exceptions for which the two-clones solution are necessary to explain the data of specific genomic regions. The ploidy baseline (CN = 2) is determined by the model complexity in explaining the log read ratio and allelic imbalance of heterozygous SNPs (see Supplementary Fig. S26 as an example).

SVs: We incorporated two distinct SV calling tools relying on orthogonal approaches, i.e., DELLY (v0.7.8, abnormal read pair and split-read analysis)71 and GRIDSS (v2.10.1, local assembly based algorithm)72. We focused on the SV breakpoints found by both tools, as these shared calls generally have higher quality (e.g., with higher breakpoint confidence) than those unique to each tool (Supplementary Fig. S27). SV breakpoints were annotated with AnnotSV73.

Analysis of genomic divergence

SCNA divergence: When multi-samples are available for a patient, the truncal and private SCNAs were identified as follows: (1) we partitioned the genome into disjoint segments by considering all the SCNAs called from the samples of the patient; (2) for each segment, we calculated a generalized likelihood ratio statistics for the comparison between two samples. The statistics is the ratio of the values of the likelihood function (the probability of observing the read depth ratio and B-allele frequency for SNPs in the region) evaluated at the maximum likelihood estimation in the sub-model (two samples have the same CN profile) and at the maximum likelihood estimation in the full-model; and (3) the statistics converges weakly to a random variable with chi-square distribution and thus can be used to determine if a segment shows significantly different SCNA states between the two samples. The term “truncal SCNAs” refers to SCNAs that exhibit no difference in pairwise comparisons.

Sample phylogeny: We applied Treeomics (v1.7.13)74 to construct sample phylogenies from SSNVs that are clonal in at least one specimen. Treeomics takes into account the uncertainty due to purity differences and variations of read depth on the SSNV loci to derive robust sample phylogenies. We note that Treeomics assumes sample homogeneity.

Clonality, multiplicity of SSNVs and SV breakpoints: SSNVs were categorized as either public (present in all tumor cells) or private based on their sharing patterns and allele frequencies in multi-sampling data68. In individual samples, clonal SSNVs were identified as those with the 95% confidence interval of cancer cell fraction (CCF) covering 1. We focused on the public SSNVs (multi-sampling) and clonal SSNVs (single sampling) for the timing analysis. For SSNVs or SV breakpoints existing in an SCNA region, we applied a binomial model to calculate the maximum likelihood estimates of the number of segment copies containing that variant21.

Allele state distribution of SSNVs in an SCNA

For SSNV i in an SCNA region (with CN configuration of Nt : Nb and M ≥ 10 SSNVs in total), we obtained from WGS the read counts carrying the mutant allele mi out of the total number of reads di. Expectation Maximization algorithm was used to estimate the proportion of SSNVs at each possible allele fraction, i.e., a vector q that gives the probability of randomly acquired SSNVs in this region having a purity-adjusted allele frequency (fi = aj) for each possible allele state \(\frac{j}{{N}_{t}}\). Note that we used the same symbol a for allele frequency, as it is intrinsically tied to the allele state. To calculate the likelihood function of observing the particular SSNV data in an SCNA region, we sum across all possible allele states the product of qj and the probability that SSNV i at aj has the observed read counts. Conditioned on the successful detection of the SSNV, the log-likelihood is given by,

$$\mathop{\sum }\limits_{i=1}^{M}\log \Pr ({m}_{i}| {m}_{i} \, > \, 0,\, q)=\mathop{\sum }\limits_{i=1}^{M}\log \left(\frac{\mathop{\sum }\nolimits_{j=1}^{{N}_{t}-{N}_{b}}\Pr ({m}_{i}| {f}_{i}={a}_{j}){q}_{j}}{1-\mathop{\sum }\nolimits_{j=1}^{{N}_{t}-{N}_{b}}{(1-{a}_{j})}^{{d}_{i}}{q}_{j}}\right).$$
(3)

Estimating the upper bounds of SCNA timing

Assume that a clonal SCNA region (with Nt: Nb at 5:1, Fig. 1D) has total M SSNVs. We disregard deletions and model the CN evolution as 3 gain events, creating 4 stages of increasing CN states during the somatic evolution timeline from the germline to the founder cell of the tumor. Denote by the vector t = (t0, …, t3) the fraction of time in each stage. We refer to t0 as the initiation time, and t3 (or tK where K refers to the last stage) the lead time. Let aj represent a possible allele state for a SSNV, i.e., the fraction of the allelic copies with the SSNV. The possible values of aj in this case are {1/5, …, 4/5}. Let the vector q = {q1, …, q4} represent the fraction of the M SSNVs having allele state for each of the possible aj. Let A be a history matrix, with the entry Ajk representing the number of copies in stage k that would result in an allele state of j/5. In other words, all SSNVs originating from stage k on those Ajk copies will lead to the same allele state j/5.

Multiple paths or ordering of gain events could lead to the same SCNA state. In Fig. 1D, we graphically demonstrate the two possible histories of SCNA at 5:1. Our objective is to construct estimators of t0 and t3 using q and A. Denote by c the sum of the fraction of time multiplied by the number of copies in each stage. We have

$$c=2{t}_{0}+3{t}_{1}+4{t}_{2}+5{t}_{3}.$$

Because the probability of a mutation occurring in stage k is proportional to tk−1(k + 1) (the fraction of the lifespan spent in stage k multiplied by the number of copies in stage k), we have

$${q}_{j}\, \approx \, \mathop{\sum }\limits_{i=0}^{3}\frac{{t}_{i}{A}_{j(i+1)}}{c}.$$
(4)

We can write equation (4) in matrix form:

$${{{{{{{\bf{q}}}}}}}}\, \approx \, {{{{{{{\bf{A}}}}}}}}{{{{{{{\bf{t}}}}}}}}/c,$$
(5)

When the history matrix A is invertible, we have

$${{{{{{{\bf{t}}}}}}}}\, \approx \, c{{{{{{{{\bf{A}}}}}}}}}^{-1}{{{{{{{\bf{q}}}}}}}},$$
(6)

and thus the desired estimators can be obtained. Note that in Fig. 1D, history A(a) induces an invertible history matrix, while the history matrix for history A(b) is singular. Therefore, the method introduced in ref. 18 cannot be directly applied.

For single and double gains (Nt: Nb at 3:1, 2:0, 4:1 or 3:0), t0 and tK are directly solved because matrix A is unique and invertible. For other SCNA states, Butte uses linear programming to obtain the upper bounds of timings across all possible history matrices for the corresponding CN configuration (Supplementary Fig. S28). Let s denote the vector of the column sum of matrix A. Let a denote the vector of possible allele states. Abusing the notation, we now interpret q as the vector with entry qj representing the probability of a randomly acquired SSNV having allele frequency for each possible allele state aj. Then the relation between A and t can be expressed as

$${{{{{{{\bf{A}}}}}}}}{{{{{{{\bf{t}}}}}}}}/({{{{{{{{\bf{s}}}}}}}}}^{T}{{{{{{{\bf{t}}}}}}}})={{{{{{{\bf{q}}}}}}}}.$$
(7)

From equation (7), we have

$$({{{{{{{\bf{A}}}}}}}}-{{{{{{{\bf{q}}}}}}}}{{{{{{{{\bf{s}}}}}}}}}^{T}){{{{{{{\bf{t}}}}}}}}={{{{{{{\bf{0}}}}}}}},$$
(8)

where 0 represents a vector with all 0’s. Since t denotes the time fractions of different CN evolution stages, we have

$${{{{{{{\bf{1}}}}}}}}\cdot {{{{{{{\bf{t}}}}}}}}=1,$$
(9)

where 1 represents a vector with all 1’s. Butte solves the following optimization problem by linear programming:

$$\begin{array}{c}\mathop{\max }\limits_{{{{{{{{\bf{t}}}}}}}}}\quad {t}_{K}\\ \,{{\mbox{s.t.}}}\,\quad ({{{{{{{\bf{A}}}}}}}}-{{{{{{{\bf{q}}}}}}}}{{{{{{{{\bf{s}}}}}}}}}^{T}){{{{{{{\bf{t}}}}}}}}={{{{{{{\bf{0}}}}}}}}\\ {{{{{{{\bf{1}}}}}}}}\cdot {{{{{{{\bf{t}}}}}}}}=1,\end{array}$$

where tK is the last element in vector t. The maximum value of tK gives us an upper bound of the lead time given A. For upper bounds of initiation times, we instead maximize t0 which is the first element in t. To tolerate noise in the allele state distribution estimated from sequencing data, we add a slack variable on each capacity constraint, having a penalty cost of 100. The confidence intervals of the estimated upper bounds were calculated through bootstrapping the SSNV data.

Benchmarking the timing method

To evaluate the robustness of Butte, we simulated SSNV data for a SCNA region across different evolutionary histories, depth of coverage (D), number of available mutations (M) and tumor purity (P). Given a known evolutionary history of a SCNA and known timing for each time stage, we generated the allele state distribution vector q. Given the total number of SSNVs M available for the SCNA region, the number of SSNVs at distinct allele states were drawn from a multinomial distribution with probability for these allele states (q) for each iteration. Given the tumor purity P, the true allele frequency f of a mutation at a known allele state was calculated. The sequencing depth of mutations (D) follows a negative binomial distribution D ~ NB(uD, σD), with mean uD, and dispersion σD = uD/10. The resulting read counts of the mutant allele having a true allele frequency of f follows a binomial distribution m ~ Bin(D, f).

To assess the precision, true positive rate (TPR), and false positive rate (FPR) of both early and late event detection, we conducted simulations for SCNA state (Nt: Nb) at 6:2 and 6:1 Supplementary Figs. S11S12. To evaluate late gains, we use a threshold value (T) and set the initiation time (t0) at 0.1, 0.2, and 0.3, respectively. Subsequently, we simulated SSNV data using randomly generated lead time tK values. True positives were identified when the predicted tK was less than T and the actual \({t}_{K}^{{\prime} }\) was also less than T. False positives occurred when the predicted tK was less than T but the actual \({t}_{K}^{{\prime} }\) exceeded T. Conversely, true negatives were cases where both the predicted and actual \({t}_{K}^{{\prime} }\) values were greater than T, and false negatives were instances where the predicted tK exceeded T but the actual \({t}_{K}^{{\prime} }\) was less than T. Similar criteria were applied for early gains by comparing the predicted and actual t0 values with the threshold (T). The performance metrics were evaluated across different threshold values (T).

To examine tumors from29 using Butte (Supplementary Fig. S8), we acquired SSNV and SCNA predictions for the same samples from the PCAWG (PanCancer Analysis of Whole Genomes) dataset26 via the ICGC data portal. Subsequently, Butte was applied to the downloaded dataset, comprising 15 tumor samples with event timing predictions previously reported by ref. 29. These predictions included events such as tetraploidy, trisomy, tandem duplication, and GD, determined using the graph theory-based method21.

Determining the timing of genome doubling

We identified clustered gains by clustering the inferred timing via non-parametric density estimation (R-package pdfCluster)75. We define GD as the prominent and concentrated burst of gains, containing more than 40% of all timed segments. A cutoff of 40% seems suitable for identifying the prominent burst of gains, as depicted in Supplementary Fig. S29. It’s important to emphasize that we do not assume duplications necessarily cover the entire genome during GD events. Gains in these clusters have similar timing estimates (standard deviation σ ~ 0.1) (Supplementary Fig. S30), suggesting that they occurred within a narrow time window. We regard the averaged timing of all the segments in the corresponding cluster as the timing of GD. Post- and pre-GD events were identified as those occurred ± 1.3σ away from the GD, respectively.

Detecting recurrent early or late gains

We partitioned the genome into bins of 1 million base pairs each and ranked these bins in each sample based on their respective timing values (t0 for BRCA and 1-tK for OS, respectively). To avoid ties, we introduced jitter to the timing values. Subsequently, we calculated the deviation from the middle rank of each sample for each bin. This middle rank represented the expected value under the null hypothesis, signifying no recurrent early (for BRCA) or late (for OS) gain regions across patients. For each tumor type, we aggregated these rank deviations across patients for each bin. Normalizing these rank sums by their standard deviations produced standardized values, which approximately followed a standard normal distribution if the null hypothesis held true. A significantly negative standardized rank sum indicated recurrent early initiating gains, while a markedly positive value indicated recurrent late establishment gains. Simultaneously, to assess the prevalence of gains in each genomic bin across patients, we ranked the segment mean (which represented the read depth ratio between tumor and normal samples) in a similar manner as the timing values. Subsequently, we applied the same rank sum normalization technique. A notably positive normalized rank sum for the segment mean would indicate frequent high copy number gains across patients.

Functionality of genes affected by late gains

To see which cancer hallmarks are associated with late gains, we performed Gene Set Enrichment Analysis (GSEA)76 on the gene list ranked by the averaged arrival time across patients of SCNAs affecting a corresponding gene. We used the gene sets representing hallmarks of cancer1 from COSMIC database77. R-package fgsea78 was utilized to perform the GSEA analysis with 50000 permutation.

To further evaluate the fitness effect of genes affected by late gains, we analyzed the gene Chronos score50 (gene knockout fitness effect) provided in the DepMap database. The Chronos score reflects the change in cell proliferation upon the CRISPR knockout of the respective gene in a particular cell line. A lower negative Chronos score indicates that the gene is a denpendency in a cell line. For simplicity, we took the average score for each gene across all the cell lines. For each tumor, we calculated the fraction of genes affected by late gains with a mean Chronos score < − 0.5. We then obtained the normalized ratio (NR) by dividing it to the ratio calculated from all the genes in the database. To get a background of the NR of randomized genomic regions, given a set of segments showing late gains, we randomly sampled regions by keeping the same segment lengths by using R-package regioneR79. We then compared the normalized ratio between patient data and the randomization (Supplementary Fig. S22).

Mathematical modeling of late evolving gains

Consider two contrasting models based on multi-type branching processes with mutations. In both models, the tumor grows from a single tumor-initiating cell which just acquired GD. During the tumor’s progression, cancer cells accumulate mutations (post-GD gains). In the base model (neutral model), all mutations are passenger mutations. Therefore, all cancer cells give birth at a rate of a0 and die at a rate of b0. The net growth rate is λ0 = a0 − b0 > 0. Neutral mutations occur at rate u0 per unit time throughout the lifetime of a cell, and each mutation is distinct following the infinite-sites model80. Clonal post-GD gains are those acquired prior to the cell division (or the onset of expansion) leading to two surviving sublineages. Let this division event (denoted by an effective birth) occur at a rate of λ0 (cf. page 10 of42) conditioned on the non-extinction of the population. By the memoryless property of the exponential distribution, counting the number of post-GD gains prior to the first effective birth is analogous to counting the number of tails until the first head in a sequence of coin tosses, where the probability of a head (effective birth) is \(\frac{{\lambda }_{0}}{{\lambda }_{0}+{u}_{0}}\) and the probability of a tail (neutral mutation) is \(\frac{{u}_{0}}{{\lambda }_{0}+{u}_{0}}\). Therefore, the number of gains before the first effective birth follows a geometric distribution with parameter \(\frac{{\lambda }_{0}}{{\lambda }_{0}+{u}_{0}}\) and mean \(\frac{{u}_{0}}{{\lambda }_{0}}\). We then investigated the number of mutations which are shared by more than 90% of the total population (we refer to them as dominant mutations). Gunnarsson and his co-authors43 derived exact expressions for the expected SFS of a cell population that evolves according to a branching process. We utilized their results on the skeleton subpopulation (see Appendix C of43)—cells with an infinite line of descents which determines the high frequency spectrum—to express the expected number of dominant mutations \(\tilde{S}\) when the tumor reaches a fixed size N as

$$\tilde{S}=\frac{N}{\lceil 0.9N\rceil }\cdot \frac{{u}_{0}}{{\lambda }_{0}}\, \approx \, 1.11\frac{{u}_{0}}{{\lambda }_{0}}.$$
(10)

In the alternative model (selection model), the tumor-initiating cell and its descendants with only passenger mutations form the type 0 population. Type 0 cells give birth at a rate of a0 and die at a rate of b0. The net growth rate is λ0 = a0 − b0 > 0. Type 0 cells mutate to type 1 cells at a rate of u1. Type 1 cells give birth at a rate of a1 and die at a rate of b1. The net growth rate is λ1 = a1 − b1 > λ0. Both type 0 and type 1 cells accumulate passenger mutations at a rate of u0. We assume that when the tumor is sampled, the descendants of the first type 1 cell with infinite lineage dominates the population. This is usually the case in our simulations where the fitness advantage conferred by the beneficial post-GD gain is large. We note that due to stochasticity, descendants of a later-appearing type 1 cell (e.g., the second or the third one, etc.) can also dominate the population. However, type 1 cells acquired later would accrue more post-GD gains on average and thus our claim stays valid. On the other hand, it is the equivalent of the base model if no type 1 cells dominates the population upon sampling. In Lemma 1, we obtained the distribution of the time to the first type 1 cell with infinite lineage conditioned on the non-extinction of the tumor.

Lemma 1

Let σ1 denote the time of occurrence of the first type 1 cell that gives rise to a family which does not die out, and let Ω denote the event of non-extinction of the tumor. Then

$${\mathbb{P}}\left({\sigma }_{1} \, > \, t| {\Omega }_{\infty }\right)=\frac{{a}_{0}\left(1-{q}_{0}\right)+\frac{{u}_{1}\left(1-{q}_{1}\right)}{1-{q}_{0}}}{{a}_{0}\left(1-{q}_{0}\right)+\frac{{u}_{1}\left(1-{q}_{1}\right)}{1-{q}_{0}}{e}^{\zeta t}},$$

where

$${q}_{0}= \frac{{a}_{0}+{b}_{0}+{u}_{1}-\sqrt{{\left({a}_{0}+{b}_{0}+{u}_{1}\right)}^{2}-4{a}_{0}\left({u}_{1}{q}_{1}+{b}_{0}\right)}}{2{a}_{0}},\\ {q}_{1}= \frac{{b}_{1}}{{a}_{1}},\quad {{{{{{{\rm{and}}}}}}}}\\ \zeta= \frac{{u}_{1}\left(1-{q}_{1}\right)}{1-{q}_{0}}+{a}_{0}\left(1-{q}_{0}\right).$$

With Lemma 1, we can obtain the expected number of passenger mutations accumulated in the first type 1 cell with infinite lineage, denoted by \(\bar{S}\):

$$\bar{S}=\int\nolimits_{0}^{\infty }{\mathbb{P}}\left({\sigma }_{1} \, > \, t| {\Omega }_{\infty }\right){u}_{0}dt.$$
(11)

With (10) and (11), we can obtain that the expected number of dominant post-GD gains in the subpopulation generated from the first type 1 cell with infinite lineage is \(1.11\frac{{u}_{0}}{{\lambda }_{1}}+\bar{S}+1\), where the last 1 represents the number of post-GD driver gains. Proof for Lemma 1 and details of (10) can be found in Supplementary Methods.

Mathematical modeling in the context of multiple drivers

In our multi-type branching process model, we examine two driver mutations: mutation one and mutation two. We initiate with a single cell devoid of mutations, which possesses a birth rate of a0 and a death rate of b0. Mutation one, acquired at a rate of u1, adds δ1 to the birth rate (a1 = a0 + δ1). Mutation two, acquired at a rate of u2, adds δ2 to the birth rate (a2 = a0 + δ2). Cells with both mutations have a birth rate of a3 = a0 + δ1 + δ2. Death rate remain the same as b0. We simulated tumor growth from a mutation-free cell until the first cell acquired both drivers without going extinct (100,000 for each parameter set). We calculated the fraction of cases where mutation one occurred first.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.