## Abstract

Recurrent successions of genomic changes, both within and between patients, reflect repeated evolutionary processes that are valuable for the anticipation of cancer progression. Multi-region sequencing allows the temporal order of some genomic changes in a tumor to be inferred, but the robust identification of repeated evolution across patients remains a challenge. We developed a machine-learning method based on transfer learning that allowed us to overcome the stochastic effects of cancer evolution and noise in data and identified hidden evolutionary patterns in cancer cohorts. When applied to multi-region sequencing datasets from lung, breast, renal, and colorectal cancer (768 samples from 178 patients), our method detected repeated evolutionary trajectories in subgroups of patients, which were reproduced in single-sample cohorts (*n* = 2,935). Our method provides a means of classifying patients on the basis of how their tumor evolved, with implications for the anticipation of disease progression.


## Additional information

**Publisher’s note:** Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


## Acknowledgements

This work is supported by the Wellcome Trust (202778/B/16/Z to A.S.; 202778/Z/16/Z to T.A.G.; 105104/Z/14/Z to the Centre for Evolution and Cancer, Institute of Cancer Research), Cancer Research UK (A22909 to A.S.; A19771 to T.A.G.), the Institute of Cancer Research (Chris Rokos Fellowship in Evolution and Cancer to A.S.), and ERC (MLCS 306999 to G.S.).

## Author information

### Affiliations

#### Evolutionary Genomics and Modelling Lab, Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK

- Giulio Caravagna
- Andrea Sottoriva

#### School of Informatics, University of Edinburgh, Edinburgh, UK

- Giulio Caravagna
- Ylenia Giarratano
- Guido Sanguinetti

#### Centre for Medical Informatics, Usher Institute, University of Edinburgh, Edinburgh, UK

- Ylenia Giarratano

#### Department of Pathology, Stanford University, Stanford, CA, USA

- Daniele Ramazzotti

#### Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, UK

- Ian Tomlinson

#### Centre for Tumour Biology, Barts Cancer Institute, Queen Mary University of London, London, UK

- Trevor A. Graham


### Contributions

G.C., G.S., and A.S. designed the approach and interpreted the results. G.C. defined the method, and G.C. and Y.G. implemented it. G.C., Y.G., and D.R. analyzed the data. I.T. contributed data. G.S. and A.S. supervised the study with input from T.A.G. All of the authors drafted and approved the manuscript.

### Competing interests

The authors declare no competing interests.

### Corresponding authors

Correspondence to Giulio Caravagna or Guido Sanguinetti or Andrea Sottoriva.

## Integrated supplementary information

### Supplementary Figure 1 Synthetic test: example CCF and phylogenies.

Two example cases of ambiguous (A) and non-ambiguous (B) Cancer Cell Fraction (CCF) values. In the top case, we sequence *r = 3* regions and detect *c = 4* clones (via subclonal deconvolution), where we annotate 4 distinct driver lesions; the color represents the driver. When we compute the possible phylogenetic trees from the CCF values reported in the data matrix, we find 6 equally scoring trees, none of which violates the pigeonhole principle. Each of the solutions, however, provides a different ordering of the drivers (information transfer; see also Supplementary Figure 2). By chance, in this case the true model is not top-ranked, and hence cannot be trivially retrieved with a standard uncorrelated fit of the CCF data; for this reason, we term this CCF dataset ambiguous. In the bottom case, instead, we show non-ambiguous CCF data where the true model ranks top. We here model a patient with *c = 6* clones; light-gray nodes have no driver annotated. This patient has many more phylogenetic trees associated, but the true model ranks top (linear path). Other equally scoring trees have the same driver ordering as the true model, as they only differ by the placement of clones without drivers (5 and 6). Violations start after the 3rd ranked model; the fifth ranked model mistakes the parent of clone 4, thus transferring “wrong” orderings.

### Supplementary Figure 2 Information transfer in REVOLVER.

We can build a model *T* for a patient that captures the evolutionary trajectories of its tumour. The model is a tree whose nodes are the groups of alterations annotated in the patient’s data, some of which are flagged as drivers. There can be one (A) or more (B) drivers annotated in each of the input groups. We focus on trajectories that involve recurrent drivers appearing in several patients; we would like these trajectories to be consistent across patients (*repeated evolution*). REVOLVER is a Transfer Learning approach that scans several possible models for a patient, trying to match predictions across multiple patients. The method “transfers” orderings estimated from evolutionary trajectories, which are extracted by the transitive closure of a path in the tree. An ordering connects two driver alterations, as annotated in the cartoon; here, for instance, the green driver is upstream of the brown one (A). The germline (GL) is added to transfer the information on the tumour-initiating driver. The information transfer is not necessarily a tree; it is a graph whenever more than one driver is annotated in a group (B), as in the bottom panel.

### Supplementary Figure 3 REVOLVER.

First (A) and second (B) steps of the REVOLVER algorithm to fit the data from a cohort of cancer patients. The first step is an Expectation Maximization, of which we show the optimization gradient in the E- and M-steps of the fit. We are interested in repeated trajectories among drivers observed in more than one patient (coloured nodes; see also Supplementary Figure 2). During the fit, we have identified the best model for this patient (left), but the next iteration of the EM might change our best guess for this patient. Here we focus on the trajectory for the gray driver, currently downstream of the green ones in the information transfer. REVOLVER measures the correlation of this tree against the ones fit to the rest of the cohort: in this example this prediction is supported by only one other model, while three suggest an alternative trajectory initiated by the turquoise driver (central panel). Via **w**, we define a gradient that can induce a new scoring of the trees by means of a penalized likelihood; the model to the right is the new best (maximum likelihood estimate) since its trajectory is more correlated with the rest of the patients. Notice that we can place the gray mutation in 5 different positions and still obtain the same information transfer; in this case the one that we select is driven entirely by the likelihood (red asterisks). This change is driven by a combination of factors: (i) how much better the “alternative” model explains this patient's data with respect to the original model, and (ii) how strong the consensus (information transfer) is on the trajectory of the gray and turquoise drivers.

Once we have converged to the EM solution, we can further expand our models (B) with Transfer Learning. Intra-group trajectories for drivers that belong to the same node of the tree cannot be inferred from the data of a single patient. This is the case here for A, B, C and D, which are clonal drivers. After correlating the structures of the models, however, we can observe the orderings of A, B, C and D in the rest of the cohort via **w**. Here, we show a graph representation of **w** (central panel), and highlight in red the Maximum Likelihood Estimate (MLE) of the driver upstream of each of A, B, C and D (most frequent parent). We then expand the node of our model to reflect those orderings. Uncertainty is reflected in the structure of the estimated paths; it should be a linear chain of events (assuming that A, B, C and D are all true drivers) but **w** might not be able to retrieve it. For instance, in this example, we are not sure whether the pink driver is downstream of the green or the turquoise one, and we have no evidence of the ordering between the gray and pink drivers either.

### Supplementary Figure 4 Synthetic test: performance.

Comparison between REVOLVER and standard uncorrelated inference (A), and test of the effect of noise on the performance (B). In the first test we simulate various cohort sizes (*n = 10, 50, 100*) and proportions of patients harbouring ambiguous CCF (*p = 10, 20, 50%*; Supplementary Figure 1). We simulate bulk sampling from *r = 1, 3, 10* regions; all patients have the same *r*. There are 4 drivers per true tree, one per clone. The performance is measured as the proportion of correct parents (true positives) for the four drivers, in *N = 20* independent replicates. REVOLVER’s Transfer Learning allows disambiguating almost all cases of ambiguous CCFs in each cohort, validating the approach compared to standard uncorrelated inference. Histograms report summaries for this test, such as the number of phylogenies per patient, the number of combinations of information transfer and the number of cases in which REVOLVER fits a model which, without Transfer Learning, would have ranked lower (non-top). In the second test we assess how the performance decreases when we add Gaussian noise with mean 0 and low/high standard deviation (*σ = 0.01/0.05*), and re-run the analysis. Noise is applied independently to each entry of the input CCF matrix, with fluctuations of ±0.2 (high noise) that heavily confound CCF values. For 100 random cohorts in each simulated experiment, we observed that, with high noise, over 52% of the models transfer edges with reversed orientation. Very large noise leads to lower performance because REVOLVER transfers “noise”, eventually degenerating to the point where the whole transfer is pointless. Notice that the number of non-top fits increases with noise; this suggests that the true model receives a lower rank when we include noise, rendering the inference harder.
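The noise test and the performance measure described above can be illustrated with a minimal sketch. It assumes a CCF matrix stored as a list of rows and trees stored as driver-to-parent maps; the function names, the clipping of noisy values to [0, 1], and the data layout are assumptions of this example, not part of the REVOLVER package.

```python
import random


def add_gaussian_noise(ccf, sigma, seed=None):
    """Return a copy of a CCF matrix with independent N(0, sigma^2) noise on each entry.

    Clipping the result to [0, 1] is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return [
        [min(1.0, max(0.0, value + rng.gauss(0.0, sigma))) for value in row]
        for row in ccf
    ]


def proportion_correct_parents(true_parent, fitted_parent):
    """Fraction of drivers whose fitted parent matches the parent in the true tree."""
    hits = sum(1 for d in true_parent if fitted_parent.get(d) == true_parent[d])
    return hits / len(true_parent)


# Hypothetical example: 4 drivers on a linear tree rooted at the germline (GL).
truth = {"A": "GL", "B": "A", "C": "B", "D": "C"}
fitted = {"A": "GL", "B": "A", "C": "A", "D": "C"}  # parent of C is wrong
score = proportion_correct_parents(truth, fitted)

noisy = add_gaussian_noise([[1.0, 0.5], [0.0, 0.2]], sigma=0.05, seed=1)
```

With σ = 0.05, a perturbation of ±0.2 corresponds to four standard deviations, so most entries move far less than that bound.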
Top and bottom boxes of the plot are 25th and 75th percentiles, centerline is the mean; the upper whisker is located at the smaller of the maximum value and 75th percentile + 1.5 IQR (Inter Quartile Range), and the lower whisker is located at the larger of the minimum value and 25th percentile − 1.5 IQR; dots are outliers, which are less than 25th percentile − 1.5 IQR or more than 75th percentile + 1.5 IQR.

### Supplementary Figure 5 Analysis of colorectal cancers: full size clustering’s results.

Extended version of Fig. 2b, Main Text. At the top we show all evolutionary trajectories detected in this cohort, and at the bottom all the drivers annotated in the cohort, as well as the clonality status (average of binary observations for this cohort). Both heatmap panels are sorted by frequency of the annotated variable in the overall colorectal cohort. At the bottom we also show the counts of all the trajectories identified with REVOLVER, and their counts in the group. Further annotations show the number of occurrences of each driver, and the clonality status, across the full cohort.

### Supplementary Figure 6 Analysis of TRACERx lung cancers: full size clustering’s results.

Extended version of Fig. 3a, Main Text, without the top dendrogram. At the top we show all evolutionary trajectories detected in at least 3 patients, and at the bottom all the drivers annotated in the cohort. The top set of drivers, with larger rows, is the set of most frequent ones. Both heatmap panels are sorted by frequency of the annotated variable in the overall TRACERx cohort.
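As an illustration of how trajectories detected in at least 3 patients could be tallied, the sketch below extracts driver orderings as the transitive closure of each patient's tree (as described for Supplementary Figure 2) and keeps those that recur across the cohort. The child-to-parent encoding, the function names, and the toy cohort are assumptions of this example; whether the heatmap counts direct edges or the full closure is not stated in the caption.

```python
from collections import Counter


def orderings(parent):
    """Transitive closure of a tree given as a child -> parent map.

    Returns the set of (ancestor, descendant) pairs; the root (here the
    germline, GL) is the only node that does not appear as a key.
    """
    pairs = set()
    for node in parent:
        ancestor = parent[node]
        while ancestor is not None:
            pairs.add((ancestor, node))
            ancestor = parent.get(ancestor)  # None once we pass the root
    return pairs


def recurrent_orderings(cohort, min_patients=3):
    """Count each ordering across patients and keep those seen in >= min_patients."""
    counts = Counter()
    for parent in cohort.values():
        counts.update(orderings(parent))
    return {edge: n for edge, n in counts.items() if n >= min_patients}


# Hypothetical cohort of four patients with drivers A, B, C.
cohort = {
    "p1": {"A": "GL", "B": "A"},
    "p2": {"A": "GL", "B": "A"},
    "p3": {"A": "GL", "B": "A", "C": "B"},
    "p4": {"B": "GL"},
}
recurrent = recurrent_orderings(cohort, min_patients=3)
```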

### Supplementary Figure 7 Repeated evolutionary trajectories in the TRACERx cohort.

The 10 clusters of TRACERx tumours detected by REVOLVER, and their repeated evolutionary trajectories. For each group we report a graph where we annotate the group size (*n*), the number of times a trajectory is detected within the group on the edge, and the number of times each alteration is clonal or subclonal across all patients. GL stands for germline. In this plot, we show trajectories that occur at least 3 times.

### Supplementary Figure 8 Stability of REVOLVER’s cluster for the TRACERx cohort.

We used a jackknife approach to estimate clustering's stability, as measured via the probability that two patients are clustered together in a resampling process (A), and the number of patients harbouring an edge, across all resamples. These statistics are computed by resampling the cohort *N = 1,000* times, removing each time a random percentage (*p = 10*%) of patients and re-computing fit and clustering with the original parameters. The heatmap shows the empirical probability estimated via this jackknife approach, and the boxplot on the right is computed per cluster; for each cluster the number of points used is equal to *n*(*n* − 1)/2, where *n* is the cluster size (Main Text, Fig. 3). We report the counts of edges per patient in the bottom boxplot, where we annotate those with median above four. Top and bottom boxes of each boxplot are 25th and 75th percentiles, centerline is the mean; the upper whisker is located at the smaller of the maximum value and 75th percentile + 1.5 IQR (Inter Quartile Range), and the lower whisker is located at the larger of the minimum value and 25th percentile − 1.5 IQR; dots are outliers, which are less than 25th percentile − 1.5 IQR or more than 75th percentile + 1.5 IQR.

### Supplementary Figure 9 REVOLVER’s TRACERx clusters against clusters of occurrences.

We compare REVOLVER’s clusters to those obtained by clustering the binary occurrences of the annotated drivers: we consider the pattern of occurrence in the cohort (A) and the clonality status (B). In both cases the input data is a feature matrix where an entry is 1 if the driver has CCF above 0 in any of the samples of a patient; for both cases we show a tanglegram and the entanglement score, as well as the dendrogram coloured with REVOLVER's clusters.
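The binary feature matrix described above can be sketched as follows, assuming CCF values are stored per patient and per driver as a list with one entry per sample. The nested-dictionary layout, the function name, and the toy values are assumptions of this example.

```python
def occurrence_matrix(ccf, drivers):
    """Build a patient -> [0/1 per driver] matrix.

    An entry is 1 if the driver has CCF above 0 in any sample of the patient;
    drivers missing from a patient's data are treated as absent (0).
    """
    return {
        patient: [
            1 if any(value > 0 for value in per_driver.get(driver, [])) else 0
            for driver in drivers
        ]
        for patient, per_driver in ccf.items()
    }


# Hypothetical CCF values for two patients with two samples each.
ccf = {
    "P1": {"TP53": [0.9, 1.0], "KRAS": [0.0, 0.3]},
    "P2": {"TP53": [0.0, 0.0]},
}
matrix = occurrence_matrix(ccf, drivers=["TP53", "KRAS"])
```

The rows of such a matrix could then be passed to any standard clustering routine to produce the dendrograms that are compared against REVOLVER's clusters.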

### Supplementary Figure 10 Alternative evolutionary analysis from single-sample cross-sectional data with TRONCO.

We show the data (A) and the fit (B). We run the CAPRI algorithm from TRONCO/PiCnIc on a feature matrix where an entry is 1 if the driver has CCF above 0 in any of the samples of a patient; here we show only the drivers that occur in at least 7% of the samples in the cohort (*n = 99*), and the fit is run with all drivers that occur in at least 2% of the cohort. CAPRI uses a statistical procedure that scans the patterns of co-occurrence of the input alterations in cross-sectional single-sample tumours, and infers a Suppes-Bayes Causal Network. The graph is annotated with non-parametric bootstrap scores, and with REVOLVER’s edge-specific jackknife scores (Supplementary Notes). NA stands for edges that are never detected across resamples. These predicted associations overlap only partially with the trajectories inferred with REVOLVER, which estimates phylogenetic orderings from each patient. For instance, this model suggests a possible transition between PIK3CA and SOX2 which, however, is undetected via phylogenetic analysis of TRACERx CCFs. Conversely, no transitions are estimated for EGFR, while phylogenetic analysis determines two subgroups associated with trajectories initiated by this driver.

### Supplementary Figure 11 Survival analysis from single-sample cross-sectional data with REVOLVER’s clusters.

From REVOLVER’s clusters we can create a decision tree (A) to classify large single-sample cross-sectional cohorts (B) and test significant differences in survival outcomes (C). Here the decision tree is manually curated from the most frequent features (edges/drivers) in the 10 clusters identified by REVOLVER. With this tree we could stratify *n = 589* tumours from two TCGA and one Broad Institute projects (the groups are annotated with the cluster color) and analyze their disease-free survival with standard Kaplan-Meier curves (time unit in months). Curves are compared via logrank test (two-sided) at level 0.05; shaded regions represent 95% confidence intervals of the curves; in the panel we show only pairwise comparisons with significantly different survival risks (*p < 0.05*).

### Supplementary Figure 12 Analysis of breast cancers: full size clustering’s results.

Extended version of Fig. 4a, Main Text, without the top dendrogram. At the top we show all evolutionary trajectories detected in at least 2 patients, and at the bottom all the drivers annotated in the cohort. The top set of drivers, with larger rows, is the set of most frequent ones. Both heatmap panels are sorted by frequency of the annotated variable in the overall breast cohort.

### Supplementary Figure 13 Repeated evolutionary trajectories in the breast cohort.

The 6 clusters of breast tumours detected by REVOLVER, and their repeated evolutionary trajectories. For each group we report a graph where we annotate the group size (*n*), the number of times a trajectory is detected within the group on the edge, and the number of times each alteration is clonal or subclonal across all patients. GL stands for germline. In this plot, we show trajectories that occur at least 3 times.

### Supplementary Figure 14 Stability of REVOLVER’s cluster for the breast cohort.

We used a jackknife approach to estimate clustering's stability, as measured via the probability that two patients are clustered together in a resampling process (A), and the number of patients harbouring an edge, across all resamples. These statistics are computed by resampling the cohort *N = 1,000* times, removing each time a random percentage (*p = 10*%) of patients and re-computing fit and clustering with the original parameters. The heatmap shows the empirical probability estimated via this jackknife approach, and the boxplot on the right is computed per cluster; for each cluster the number of points used is equal to *n*(*n* − 1)/2, where *n* is the cluster size (Main Text, Fig. 4). We report the counts of edges per patient in the bottom boxplot, where we annotate those with median above four. Top and bottom boxes of each boxplot are 25th and 75th percentiles, centerline is the mean; the upper whisker is located at the smaller of the maximum value and 75th percentile + 1.5 IQR (Inter Quartile Range), and the lower whisker is located at the larger of the minimum value and 25th percentile − 1.5 IQR; dots are outliers, which are less than 25th percentile − 1.5 IQR or more than 75th percentile + 1.5 IQR.

### Supplementary Figure 15 REVOLVER’s breast clusters against clusters of occurrences.

We compare REVOLVER’s clusters to those obtained by clustering the binary occurrences of the annotated drivers: we consider the pattern of occurrence in the cohort (A) and the clonality status (B). In both cases the input data is a feature matrix where an entry is 1 if the driver has CCF above 0 in any of the samples of a patient; for both cases we show a tanglegram and the entanglement score, as well as the dendrogram coloured with REVOLVER's clusters.
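The jackknife stability estimate described for Supplementary Figures 8 and 14 can be sketched as below. The sketch repeatedly drops a random 10% of patients, re-clusters the remainder, and estimates the probability that each pair of patients is clustered together. Here `cluster_fn` is a hypothetical stand-in for the full fit-and-cluster step, and normalising by the number of resamples in which both patients were retained is an assumption of this example.

```python
import random
from itertools import combinations


def jackknife_coclustering(patients, cluster_fn, n_resamples=1000,
                           drop_fraction=0.10, seed=0):
    """Estimate, for every pair of patients, the probability of co-clustering.

    cluster_fn takes the retained patients and returns a patient -> label map.
    """
    rng = random.Random(seed)
    ordered = sorted(patients)
    n_drop = max(1, round(drop_fraction * len(ordered)))
    together = {pair: 0 for pair in combinations(ordered, 2)}
    present = dict.fromkeys(together, 0)
    for _ in range(n_resamples):
        dropped = set(rng.sample(ordered, n_drop))
        kept = [p for p in ordered if p not in dropped]  # stays sorted
        labels = cluster_fn(kept)
        for a, b in combinations(kept, 2):
            present[(a, b)] += 1
            together[(a, b)] += labels[a] == labels[b]
    # Pairs never retained together have no estimate and are omitted.
    return {pair: together[pair] / present[pair]
            for pair in together if present[pair]}


# Hypothetical, deterministic stand-in for the clustering: group by first letter.
patients = ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3", "b4", "b5"]
probs = jackknife_coclustering(
    patients, lambda kept: {p: p[0] for p in kept}, n_resamples=50
)
```

For a cluster of size *n*, the per-cluster boxplot would then summarise the *n*(*n* − 1)/2 pairwise probabilities among its members.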

## Supplementary information

### Supplementary Text and Figures

Supplementary Figures 1–15 and Supplementary Notes 1–4

### Reporting Summary

### Supplementary Table 1

Analysis of colorectal cancers

### Supplementary Table 2

Analysis of lung cancers

### Supplementary Table 3

Analysis of breast cancers

### Supplementary Table 4

Analysis of kidney cancers

### Supplementary Software

REVOLVER software package (R source code), along with vignettes
