Challenge design

To benchmark methods for tumor subclonal reconstruction, we built upon the ICGC–TCGA (International Cancer Genome Consortium–The Cancer Genome Atlas) DREAM Somatic Mutation Calling Challenge and its tumor simulation framework (Fig. 1a)19,20,21. We designed 51 tumor phylogenies (Supplementary Fig. 1) to cover a wide range of biological and technical parameters (Fig. 1b). In total, 25 of these phylogenies were based on manually curated tumors from the Pan-Cancer Analysis of Whole Genomes (PCAWG) study22, while 16 were based on non-PCAWG tumors13,23,24,25,26,27,28 (the Somatic Mutation Calling Tumor Heterogeneity and Evolution Challenge (SMC-Het) cohort). The remaining ten were designed as variations of a single breast tumor, each testing a specific edge case or assumption of subclonal reconstruction algorithms (the special cases; Extended Data Fig. 1a)13. We supplemented these with a five-tumor titration series at 8×, 16×, 32×, 64× and 128× coverage19 (the titration series). For each tumor design, we simulated normal and tumor BAM files using BAMSurgeon19 and then used the Genome Analysis Toolkit (GATK) MuTect29 to identify somatic SNVs and Battenberg13 to identify somatic CNAs and estimate tumor purity. These were provided as inputs to participating groups, who were blinded to all other details of the tumor genome and evolutionary history.

Fig. 1: Design of the challenge. a, Timeline of the SMC-Het DREAM Challenge. The design phase started in 2014 with final reporting in 2021. VM, virtual machine. b, Simulation parameter distributions across the 51 tumors. From left to right: number of subclones, whole-genome doubling status, linear versus branching topologies, NRPCC, total number of SNVs and fraction of subclonal SNVs. c, Examples of tree topologies for three simulated tumors (P3, T12 and S2). For each simulated tumor, its tree topology is shown on top of the truth (column 1) and two example methods' predictions (columns 2 and 3) for each subchallenge (rows). MRCA, most recent common ancestor.

Participating teams submitted 31 containerized workflows that were executed in a reproducible cloud architecture30. Organizers added five reference algorithms: an assessment of random chance predictions, the PCAWG ‘informed brute-force’ clustering31, an algorithm that placed all SNVs in a single cluster at the variant allele frequency (VAF) mode and two state-of-the-art (SOTA) algorithms (DPClust13 and PhyloWGS11). Each method was evaluated on seven subchallenges covering different aspects of subclonal reconstruction: sc1A, purity; sc1B, subclone number; sc1C, SNV cellular prevalences (CPs); sc2, clusters of mutations; sc3, phylogenies (Fig. 1c). Subchallenges 2 and 3 each have paired deterministic (‘hard’) (sc2A and sc3A) and probabilistic (‘soft’) (sc2B and sc3B) tasks. A Docker container for each entry is publicly available from Synapse (https://www.synapse.org/#!Synapse:syn2813581/files/). Each prediction was scored using an established framework, with scores normalized across methods within {tumor, subchallenge} tuples to range from zero to one19. Runs that generated errors and produced no outputs, that produced malformed outputs or that did not complete within 21 days on a compute node with at least 24 central processing units (CPUs) and 200 GB of random-access memory (RAM) were deemed failures (2,189 runs; Supplementary Table 1). Failures mainly occurred for two tumors with over 100,000 SNVs. To ensure that our conclusions were consistent across software versions, we executed updated versions of five algorithms (Extended Data Fig. 2 and Supplementary Table 1). Differences were modest (r = 0.74) but varied across subchallenges and algorithms; updates particularly influenced assessments of subclone number (sc1B; r = 0.34). In total, we considered 11,432 runs across the seven subchallenges (Supplementary Table 1) and refined these to 6,758 scores after eliminating failed runs and highly correlated submissions (r > 0.75) from the same team, while considering only submissions made during the initial challenge period (Methods and Supplementary Tables 2 and 3).

Top-performing subclonal reconstruction methods

We ranked algorithms on the basis of median scores across all tumors; no single eligible entry was the top performer across multiple subchallenges (Fig. 2a). For each subchallenge, a group of algorithms showed strong and well-correlated performance (Fig. 2b and Extended Data Fig. 3a–e), suggesting multiple near-equivalent top performers. Therefore, we bootstrapped across tumors to test the statistical significance of differences in ranks (that is, to assess rank_entry < rank_best and assign a P value under the null hypothesis that rank_entry = rank_best). sc1A and sc2B had single top-performing submissions, while two statistically indistinguishable (P > 0.1) submissions were identified for sc1B and sc1C, along with three for sc2A (Extended Data Fig. 4 and Table 1). The top performer for sc1A used copy-number calls alone to infer purity, while the second-best and statistically indistinguishable (P16) sc1A method used a consensus of purity estimates from both copy-number and SNV clustering.
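The bootstrap over tumors described above can be illustrated with a minimal sketch. It is not the challenge's scoring code: the function name, the data layout (one list of per-tumor scores per algorithm, in a shared tumor order) and the choice to count replicates in which the entry's median score is at least that of the best entry are all assumptions made for illustration. Comparing the two medians within a replicate is equivalent to comparing the two entries' relative ranks.

```python
import random
from statistics import median


def bootstrap_rank_pvalue(scores, entry, best, n_boot=1000, seed=0):
    """Fraction of bootstrap replicates in which `entry` ranks at least as well as `best`.

    scores: dict mapping algorithm name -> list of per-tumor scores
            (hypothetical layout; all lists share the same tumor order).
    """
    rng = random.Random(seed)
    n_tumors = len(scores[best])
    hits = 0
    for _ in range(n_boot):
        # Resample tumors with replacement; the same resample is used for both entries.
        idx = [rng.randrange(n_tumors) for _ in range(n_tumors)]
        entry_median = median(scores[entry][i] for i in idx)
        best_median = median(scores[best][i] for i in idx)
        # Assumption: a tie counts as the entry matching the best performer.
        if entry_median >= best_median:
            hits += 1
    return hits / n_boot
```

Under this sketch, an entry would be called statistically indistinguishable from the top performer when the returned fraction exceeds the chosen threshold (P > 0.1 in the text).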

Fig. 2: Overview of algorithm performance. a, Ranking of algorithms on each subchallenge based on median score. The size and color of each dot show the algorithm rank on a given subchallenge, while the background color reflects its median score. The winning submissions are highlighted in red, italic text. b, Algorithm score correlations on sc1C and sc2A with select algorithm features. The top-performing algorithm for each subchallenge is shown in italic text. c,d, Algorithm scores on each tumor for sc1C (n = 805 {tumor, algorithm} scores) (c) and sc2A (n = 731 {tumor, algorithm} scores) (d). Bottom panels show the algorithm scores for each tumor with select tumor covariates shown above. The distribution of relative ranks for each algorithm across tumors is shown in the left panel. Boxes extend from the 0.25 to the 0.75 quantile of the data range, with a line showing the median. Whiskers extend to the furthest data point within 1.5 times the interquartile range. Top panels show scores for each tumor across algorithms, with the median highlighted in red. Tumors are sorted by difficulty from highest (left) to lowest (right), estimated as the median score across all algorithms.

Table 1 Top-performing methods for each subchallenge (subchallenges where the method was a top performer are indicated with X)

Seven algorithms were submitted to the phylogenetic reconstruction tasks (sc3A and sc3B). Multiple algorithms were statistically indistinguishable as top performers in both challenges (Extended Data Fig. 4) but accuracy differed widely across and within tumors. Two examples of divergent predictions are given in Supplementary Fig. 2a,b. The predicted and true phylogenies for all tumors can be found at https://mtarabichi.shinyapps.io/smchet_results/; true phylogenies are provided in Supplementary Fig. 1. Algorithms differed in their ability to identify branching phylogenies (Supplementary Fig. 2c) and in their tendency to merge or split individual nodes (Supplementary Fig. 2d). Parent clone inference errors shared similarities across algorithms; the ancestor inference for SNVs within a node was more likely to be correct if the node was closely related to the normal (that is, if it was the clonal node or its child) (Supplementary Fig. 2e,f). When algorithms inferred the wrong parent for a given SNV, most assignment errors were to closely related nodes (Supplementary Fig. 2g). As expected, these results emphasize that single-sample phylogenetic reconstruction was most reliable for variants with higher expected alternate read counts (that is, clonal variants) and their direct descendants; detailed phylogenies varied widely across tumors and algorithms.

The scores of methods across subchallenges were correlated (Extended Data Fig. 3f). This was in part driven by patterns in the set of submissions that tackled each problem and in part by underlying biological relationships among the problems. For example, sc1C, sc2A and sc2B assessed different aspects of SNV clustering and their scores were strongly correlated with one another but not with tumor purity estimation scores (sc1A). Rather, numerous algorithms scored highly on sc1A, suggesting that different approaches were effective at estimating purity (Extended Data Fig. 4).

Algorithm performance is largely invariant to tumor biology

To understand the determinants of the variability in algorithm performance between and within tumors, we considered the influence of tumor intrinsic features. We ranked tumors by difficulty, quantified as the median score across all algorithms for each subchallenge (Fig. 2c,d and Extended Data Fig. 3g–k). The most and least difficult tumors differed across subchallenges (Supplementary Fig. 3a) and tumor ranks across subchallenges were moderately correlated (Supplementary Fig. 3b). sc2A and sc2B were the most (ρ = 0.61) while sc1C and sc3B were the least correlated (ρ = −0.10).

To determine whether specific aspects of tumor biology influence reconstruction accuracy, we identified 18 plausible tumor characteristics. We supplemented these with four features that represent key experimental or technical parameters (for example, read depth; Supplementary Table 2). These 22 ‘data-intrinsic’ features were generally poorly or moderately correlated to one another, with a few expected exceptions such as ploidy being well correlated with whole-genome duplication (WGD; Extended Data Fig. 5a). For each subchallenge, we assessed the univariate associations of each feature with the pool of scores from all algorithms that ranked above the one-cluster solution (Extended Data Fig. 5b). As a reference, we also considered the tumor identifier (ID), which captures all data-intrinsic features as a single categorical variable. We focused on the subchallenges with large numbers of submissions and where scores could be modeled as continuous proportions using β regression (Methods). Individual data-intrinsic features explained a small fraction of the variance for sc1A, sc1C, sc2A and sc2B. Tumor ID explained ~15% of the variance in scores and no individual feature explained over 10%, suggesting that data-intrinsic features were not exerting consistently large influences on subclonal reconstruction accuracy across algorithms.

We hypothesized that data-intrinsic features might, therefore, exhibit a method-specific effect that would be clearer in algorithms with generally strong performance. We repeated this univariate analysis on scores from the top five algorithms in each subchallenge, which were moderately correlated (Supplementary Fig. 3c). This modestly enhanced the strength of the detected associations. In sc1C, the varying sensitivity of SNV detection across tumors (relative to the simulated ground truth) explained 15.7% of variance in accuracy (Fig. 3a). In sc2A, the read depth adjusted for purity and ploidy (termed NRPCC, number of reads per chromosome copy10) explained 19.8% of the variance across tumors. The total number of SNVs and the number of subclonal SNVs explained 9.3% and 9.2% of the variance for sc1C, as might be expected, because both define the resolution for subclonal reconstruction10. These results indicate that data-intrinsic features either explained little of the variability in subclonal reconstruction accuracy or did so in ways that differed widely across algorithms.

Fig. 3: Tumor features influence subclonal reconstruction performance and biases. a, Score variance explained by univariate regressions for the top five algorithms in each subchallenge. The heatmap shows the R2 values for univariate regressions for features (x axis) on subchallenge score (y axis) when considering only the top five algorithms. The right and upper panels show the marginal R2 distributions generated when running the univariate models separately on each algorithm, grouped by subchallenge (right) and feature (upper). Lines show the median R2 for each feature across the marginal models for each subchallenge. b, Models for NRPCC on sc1C and sc2A scores when controlling for algorithm ID. The left column shows the model fit in the training set composed of titration-series tumors (sampled at five depths each) and five additional tumors (n = 10 individual tumors). The right column shows the fit in the test set (n = 30 tumors, comprising the remaining SMC-Het tumors after removing the edge cases). Blue dotted lines with a shaded region show the mean and 95% confidence interval based on scoring ten random algorithm outputs on the corresponding tumor set. The top-performing algorithm for each subchallenge is shown in italic text. c, Effect of NRPCC on purity error. The top panels show the purity error with NRPCC accounting for algorithm ID with fitted regression lines. The sc1A scores across tumors for each algorithm are shown in the panel below. The bottom heatmap shows Spearman’s ρ between purity error and NRPCC for each algorithm. The winning entry is shown in bold text. Two-sided P values from linear models testing the effect of NRPCC on sc1A error (with algorithm ID) are shown. TS, titration series. d, Error in subclone number estimation by tumor. The bottom panel shows the subclone number estimation error (y axis) for each tumor (x axis) with the number of algorithms that output a given error for a given tumor. Tumor features are shown above. 
See Methods for detailed descriptions of each of these.

Algorithmic and experimental choices drive accuracy

Given the relatively modest impact of data-intrinsic features on performance, we next focused on algorithm-intrinsic features. We first modeled performance as a function of algorithm ID, which captures all algorithmic features. Algorithm choice alone explained 19–35% of the variance in scores in each subchallenge (Extended Data Fig. 5c). This exceeded the ~15% explained by tumor ID, despite our assessment of more tumors than algorithms.

To better understand the effect of algorithm choice, we quantified 12 algorithm characteristics. For example, we annotated whether each method adjusted allele frequencies for local copy number (Extended Data Fig. 5d). The variance explained by the most informative algorithm feature was 1.5–3 times higher than that of the most informative tumor feature (Extended Data Fig. 5c). Our analysis highlighted Gaussian noise models as particularly disadvantageous for SNV coclustering (sc2A) relative to binomial or β-binomial noise models (generalized linear model (GLM) B Gaussian = −0.98, P = 1.43 × 10⁻¹⁵, R² = 0.11). This trend became stronger when we compared algorithms with Gaussian noise models to those with binomial noise models and adjusted for tumor ID (B Gaussian = −1.11, P < 2 × 10⁻¹⁶, R² = 0.35).

The strong impact of algorithm choice on performance led us to hypothesize that data-intrinsic features show algorithm-specific influences on performance. Therefore, we developed multivariate models to control for algorithm ID when modeling data-intrinsic features. After making this change, SNV caller sensitivity, tumor purity and experimental read depth were significantly associated with increased scores for nearly all subchallenges (q < 0.05). These associations were consistent whether we analyzed all algorithms that exceeded the baseline (Extended Data Fig. 5e) or only the top five algorithms for each subchallenge (Supplementary Fig. 3d). Our results show that algorithm choice was the strongest driver of subclonal reconstruction accuracy, followed by technical data-intrinsic features. Biological data-intrinsic features were weak determinants of subclonal reconstruction accuracy.

Optimizing experimental design for subclonal reconstruction

Most data-intrinsic features reflect aspects of tumor biology not known a priori. In contrast, the main controllable technical feature is sequencing coverage. We investigated the sensitivity of subclonal reconstruction to this experimental design choice by considering NRPCC. By adjusting sequencing coverage for tumor purity and ploidy, NRPCC provides an excellent estimate of power in subclonal reconstruction10. We modeled the relationship between NRPCC and SNV coclustering subchallenge scores (sc1C and sc2A) using a GLM in which we controlled for algorithm ID, because of the strong influence of this feature in our univariate analyses above. We fit the model on five tumors with a coverage titration series (five points per tumor19) and on five randomly selected tumors, leading to 373 scores from these ten tumors. We then assessed model generalizability on 466 scores from 30 tumors. Nine edge cases and two tumors with a high mutation burden (>50,000 SNVs) were excluded from both the training and testing cohorts. As expected, higher NRPCC increased sc1C and sc2A scores for most algorithms (Fig. 3b). Increasing NRPCC improves coclustering by reducing read-sampling noise, thereby improving subclone resolution10,31. We observed an unexpected saturation effect; at high NRPCC, most variability in scores was because of differences among algorithms. These data quantify a clear benefit to tumor sequencing to an NRPCC of at least 32 for subclonal reconstruction from a single sample across the range of algorithms tested here.
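To make the adjustment of coverage for purity and ploidy concrete, the sketch below computes NRPCC under the commonly used definition (coverage times purity, divided by the average number of chromosome copies per cell in the sample). The function and parameter names are illustrative, the diploid normal contamination is an assumption, and the exact formula should be checked against ref. 10 before use.

```python
def nrpcc(coverage, purity, tumor_ploidy, normal_ploidy=2.0):
    """Number of reads per chromosome copy (illustrative sketch).

    coverage:      mean sequencing depth of the tumor sample
    purity:        fraction of tumor cells in the sample, in (0, 1]
    tumor_ploidy:  average copy number of the tumor genome
    normal_ploidy: assumed to be 2 for the contaminating normal cells
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    # Average number of chromosome copies per cell across tumor and normal cells.
    mean_copies = purity * tumor_ploidy + (1.0 - purity) * normal_ploidy
    # Reads contributed by tumor cells, spread over those copies.
    return coverage * purity / mean_copies


# A pure diploid tumor at 64x has 32 reads per chromosome copy ...
print(nrpcc(64, 1.0, 2.0))
# ... whereas the same coverage at 50% purity after whole-genome doubling has far fewer.
print(nrpcc(64, 0.5, 4.0))
```

The second call shows why raw coverage alone is a poor proxy for power: low purity and high ploidy both dilute the reads available per tumor chromosome copy.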

We replicated these analyses for estimation of tumor purity (sc1A). Lower NRPCC was associated with an overestimation of tumor purity (sc1A) in both the titration-series and the SMC-Het cohorts (Fig. 3c). This likely occurred because, in low-coverage sequencing data, SNVs detected on a few reads were indistinguishable from background data. These false negatives led to a truncated binomial distribution and overestimation of the average frequencies of detected SNV clusters10,31. Conversely, high NRPCC increased the number of subclonal mutations detected, causing some algorithms to underestimate purity (especially the naive one-cluster and random algorithms). In a similar way, NRPCC influenced the prediction of subclone number (sc1B). More algorithms underpredicted the number of subclones as the tree depth and the true subclone number increased (Fig. 3d; B tree depth = −1.18, P = 1.60 × 10⁻⁴¹, ordinal regression, likelihood ratio test), suggesting there was a limit to how many subclones could be distinguished at a given NRPCC. The number of subclones predicted increased with NRPCC for a given tumor for most algorithms (Extended Data Fig. 6a; B = 0.71, P = 2.99 × 10⁻²⁴). These data emphasize that it is critical to report NRPCC and interpret estimates of tumor subclonal diversity in that context.
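The truncation effect described above can be reproduced with a small exact calculation. The sketch assumes a fixed read depth at every site and a hard detection threshold of three alternate reads; both are simplifications chosen for illustration and do not reflect how MuTect or any specific caller decides. The function name is hypothetical.

```python
from math import comb


def mean_detected_vaf(depth, true_vaf, min_alt_reads=3):
    """Expected VAF among SNVs that pass a minimum alt-read threshold.

    Alt-read counts are modeled as Binomial(depth, true_vaf); sites with fewer
    than `min_alt_reads` alt reads are treated as undetected (false negatives).
    Requires 0 < true_vaf < 1 and depth >= min_alt_reads.
    """
    weighted_vaf = 0.0
    detected_mass = 0.0
    for k in range(min_alt_reads, depth + 1):
        p_k = comb(depth, k) * true_vaf**k * (1.0 - true_vaf) ** (depth - k)
        weighted_vaf += p_k * (k / depth)
        detected_mass += p_k
    # Mean of the left-truncated binomial, expressed as a VAF.
    return weighted_vaf / detected_mass


# A cluster with true VAF 0.1 looks inflated at low depth and unbiased at high depth.
for depth in (10, 30, 200):
    print(depth, round(mean_detected_vaf(depth, 0.1), 3))
```

Because only sites with enough supporting reads survive, the observed cluster frequency at low depth is biased upward, which is the mechanism the text proposes for purity overestimation at low NRPCC.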

Lastly, we asked whether other tumor features might bias the prediction of purity and subclone number. We used multivariate penalized regression with leave-one-out cross-validation to model sc1A and sc1B errors. After controlling for algorithm ID, the sc1A model explained 40.1% of the variance and the sc1B model explained 57.1%. The multivariate model for purity estimation error highlighted that increasing SNV clonal fraction (CF) and percentage genome altered (PGA) reduced the purity underestimation errors but algorithms were more likely to overestimate purity when the true purity was low (Extended Data Fig. 6b). The subclone number error model showed that algorithms were more likely to underestimate the number of subclones if there was a WGD. These results suggest that increasing power (that is, NRPCC) is especially important if there is a priori knowledge that a given tumor or tumor type is prone to low purity, CF or PGA or is likely to harbor a WGD10,31. These results also confirmed NRPCC as a crucial study design parameter that should be considered when interpreting subclonal reconstruction results.

Sources of error in SNV CP estimation

Estimating the fraction of cancer cells in which each SNV occurs is one of the most fundamental goals of subclonal reconstruction, shedding light on the evolution of mutational processes in a tumor3,31,32,33. To understand errors in these estimates, we focused on the 20 algorithms that produced submissions for both sc1C and sc2A. For each tumor, we annotated the SNV subclone assignments (sc2A output) with the predicted CP for that subclone (sc1C output; Fig. 4a). Most algorithms accurately determined whether an SNV was clonal; 14 of 20 had both median specificity and sensitivity above 80% (Fig. 4b). Clonal assignment specificity increased with NRPCC, as more subclonal SNVs were correctly assigned, leading to improved accuracy (Fig. 4c and Supplementary Fig. 3a; B log2(NRPCC) = 0.29, q = 3.11 × 10⁻¹⁷), and decreased with SNV caller precision (B log2(precision) = −1.24, q = 1.94 × 10⁻¹⁴; Supplementary Fig. 4a). Accuracy also slightly decreased with mutational burden and tumor CF (Supplementary Fig. 4a).

Fig. 4: Impacts of genomic features on SNV subclonality predictions. a, Schematic showing how outputs from sc1C and sc2A were used to annotate SNV CP for each entry. FN, false negative; FP, false positive; TN, true negative; TP, true positive. b, Mean clonal SNV detection sensitivity and specificity for each algorithm with standard errors (n = 727 {tumor, algorithm} predictions). c, Clonal SNV detection F scores for each entry on each tumor. d, Top, clonal accuracy for each algorithm, CNA category and tumor tuple (n = 5,392); bottom, SNV CP estimation error for each algorithm (n = 4,868,460 {algorithm, SNV CP} predictions). Boxes extend from the 0.25 to the 0.75 quantile of the data range, with a line showing the median. Whiskers extend to the furthest data point within 1.5 times the interquartile range. e, Effect size and false discovery rate-adjusted two-sided P values from entry-specific linear regression models for SNV CP error by CNA type and SNV clonality with median sc1C and sc2A scores. Top-performing entries are shown in italic text. f, SNV CP error grouped by subclone for a corner-case tumor simulated at two depths (n = 395,364 {algorithm, tumor, SNV} prediction errors). Boxes extend from the 0.25 to the 0.75 quantile of the data range, with a line showing the median. Whiskers extend to the furthest data point within 1.5 times the interquartile range. g, Correlation between BAM features and Battenberg output features with SNV CP error for each entry. Only features that had an absolute correlation > 0.1 are shown. Battenberg features are noted with a star and top-performing algorithms are highlighted in italic text.

The inference of SNV clonality was impacted by underlying copy-number states. Subclonal CNAs significantly reduced SNV clonality assignment accuracy relative to clonal CNAs after controlling for algorithm and tumor ID (B subclonal CNA = −0.21, P = 1.14 × 10⁻⁶, GLM). SNVs that arose clonally in a region that then experienced a subclonal loss had the least accurate clonal estimates (Fig. 4d; B clonal SNV × subclonal loss = −0.33, P = 3.06 × 10⁻²; Supplementary Table 3). Subclonal losses on the mutation-bearing DNA copy reduced VAF, causing many algorithms to underestimate the CP of these SNVs (W SNV clonal = 1.04 × 10¹⁰, P < 2.2 × 10⁻¹⁶, Wilcoxon rank-sum test for SNVs in subclonal deletions; Supplementary Table 3). Similarly, algorithms overestimated SNV CP in regions with subclonal gains and subclonal SNVs (W SNV clonal = 2.96 × 10⁹, P < 2.2 × 10⁻¹⁶, Wilcoxon rank-sum test; Supplementary Table 4). This resulted in lower accuracy (B subclonal SNV × subclonal gain = −0.32, P = 8.0 × 10⁻³, GLM; Fig. 4d and Supplementary Table 4). Biases in CP estimation because of CNAs differed among algorithms (Fig. 4e). To assess whether robustness to CNAs impacts performance, we associated the proportion of variance in SNV CP error explained by CNA status and SNV clonality in these models with algorithm score. Algorithms whose CP estimates were more robust to CNAs better estimated the overall subclonal CP distribution (sc1C; ρ CNA = −0.43) and better coclustered SNVs (sc2A; ρ CNA = −0.37; Supplementary Fig. 4b).
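A short worked example helps show why an unmodeled subclonal loss pulls the estimate down. The sketch uses the standard relationship between VAF, purity, local copy number and mutation multiplicity; it is not taken from any participating algorithm, all function names and numbers are illustrative, and it reports the fraction of cancer cells carrying the SNV, whose mapping onto the challenge's CP scale is left as an assumption.

```python
def expected_vaf(purity, mutated_copies_per_tumor_cell, total_cn_tumor, normal_cn=2.0):
    """Expected VAF given the average mutated and total copies per tumor cell."""
    total_copies = purity * total_cn_tumor + (1.0 - purity) * normal_cn
    return purity * mutated_copies_per_tumor_cell / total_copies


def cancer_cell_fraction(vaf, purity, total_cn_tumor, multiplicity=1.0, normal_cn=2.0):
    """Invert the relationship above for an assumed copy-number state and multiplicity."""
    total_copies = purity * total_cn_tumor + (1.0 - purity) * normal_cn
    return vaf * total_copies / (purity * multiplicity)


purity = 0.8  # hypothetical tumor purity

# Clonal SNV on one copy of a diploid segment: VAF is about 0.40.
vaf_before = expected_vaf(purity, 1.0, 2.0)

# Half of the tumor cells then lose the mutation-bearing copy: on average
# 0.5 mutated copies and 1.5 total copies per tumor cell, so VAF drops to about 0.25.
vaf_after = expected_vaf(purity, 0.5, 1.5)

# An algorithm that still assumes a clonal diploid state reads the lower VAF
# as a subclonal SNV (about 0.625) although the SNV arose clonally.
naive_estimate = cancer_cell_fraction(vaf_after, purity, 2.0)
print(vaf_before, vaf_after, naive_estimate)
```

The same algebra run in the other direction illustrates the overestimation reported for subclonal gains: extra mutated copies raise VAF, and an estimate that ignores them inflates the inferred prevalence.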

Because subclonal CNAs can be difficult to detect, we investigated whether copy-number calling errors aggravated the effects of CNAs on estimation of CP. As expected, clonal CNA regions were nearly perfectly detected by our CNA caller (Battenberg; Extended Data Fig. 7a). By contrast, 7 of 68 subclonal losses and 25 of 48 subclonal gains were entirely missed and six more were misestimated. The accuracy of subclonal CNA detection was strongly influenced by tumor NRPCC (Extended Data Fig. 7b). Elastic net logistic regression showed that CNAs in low-CP subclones and SNP-poor regions were less accurately detected (Extended Data Fig. 7c). While Battenberg CNA calling errors did not significantly impact the accuracy of SNV clonality assignment, algorithms were more likely to overestimate CP for SNVs on segments with incorrect CNA states, with consistent direction of error biases (Extended Data Fig. 7d and Supplementary Table 5).

SNV features also shaped error profiles independently of CNAs. Almost all algorithms were more likely to overestimate the CP of subclonal SNVs (Fig. 4d,e) because of reduced power at lower tumor read depths10,13,31. Examining two edge-case tumors with identical architectures emphasized that this bias increased for lower subclone CP and NRPCC (Fig. 4f). To quantify how other sources of error in SNV and CNA calls propagate to subclonal reconstruction, we derived 53 measures of variant call quality from the BAM files, VCF files and Battenberg outputs (Methods) that we hypothesized could impact CP estimation and correlated them with CP error. Variant call quality was associated with CP error in patterns that varied across metrics and algorithms, with mean SNV mapping quality showing positive associations for many algorithms (Fig. 4g).

Impact of neutral tail mutations on subclonal reconstruction

Recent work showed that the ever-growing tail of point mutations at ever lower frequency may impact subclonal reconstruction16. These so-called ‘neutral tails’ can be explicitly modeled in subclonal reconstruction; however, because of their low CP, their practical importance at conventional whole-genome sequencing (WGS) coverages has been unclear34. To quantify their impact, we inserted neutral tail mutations into four titration-series tumors. We used agent-based cell division34 to derive the number and prevalence of neutral mutations, varying the tumor’s overall mutation rate (Extended Data Fig. 8, Methods and Supplementary Note 2). We tested the five best algorithms for sc1A, sc1B, sc1C and sc2A (18 methods; 1,440 reconstructions).

The effect of neutral tail mutations on subclonal reconstruction was generally modest in terms of both algorithm ranking and absolute scores (Extended Data Fig. 8), as well as error profiles (Extended Data Fig. 9). Their impact was observed at higher sequencing depths (>64×) where they tended to increase subclone number estimates (sc1B; β = 0.42, P = 3.52 × 10⁻³; Extended Data Fig. 9). At 128× coverage, most algorithms assigned tail mutations to low-VAF subclones with a high proportion of tail mutations and the predicted CP of SNVs outside the neutral tail was largely unaffected (Extended Data Fig. 9). At high depths, it may then be advantageous to explicitly account for tail mutations to avoid spurious low-VAF clusters.

Consistent with these findings, MOBSTER filtering, which identifies and removes tail mutations, significantly improved mutation assignment scores, especially as the branching tail size increased and at a depth > 64× (Supplementary Fig. 5). It reduced spurious clusters and removed many false-positive mutations. Thus, prefiltering could be incorporated into subclonal reconstruction pipelines when there is sufficient sequencing depth (>64×). The precise benefits of such filtering across a broad range of tumor and genomic contexts remain unclear but our results suggest that they may be worth defining, especially in the face of high-NRPCC sequencing.

Pragmatic optimization of algorithm selection

We next sought to optimize algorithm selection across an arbitrary set of subchallenges. To visualize algorithm performance across all subchallenges, we projected both algorithms and subchallenges onto the first two principal components of the scoring space, explaining 66% of total variance (Fig. 5a). The blue ‘decision axis’ shows the axis of average score across subchallenges when all subchallenges were weighted equally and this axis was stable to small fluctuations in these weights (Fig. 5a). We randomly varied tumor and subchallenge weights 40,000 times across three groups of subchallenges: {sc1B, sc1C}, {sc1B, sc1C, sc2A} and {sc1B, sc1C, sc2A, sc2B} (Fig. 5b and Supplementary Note 3). Twelve algorithms (35%) reached a top rank within at least one study, while 22 (65%) were never ranked first. Because the choice of weights is ultimately user dependent, we created a dynamic web application for modeling the influence of different selections (https://mtarabichi.shinyapps.io/smchet_results/).
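The weight-perturbation analysis can be sketched as follows. This is an illustration rather than the procedure documented in Supplementary Note 3: the data layout, the use of independent uniform weights multiplied together for each {subchallenge, tumor} pair and the function name are assumptions, and the sketch requires a complete score matrix with no failed runs.

```python
import random


def top_rank_frequency(scores, n_draws=1000, seed=0):
    """Fraction of random weightings in which each algorithm ranks first.

    scores: dict mapping algorithm -> dict mapping (subchallenge, tumor) -> score
            in [0, 1]; every algorithm must have the same keys (assumption).
    """
    rng = random.Random(seed)
    keys = sorted(next(iter(scores.values())))
    subchallenges = sorted({s for s, _ in keys})
    tumors = sorted({t for _, t in keys})
    wins = {alg: 0 for alg in scores}
    for _ in range(n_draws):
        # Independent uniform weights for each subchallenge and each tumor.
        w_sub = {s: rng.random() for s in subchallenges}
        w_tum = {t: rng.random() for t in tumors}
        # The normalizing constant is shared by all algorithms, so the
        # weighted sum gives the same ranking as the weighted mean.
        total = {
            alg: sum(w_sub[s] * w_tum[t] * vals[(s, t)] for s, t in keys)
            for alg, vals in scores.items()
        }
        wins[max(total, key=total.get)] += 1
    return {alg: count / n_draws for alg, count in wins.items()}
```

An algorithm with a nonzero frequency under such a scheme corresponds to one that reaches the top rank for at least one choice of weights.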

Fig. 5: Performance across multiple algorithms and subchallenges. a, Projections of the algorithms and subchallenge axes in the principal components of the score space. A decision axis is also projected and corresponds to the axis of best scores across all subchallenges and tumors, when these are given equal weights. The five best methods according to this axis are projected onto it. A decision ‘brane’ in blue shows the density of decision axis coordinates after adding random fluctuations to the weights. b, Rank distribution of each method from 40,000 sets of independent random uniform weights given to each tumor and subchallenge in the overall score. From left to right: sc1B + sc1C; sc1B + sc1C + sc2A; sc1B + sc1C + sc2A + sc2B. Names of the algorithms have a star if they were ranked first at least once. c, Four subchallenges for each of which one ensemble approach could be used (sc1A, median; sc1B, floor of the median; sc1C, WeMe; sc2A, CICC; Methods); the median and the first and second tertiles (error bars) of the median scores are shown across tumors of independent ensembles based on different combinations of n methods (n is varied on the x axis). The dashed line represents the best individual score. d, Color-coded hexbin densities of median ensemble versus median individual scores across all combinations of input methods. The identity line is shown to delimit the area of improvement. e, Same as d for maximum individual scores instead of median scores.

Ensemble approaches have previously been used in many different areas of biological data science to combine outputs from multiple algorithms and improve robustness21,31,35,36. They have not been widely explored for subclonal reconstruction, in part because many subclonal reconstruction outputs are complex and heterogeneous31. To assess whether ensemble approaches could improve subclonal reconstruction, we identified and ran ensemble methods for individual subchallenges based on median or voting approaches, which served as conservative baselines (Methods).

The median ensemble performance increased with the number of input algorithms for all subchallenges (Fig. 5c). Ensemble performance was more consistent across tumors for sc1A and sc1B when more input algorithms were used, as shown by the decreasing variance in scores (Supplementary Fig. 6). Ensemble approaches outperformed the best individual methods for sc1B but not for sc1A, sc1C or sc2A (Fig. 5c), although above-median performance was achieved (Fig. 5d,e). These results show that the tested ensemble methods could match or modestly improve performance when the best algorithm was not known but at substantial computational costs (Supplementary Note 3).
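The two simplest ensembles named in the Fig. 5 legend (median for sc1A and floor of the median for sc1B) can be written in a few lines. The sketch below also enumerates every combination of n input methods, mirroring how ensemble size was varied; the function names and the input layout are illustrative assumptions, and the WeMe and CICC ensembles used for sc1C and sc2A are not shown.

```python
from itertools import combinations
from math import floor
from statistics import median


def ensemble_purity(purity_estimates):
    """sc1A-style ensemble: median of the individual purity estimates."""
    return median(purity_estimates)


def ensemble_subclone_number(subclone_counts):
    """sc1B-style ensemble: floor of the median predicted subclone number."""
    return floor(median(subclone_counts))


def purity_ensembles_of_size(estimates_by_method, n):
    """Median-ensemble purity for every combination of n methods.

    estimates_by_method: dict mapping method name -> purity estimate for one tumor
    (hypothetical layout).
    """
    return {
        combo: ensemble_purity([estimates_by_method[m] for m in combo])
        for combo in combinations(sorted(estimates_by_method), n)
    }


estimates = {"method_a": 0.60, "method_b": 0.70, "method_c": 0.90}
print(purity_ensembles_of_size(estimates, 2))
print(ensemble_subclone_number([1, 2, 3, 4]))
```

The number of combinations grows quickly with the number of available methods, which is one concrete reason the computational cost of ensembles noted above is substantial.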