Engineered in-vitro cell line mixtures and robust evaluation of computational methods for clonal decomposition and longitudinal dynamics in cancer

Characterization and quantification of tumour clonal populations over time via longitudinal sampling are essential components in understanding and predicting the response to therapeutic interventions. Computational methods for inferring tumour clonal composition from deep-targeted sequencing data are ubiquitous; however, owing to the lack of ground truth biological data, evaluating their performance is difficult. In this work, we generate a benchmark data set that simulates tumour longitudinal growth and heterogeneity by in vitro mixing of cancer cell lines in known proportions. We apply four different algorithms to our ground truth data set and assess their performance in inferring clonal composition using different metrics. We also analyse the performance of these algorithms on breast tumour xenograft samples. We conclude that methods that can simultaneously analyse multiple samples while accounting for copy number alterations as a factor in allelic measurements yield the most accurate predictions. These results will inform future functional-genomics-oriented studies of model systems, where time series measurements in the context of therapeutic interventions are becoming increasingly common. These studies will need computational models that accurately reflect the multi-factorial nature of allele measurement in cancer including, as we show here, segmental aneuploidies.


Description of the supplementary tables
The following supplementary tables are available for this manuscript.
Table S1a (S1a.xls): Genomic coordinates of targeted SNVs for Experiment 1, along with cell line group membership and reference and variant read counts for each sample.
Table S1b (S1b.xls): Genomic coordinates and copy numbers of targeted SNVs for Experiment 2, along with cell line group membership and reference and variant read counts for each sample.
Table S10 (S10.xlsx): Median absolute errors and interquartile ranges (IQR) in estimating mutation prevalences for each algorithm in Experiment 2.

Statistical procedure for validation of SNV target positions
To validate and obtain the final list of SNV target positions for our experiments, we proceeded as follows. For each amplicon region, we used the numbers of reads carrying the reference and variant alleles at each position to calculate a background variant probability; we then used a binomial test with this probability to obtain a p-value for the significance of the target SNV in that region. To obtain the final list of SNVs specific to only one of the cell lines, we applied the binomial test to all amplicon regions in the two replicate mixtures containing 100% of that cell line (see Table 2 of the main text). We selected the positions with p-value < 10^-16 in both 100% mixtures that also appear in the initial list of positions for that cell line. To obtain the final list of shared SNV positions, we first took, for each cell line, the intersection of the positions with p-value < 10^-16 in its two 100% replicates, obtaining two lists; we then intersected these two lists with the original list of chosen shared target positions.
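The validation procedure can be illustrated with a minimal Python sketch. It is not the code used in the study, and several details are assumptions: the text does not specify how the background variant probability is estimated, so here it is taken as the pooled variant read fraction over the non-target positions of the amplicon, with a pseudocount to avoid a zero probability; the test is taken to be one-sided (upper tail); and the function names and position labels are hypothetical.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binom_upper_tail(k, n, p):
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p).

    Terms are summed in log space so that deep coverage does not overflow;
    extremely small p-values may underflow to 0.0.
    """
    if k <= 0:
        return 1.0
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    m = max(logs)
    return min(1.0, math.exp(m) * sum(math.exp(x - m) for x in logs))


def target_p_value(counts, target):
    """p-value for the target SNV within one amplicon.

    counts: list of (ref_reads, var_reads), one pair per position.
    target: index of the target SNV in that list.
    ASSUMPTION: the background variant probability is the pooled variant
    fraction at the non-target positions, with a +1/+2 pseudocount.
    """
    ref = sum(r for i, (r, v) in enumerate(counts) if i != target)
    var = sum(v for i, (r, v) in enumerate(counts) if i != target)
    background = (var + 1) / (ref + var + 2)
    t_ref, t_var = counts[target]
    return binom_upper_tail(t_var, t_ref + t_var, background)


def validated_positions(pvals_rep1, pvals_rep2, initial, alpha=1e-16):
    """Positions significant in both 100% replicates and in the initial list."""
    sig1 = {pos for pos, p in pvals_rep1.items() if p < alpha}
    sig2 = {pos for pos, p in pvals_rep2.items() if p < alpha}
    return sig1 & sig2 & set(initial)


# Hypothetical amplicon: a heterozygous-looking target at index 2.
amplicon = [(1000, 1), (999, 2), (500, 500), (1000, 0)]
p = target_p_value(amplicon, target=2)
```

Under the same assumptions, the shared list would be obtained by calling `validated_positions` once per cell line and intersecting the two resulting sets with the original list of shared target positions.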

Selected primers
See Tables S12a and S12b.

Subsampling results for single samples
Figures 5 and 6 show the V-measure scores obtained when downsampling the number of targets and the read depth, respectively, and applying SciClone and PyClone to each single sample in Experiment 1 separately. We observe in Figure 5 that PyClone achieves higher V-measure scores than SciClone. We also observe in Figure 6 that decreasing the read depth leads to lower V-measure values and that SciClone could not provide any results when the read depth was below 100.
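For reference, the V-measure is the harmonic mean of homogeneity (each inferred cluster contains SNVs from a single true group) and completeness (each true group is assigned to a single inferred cluster). The sketch below is a plain-Python illustration of this standard definition with equal weighting of the two terms; it is not the implementation used to produce the figures, and the example labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(a, b):
    """H(a | b): remaining uncertainty about labels a once b is known."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum(c / n * math.log(c / b_counts[bk])
                for (_, bk), c in joint.items())


def v_measure(true_labels, pred_labels):
    """Harmonic mean of homogeneity and completeness (equal weights)."""
    h_true = entropy(true_labels)
    h_pred = entropy(pred_labels)
    homogeneity = (1.0 if h_true == 0
                   else 1.0 - conditional_entropy(true_labels, pred_labels) / h_true)
    completeness = (1.0 if h_pred == 0
                    else 1.0 - conditional_entropy(pred_labels, true_labels) / h_pred)
    if homogeneity + completeness == 0:
        return 0.0
    return 2.0 * homogeneity * completeness / (homogeneity + completeness)


# Hypothetical example: true SNV groups versus inferred cluster ids.
truth = ["HCT116", "HCT116", "shared", "184-hTERT-L2"]
inferred = [0, 0, 1, 2]
score = v_measure(truth, inferred)
```

The score depends only on how SNVs are grouped, not on the cluster identifiers themselves, so it can compare inferred clusters with the known cell line groups without matching labels first.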

Figures 7 and 8 present the V-measure scores obtained when downsampling the number of targets and the read depth, respectively, and applying PyClone to each single sample in Experiment 2 separately. We observe smaller V-measure scores when the number of targets increases and higher median V-measure scores when the read depth increases.
Figures 9 and 10 show that the absolute prevalence errors produced by PyClone decrease as the number of targets and the read depth increase. Figure 11 shows the absolute errors in estimating mutation prevalence when applying PhyloWGS to the original data of each sample in Experiments 1 and 2 separately. We observe that the diploid samples in Experiment 1 lead to much smaller errors than the aneuploid samples in Experiment 2.

Figure legends
(e) Histogram of the VAFs from sample 6 in Experiment 1 (25% 184-hTERT-L2 and 75% HCT116).
(f) Histogram of the VAFs from sample 6 in Experiment 2 (25% DAH55 and 75% DAH56).
(a) Co-clustering performance of PyClone. This figure shows the performance of PyClone in assigning each SNV to its corresponding true cluster (HCT116, 184-hTERT-L2 and shared). Each inferred cluster corresponds to a different color.
(c) Co-clustering performance of SciClone. SNVs that are not assigned to their correct cluster are shown in black.
(a) Co-clustering performance of PyClone using the correct copy number information.
(c) Co-clustering performance of PyClone with noisy copy numbers.