Somatic mutation detection and classification through probabilistic integration of clonal population information

Somatic mutations are a primary contributor to malignancy in human cells. Accurate detection of mutations is needed to define the clonal composition of tumours whereby clones may have distinct phenotypic properties. Although analysis of mutations over multiple tumour samples from the same patient has the potential to enhance identification of clones, few analytic methods exploit the correlation structure across samples. We posited that incorporating clonal information into joint analysis over multiple samples would improve mutation detection, particularly those with low prevalence. In this paper, we develop a new procedure called MuClone, for detection of mutations across multiple tumour samples of a patient from whole genome or exome sequencing data. In addition to mutation detection, MuClone classifies mutations into biologically meaningful groups and allows us to study clonal dynamics. We show that, on lung and ovarian cancer datasets, MuClone improves somatic mutation detection sensitivity over competing approaches without compromising specificity.

G enomic accumulation of somatic point mutations, or single nucleotide variants (SNVs), can disrupt the regular activity of cells and result in cancer initiation and progression. Collectively, the complete repertoire of SNVs across a cancer genome (numbering in the thousands) form a statistically robust marker for inferring clonal populations and studying tumour evolution. As such, accurate detection of all somatic SNVs, including those with low prevalence, is vital as they can define clones with phenotypic properties of interest. Mechanistic association of specific clones with properties such as treatment resistance, metastatic potential, and fitness under therapeutic selective pressures remains a key objective of biomedical investigators studying tumour progression.
Phylogenetic analysis can encode the evolutionary lineage of tumour cells across time and anatomic space [1][2][3][4][5][6][7] . Sequencing of multiple samples of a cancer to reconstruct evolutionary patterns and drug response profiles are increasingly common. For example, in rapid autopsy programs, at the time of a patient's death, tens to hundreds of metastatic samples are collected for future study 8,9 . Recent multi-sample sequencing studies in renal, lung, ovary, breast, colorectal, and other cancers have revealed striking evolutionary and clinically important properties of cancers 5,7,10,11 . However, the analytical methods to detect mutations from such experimental designs are still immature, and few studies have leveraged shared statistical strength across samples to detect mutations with greater sensitivity.
In the limit case, all cells likely harbour unique genomes, however due to the nature of branched evolutionary processes, clones can be coarsely modelled as major clades in the cell lineage phylogeny of a cancer. These clades share the majority of mutations, and therefore define first approximations to the genotypes of clones. Clonal genotypes and their relative abundances in the cancer cell population can be approximated by clustering mutations measured in bulk tissues and estimating their cellular prevalences (the variant tumour cell fraction) 12,13 .
Phylogenetic algorithms mostly use mutations (represented as binary genetic markers), as inputs to infer the branched evolutionary lineages of tumour cells 14,15 . Thus, mutation detection accuracy will ultimately impact the performance of phylogenetic inference algorithms.
Detection of low prevalence mutations is a major challenge due to typically small signal to noise ratios, owing to: (i) contamination by normal cells; (ii) genome copy number alteration; and (iii) the presence of mutations in only a small fraction of tumour cells (intra-tumour heterogeneity). In this work, we illustrate that knowledge of the clonal population structure improves detection of mutations defining low prevalence clonal genotypes.
Although SNV calling algorithms are ubiquitous in the literature, it remains challenging to detect low prevalence mutations. Algorithms have been developed for calling mutations from a single sample 16,17 , paired (matched normal and tumour) samples [18][19][20][21][22] , or multiple samples 23,24 . We list several popular algorithms. Mutationseq uses a feature based classifier for calling mutations 20 , where the features are constructed from matched paired normal and tumour samples. Strelka is a method for somatic SNV and small indel detection from sequencing data of matched normal and tumour samples 18 . It is based on a Bayesian approach that uses normal and tumour samples' allele frequencies with normal expected genotype structure. MuTect uses a Bayesian classifier that employs various filters to ensure high specificity to detect mutations from matched normal and tumour samples 21 . FreeBayes uses short read alignments to call the most likely genotypes for the population at each position. It can be run in single mode using only one tumour sample or in multiple mode utilizing multiple tumour samples from the same patient 25 .
FreeBayes can detect somatic mutations if germlines are manually removed. MultiSNV jointly analyses all available samples under a Bayesian framework to improve the performance of calling shared mutations 23 . SNV calls from GATK 26 are refined and corrected by using phylogeny information across multiple samples 24,27 .
In our proposal, MuClone, we exploit prior knowledge of tumour content, tumour cellular prevalence, and copy number information across multiple samples to improve detection of somatic SNVs, and in particular, low prevalence ones. Our model uses mutation clusters and copy number information obtained from standard approaches 28,29 . In the first step, a set of stringent SNV calls or validated SNVs (using targeted sequencing data) is used to infer mutation cluster information. Then, MuClone uses the inferred cluster information to more accurately call mutations across genome (whole genome or exome sequencing data). In addition to calling mutations, MuClone also classifies mutations into clusters based on cellular prevalence. This provides the user with the opportunity to profile mutation changes across time and space, and adds a rich layer of interpretation into the detection process.
We test MuClone through simulation studies and an application to real, multiple sample, patient data. These experiments reveal that incorporating the cellular prevalences of different clusters improves accuracy. Moreover, in real data, MuClone exhibits higher sensitivity (true positive rate or recall) in detecting mutations without compromising specificity (true negative rate) compared with other methods.

Results
Synthetic data. In this section, we examine the performance of MuClone on simulated data. In what follows, we generate N loci from M samples with K underlying tumour mutation clusters with sequencing error rate ϵ and tumour content t m .
We first randomly generate an evolutionary relationship between clusters, viewed as a binary phylogenetic tree. Each node in the tree represents a mutation cluster. The root node represents the ancestral cluster. For each sample, the cellular prevalence of the first descendant, ϕ 1 st , is sampled from a Uniform distribution over [0, ϕ parent ], where ϕ parent is the cellular prevalence of the parent node (cluster). The cellular prevalence of the second descendant, ϕ 2 nd , is sampled from a Uniform distribution over ½0; ϕ parent À ϕ 1 st , defined so that the sum of the children's prevalences do not exceed their parent's cellular prevalence. The absence or presence of each cluster, in each sample, is sampled from a Bernoulli distribution that assigns equal probability to both outcomes. If a cluster is not present in a sample, the corresponding cellular prevalence will be 0. See Supplementary Fig. 1 for an example of this process. Loci are assigned to a cluster uniformly at random from {0, …, K}, where cluster 0 represents the wildtype cluster and {1, …, K} are mutation clusters. For each locus, in each sample, the number of reads overlapping the locus (depth) is sampled from a Poisson distribution with mean d m . Wildtype copy number is deterministically set to 2, and a copy number profile (major and minor copy number) is generated according to the following steps: The total copy number, C t , is sampled uniformly at random from {1, …, C max }. An integer number, C b , is randomly (following a discrete Uniform distribution) picked from 1 to C t , and C a is defined as C a = C t − C b . Lastly, the major copy number is set to the maximum of C b and C a ; the minor copy number is set to the minimum of those two values. Then, corresponding to each cluster, the number of variant reads are sampled from the Beta-Binomial distribution described in Eq. (7) with precision parameter equal to 1000.
Synthetic data evaluation. We simulated 10 synthetic data for 20,000 loci from 4 samples of a patient, with 5 underlying clusters, including an ancestral cluster. The maximum copy number was 5, and the error rate was 0.01. The average sequencing depth was assumed to be 100 for all samples.
To assess the performance and robustness of MuClone, we systematically shielded MuClone from clonal information (Fig. 1). In particular, the cellular prevalence information was perturbed by (i) adding noise to its value, or (ii) removing the cellular prevalence information of the clusters. The noise was generated  MuClone's performance as a function of wildtype prior: 10 synthetic datasets generated for 20,000 loci, from 4 samples of a hypothetical patient, with 5 underlying clusters. The maximum copy number is 5 and the error rate is set to 0.01, and average sequencing depth is approximately 100. We assess the performance for Φ T equal to 0.001 (dark purple), 0.01 (light purple), 0.02 (white smoke), 0.03 (light blue), 0.04 (dark blue). a Sensitivity and b Specificity of MuClone with parameters: error rate = 0.01, tumour content = 0.75, and precision parameter = 1000 from a normal distribution with mean zero and standard deviations: 0, 0.01, 0.1, and 0.2. The noise value, ν, was added to the cellular prevalence of the cluster, while bounding the resulting value between 0 and 1, that is, where ϕ Ãz m and ϕ z m are the perturbed and original cellular prevalence of cluster z and sample m, respectively. The clusters which their clonal information was removed, were randomly chosen with equal probabilities. As expected, both sensitivity and specificity were highest with complete and accurate clonal information; see Fig. 1. This suggests that incorporating clonal information can improve mutation detection accuracy and gives evidence to support MuClone's approach. Furthermore, since the sensitivities were only marginally impacted by adding noise to the clonal information, MuClone should be able to cope with modest misspecification of the prior. However, specificity can decrease if the cellular prevalence is reduced to levels associated with the wildtype cluster and sensitivity can improve if adding noise increases the cellular prevalence to levels associated with a removed mutation cluster. Naturally, accuracy is most severely impacted with reduced/ corrupted clonal information; see Fig. 1. For the modest level of noise (noise standard deviation 0 and 0.01), the sensitivity and specificity of removing various numbers of clusters were compared through a Kruskal-Wallis test (4e−5 ≤ p-values ≤ 1e−4) which shows that the change in performance due to clonal information is significant. In noiseless settings, the confidence interval for the difference (of zero and four removed clusters) in mean sensitivity and specificity are [0. 16 We also explored how sensitivity and specificity change as a function of the wildtype prior and the threshold Φ T used to distinguish the cellular prevalence cutoff of a mutation cluster. In Fig. 2, we tested MuClone with wildtype prior values 0.5, 0.75, and 0.99, and with Φ T values 0.001, 0.01, 0.02, 0.03, and 0.04. In the case that the wildtype prior equals 0.5, we assumed that a locus is equally likely to be a mutation or not (when no other information is provided). MuClone's sensitivity and specificity are near 1 for Φ T equal to 0.02 and wildtype prior equal to 0.5. As expected, with small values of Φ T , the sensitivity and specificity decrease since it is difficult to distinguish between wildtypes and mutations with small cellular prevalences. The sensitivity also decreases for large values of Φ T because mutations are miscalled as wildtypes. When the error rate was 0.01, and wildtype prior was 0.5, the optimal Φ T was about 0.02. We used these values for the following experiments.
The performance of MuClone was tested with various tumour content (from 0.1 to 0.99) and different error rates (0.01 and 0.001); see Fig. 3. For samples with tumour content greater than 0.5, the sensitivity and specificity remain close to 1. Sensitivity and specificity decreased to only about 0.9 when the tumour content in the sample is as low as 0.1. These results establish promising performance over different ranges of tumour content with different error rates (likely scenarios in real data).
In addition, we also explored the performance of MuClone for samples with different coverage (mean depth): 30, 60 and 100; see Fig. 4. Intuitively, the performance is higher when we have more coverage. Since MuClone leverages cellular prevalence information to improve the performance of mutation detection, the performance gain is noticeable when the variant allelic ratio resolution supports the given cellular prevalence resolution (in our analysis the cellular prevalence of mutations is greater than 0.02). Figure 5 demonstrates how well mutations are classified by MuClone. The input clonal information was perturbed by adding noise from zero mean normal distribution with standard deviation 0.01 to simulate a more realistic scenario. In Fig. 5a, each bin (i, j) shows the fraction of mutations in cluster i that are classified into cluster j by MuClone. Figure 5a shows that 85% of mutations are classified into the correct cluster.
In order to show that the classification errors occurred between clusters with small phylogenetic distance, we define a misclassification index to quantify phylogenetic distance; calculated as where q (i,j) is the number of mutations in cluster i that are classified into cluster j, and the Euclidean distance between the cellular prevalences of cluster i and j is denoted by dist (i,j) . The distance of the closest and farthest cluster to cluster i is denoted by dist min i and dist max i , respectively. In Fig. 5b, small misclassification indices demonstrate that misclassified mutations occur between close clusters. This can be interpreted as phylogenetically recently separated clusters.
Real data. Two real data sets with multiple samples for each patient were used to evaluate the performance of MuClone. The first data set was multiple whole genome sequencing data from 7 patients with high grade serous ovarian cancer. The second data set was multiple whole exome sequencing data from 8 patients with non-small-cell lung cancer (NSCLC).
High grade serous ovarian cancer. We tested MuClone's performance on whole genome sequencing data (with depth 30x) from multiple tumour samples surgically resected from high grade serous ovarian cancer patients 5  The copy number, tumour purity, and mutation cluster information for experimentally re-validated mutation status were taken from the phylogenetic study of high-grade serous ovarian cancer (see the supplementary note of the paper 5 ). Mutation clusters were estimated with PyClone 12 on the deep targeted sequencing data (>1000x coverage) from the same samples and in three patients with accompanying single cell sequencing data (see Table S16 in the phylogenetic study of high-grade serous ovarian cancer paper 5 ). Copy number and tumour purity estimates were calculated with the TITAN software 28 . In order to eliminate germlines, loci with variant nucleotides in the corresponding normal sample were removed from the dataset. Then, the performance of MuClone was benchmarked against Strelka 18  In Fig. 6, the performance of MuClone is compared with other methods executed with default settings. For each patient, p, we assessed performance by averaging Youden's index, sensitivity, and specificity across different samples. For patient p, with n p samples, these are calculated as Sensitivity i p ; where Sensitivity i p , Specificity i p and Youden's index i p are the sensitivity, specificity and Youden's index of sample i and patient p, respectively. In aggregate, MuClone outperforms other methods by improving sensitivity without compromising specificity; see Fig. 6. False negatives arise mainly because the WGS data is under-represented (the average depth of the WGS data is about 30x) and lacks variant alleles that are present in the targeted sequencing data. False positives arise due to erroneous signal from sequencing technical artefacts. In Fig. 6, Strelka, MutationSeq, MuTect and Naive MuClone have lower performance as they do not incorporate information across multiple samples. FreeBayes was run on multiple samples and germlines were removed manually, but since the method only considers tumour samples, it has the lower performance versus other methods.
To assess the performance of MuClone and MultiSNV, we conducted a two-sided t-test for the difference in the mean of Youden's index evaluated on mutation calls from MuClone and MultiSNV. The 95% confidence interval is [0.03, 0.1], with pvalue equal to 0.0006; this shows that the difference is statistically significant. Importantly, MuClone improves sensitivity, enabling the detection of more mutations across the whole genome. Figure 7 depicts the classification of mutations into clusters relative to the ground truth, as defined by running PyClone on the data (omitting singleton clusters 5 ). Each bin (i, j) of Fig. 7 shows the fraction of mutations in cluster i that are classified into cluster j by MuClone; 93% are correctly classified by MuClone. Moreover, we notice that misclassified mutations are classified into phylogenetically similar clusters (the misclassification index for patient 1 was 0.015).
Non-small-cell lung cancer. We tested MuClone's performance on early-stage NSCLC samples from the TRACERx data set 7 . To help obtain the clonal and subclonal census, multiple tumour regions for each patient were sequenced by Illumina HiSeq. We used the copy number, purity estimate, and the mutation cluster information available in TRACERx study Supplementary Material 7 . In the TRACERX study, the cellular prevalence was calculated from the whole exome sequencing data on a set of stringent mutations that were selected from MuTect and VarS-can2 results with post-processing. In addition, the TRACERx study added a few mutations to reduce missed subclonal mutations; see Supplementary Appendix of TRACERx study 7 .
To compare the performance of MuClone with Strelka, MultiSNV, and MuTect, we randomly selected 8 patients with subclonal mutations from the TRACERx data set (see Table S2 Supplementary Appendix 1 of the paper 7 ). The TRACERx study generated a re-validated and curated list of mutations for their analysis; see Supplementary Appendix 2 of TRACERx study 7 . The mutations with full copy number information across all 8 patients were used as ground truth to evaluate performance.
We evaluated the false negative rates of mutation calling across several methods; see Table 1. Altogether, out of 7238 mutations, MuClone missed 475 mutations while Strelka, MultiSNV and MuTect missed 7205, 5720, and 1086 mutations respectively.  Hence, borrowing statistical strength, as done in MuClone, across samples likely increases sensitivity to real mutations. We next ran MuClone, MultiSNV and MuTect on the whole exome data from multiple samples of 8 patients to ascertain specificity. We note that MuClone removes reads with mapping quality less than 5 and for positions that have (i) a variant nucleotide in a normal sample, (ii) more than 40% filtered basecalls (A basecall is filtered if more than 3 mismatches occur between the read and the reference within a window of 20 bases on each side of the site.); or (iii) more than 75% of the reads that cross the site have deletions in any of the samples 18 . For exome sequencing data, mutations were called if the corresponding MuClone mutation probability is greater than 0.9. The other methods were executed with default settings. The total number of calls and the number of common calls between different methods (restricted to positions with copy number information) at the patient level is depicted in Fig. 8. A high degree of variation across callers is observed. Altogether, MuClone called 13,556 mutations while MultiSNV and MuTect called 31,374 and 11,915, respectively. MultiSNV output the largest number of calls in all of the samples, while MuTect and MuClone output similar number of calls. Figure 8 also demonstrates the mutations used in the TRACERx study and their overlap with the mutation calls in different methods. The set of mutations overlapping between MuClone and TRACERx is most similar; this suggests that the increase in sensitivity conferred by MuClone does not come at the expense of specificity.
We also explored the performance of MuClone when clonal information differs in the number of input clusters or the value of the cellular prevalence; see Fig. 9. We perturbed the value of the cellular prevalences (estimated by PyClone) by adding noise from a mean zero normal distribution with different standard deviations: 0, 0.01, 0.1, and 0.2. We see that MuClone is robust to slight changes of cellular prevalence values. We also shielded MuClone from different fractions of the clonal information and that decreased the performance more than adding noise. In general, this result shows that more accurate clonal information provides better mutation detection.

Conclusion.
We studied the use of clonal information for the purpose of somatic mutation detection and classification in multisample whole-genome sequencing data. The proposed statistical framework uses the clusters cellular prevalences and copy number information for detection and classification of low prevalence mutations. Our proposal, MuClone, outperformed other popular mutation detection tools, while providing the added benefit of classifying whole genome sequencing mutations into biologically relevant groups. Both synthetic and real data results showed that using the cellular prevalences of tumour clusters can improve mutation detection sensitivity. Importantly, our results suggest improvement in sensitivity can be achieved without compromising specificity.
Since the accuracy of detecting mutations can affect the performance of phylogenetic analysis, we suggest improvement from using MuClone will impact the field of multi-region sequencing for cancer evolution studies. As the field matures, we expect that the method presented here will be incorporated into more analytically comprehensive modelling of whole genome sequencing data when multiple samples are used to infer properties of clonal dynamics. Next steps are in developing a unified iterative algorithm that alternates between identifying the phylogenetic structure of the constituent clones comprising each tumour sample, and detection of mutations leveraging the new phylogenetic structure.
As sequencing costs continue to decrease (e.g., with Illumina's NovoSeq platform), multi-sample whole genome sequencing of tumours will continue to proliferate (e.g., rapid autopsy program) as a viable experimental design. Thus, MuClone will be an asset in the arsenal of analytical methods deployed to interpret evolutionary properties of cancer and to gain insights into clonal dynamics in time and space.

Methods
Description of MuClone. MuClone uses previously known cellular prevalence information to improve mutation detection and classification. For each sample, MuClone detects mutations from joint analysis of multiple samples. We encode this process in a generative probabilistic framework to perform joint statistical inference of multiple observations (from multiple samples) of the variant allele counts of a mutation of interest. The inputs to the model are: the number of variant reads, and the depth for a set of sequenced loci from multiple samples derived from the same patient; a measure of allele-specific copy number at each locus, in each sample, with tumour content; and the cellular prevalence and the abundance of underlying mutation clusters. MuClone outputs (i) a probability for each locus, at each sample, of being a mutation, and (ii) its cluster.
The probabilistic graphical model of MuClone is depicted in Supplementary  Fig. 2.
Model definition. We first define g n m the genotype of a given locus n in sample m, taking values in G ¼ fA; B; AB; AAB; ABB; g. For example, the genotype ABB where vðg n m Þ : G ! N and cðg n m Þ : G ! N return the number of the variant allele and the copy number of genotype g n m respectively, for example v(ABB) = 2 and c (ABB) = 3. The variable ϵ > 0 is a small positive constant that accounts for sequencing error. It allows for non-zero variant reads, due to sequencing error, when there are no variant alleles in genotype g n m . However, since the sequenced reads are independently sampled from an infinite pool of DNA fragments, at a given locus, each read may belong to the normal, reference, or variant population. The normal population stands for normal cells; The reference population are tumour cells which do not have the mutation at the given locus; and the variant population are the ones carrying the mutation. Therefore, using a single genotype state, g n m , introduces error into our analysis. To account for this fact, we consider using the full genotype state, ψ n m ¼ g N n m ; g R n m ; g V n m À Á , at a given locus n and sample m, to model the number of variant reads. Normal population fraction is 1 − t m where t m is the tumour content of sample m and the cellular prevalence of the mutation is ϕ z m which is the fraction of tumour cells carrying the mutation. According to our prior knowledge, we assume mutations are clustered into K clusters. For a given locus n, Z n = z ∈ {1, …, K} defines which cluster the mutation belongs to. If a position is not a mutation then it belongs to wildtype cluster identified by Z n = 0.
Therefore, for a mutation at a given locus n and sample m with cellular prevalence ϕ z m and tumour content t m , the variant allele probability is denoted by ξðψ n m ; ϕ z m ; t m Þ. It is proportional to the sum of the (properly scaled) variant probabilities from each population: Considering the full genotype state, the number of reads containing the variant alleles at a given locus n that belongs to cluster Z n follows a Binomial distribution with probability μðZ n Þ ¼ ϵ if Z n ¼ 0 ξðψ n m ; ϕ z m ; t m Þ if Z n ¼ z and z 2 f1; ; Kg; If locus n is mutated in at least one of the M samples, then the probability of mutation, in each sample, is calculated separately as where J Ã m is the set of clusters of sample m whose cellular prevalences are greater than a fixed positive threshold called Φ T , The threshold Φ T distinguishes the clusters of sample m in which their non-zero cellular prevalence are due to actual variant alleles. The default value of Φ T is zero. However, depending on the method used for estimating cellular prevalences, it can be set to another positive value, if some non-zero input cellular prevalences indicate wildtype clusters. In addition, MuClone assigns the locus to cluster z * that maximizes z Ã ¼ argmax This classifies mutations to one of the previously known clusters. The classification of mutations helps in biological interpretation and phylogenetic analysis of the data.

Data availability
The high grade serous ovarian cancer data used in the current study are available on-line at https://bitbucket.org/fdorri/muclone. All sequencing data has been deposited in the European Genome-Phenome Archive under study accession EGAS00001000547 5 . The NSCLC data used for the analysis is included in suppleantary material of TRACERx study manuscript. Sequencing data is also available for download in the European Genome-Phenome Archive under accession number EGAS00001002247 7 . The patient consent for high grade serous ovarian cancer and NSCLC data are outlined in their primary manuscript 5,7 .