Estimating the Frequency of Single Point Driver Mutations across Common Solid Tumours

In cancers, such as common solid tumours, variants in the genome give a selective growth advantage to certain cells. It has recently been argued that the mean count of coding single nucleotide variants acting as disease-drivers in common solid tumours is frequently small, but varies significantly by cancer type (hypermutation is excluded from this study). In this paper we investigate this proposal using integrative machine-learning-based classifiers that we recently proposed for predicting the disease-driver status of single nucleotide variants (SNVs) in the human cancer genome. We find that predicted driver counts are compatible with this proposal and have similar variability by cancer type, and that, to a certain extent, the drivers are identifiable by these machine learning methods. We further discuss predicted driver counts stratified by stage of disease and driver counts in non-coding regions of the cancer genome, in addition to driver-genes.


Method and Datasets
The CScape classifier is described in Rogers et al. [1] and is trained with positive (disease-driver) variant data from COSMIC [2] and neutral variant data from the 1000 Genomes project [3]. The classifier presents an associated confidence measure, or p-score, on the range 0 for neutral to 1 for disease-driver, so a p-score of 0.7, for example, favours disease-driver status over neutral. This predictor is available at http://cscape.biocompute.org.uk/. CScape has sub-classifiers covering prediction in coding and non-coding regions of the cancer genome (CS-coding and CS-noncoding). To construct this classifier we investigated a variety of kernel-based methods [4] using the scikit-learn package (version 0.17.1). We found that gradient boosting [8] gave the best performance on validation data.
We used a variety of different data sources to train the classifier, which are more fully described in our earlier paper [1]. For example, for coding prediction we used feature groups labelled VEP (variant effect prediction, inclusive of amino acid substitution), 46-way and 100-way conservation, genomic context measures and spectrum kernels [4], with the latter covering genomic sequence information. The rationale for the conservation features is that single nucleotide variants in genomic regions which are highly conserved across species are more likely to be functional in human disease, relative to variants in regions where there has been significant sequence variation across species. This observation is used as a potentially informative data source via the feature groups 46-way conservation and 100-way conservation (the split between these two is based on the range of species considered). For non-coding prediction we used feature groups such as 46-way conservation, 100-way conservation, spectrum, genomic context and mappability (the latter measuring the uniqueness of a region).
To incorporate these feature groups and construct the sub-classifiers, CS-coding and CS-noncoding, we used a greedy sequential learning approach based on leave-one-chromosome-out cross-validation (LOCO-CV). This allows us to order the prospective data sources according to their accuracy on unseen validation data. We start by combining the two top-ranked data sources into a single kernel [4] and record its balanced accuracy according to LOCO-CV. We then add further prospective data sources in descending order of balanced accuracy, constructing a kernel for each combination of data sources. We terminate this greedy sequential addition of data sources if the balanced validation accuracy reaches a plateau or starts to decline. Once the learning process has terminated, we proceed to evaluation on unseen test data.
To derive predicted SNV-driver counts we used unseen test data from the International Cancer Genome Consortium [5]. In Supplementary Table 1 we present the total sample sizes, the number of hypermutator samples excluded and the number of zero-counts (i.e. samples for which no disease-driver single nucleotide variants were predicted at the given threshold on the p-score). The hypermutator and zero-count estimates in Supplementary Table 1 correspond to a threshold on the p-score determined by an FDR (false discovery rate) choice of 5% and are derived from coding region prediction only (i.e. from CS-coding). We excluded samples with evidence of hypermutation in the determination of the results in Figures 1 to 5 (main paper). Our criterion for exclusion was prediction of more than 500 SNV-drivers in coding regions. We did not include predictions from non-coding regions when identifying prospective hypermutators because of the weak performance of the non-coding predictor (CS-noncoding) [1] and the limited knowledge of the functionality of non-coding genomic regions. Skin cutaneous melanoma (SKM) had the highest proportion of hypermutators at 36.7%, followed by gastric adenocarcinoma (STAD) and colon adenocarcinoma (COAD) at 22.6% and 20.9% respectively.
Supplementary Table 1 also states the zero-count instances for prediction in coding regions; these instances were included when computing counts. For the majority of cancer types there are only a limited number of instances with an SNV-driver count of zero in coding regions. Thyroid cancer, with its very low overall mean count, could be expected to have a higher number of these, but the proportion is only 6.6% (for an FDR of 5%). Neuroblastoma also has a similarly low mean count for SNV-drivers, yet its proportion of samples with a zero count is high at 46.0%. This may indicate a more crucial role for other types of drivers in this disease, beyond single point mutations.
Supplementary Table 1: [...] used as test data in our study (under sample size), followed by the number of hypermutators and the number of samples with zero counts for SNV-drivers (for an FDR of 5%) in the latter two columns (for coding region prediction). Samples exhibiting potential hypermutation were excluded from our study; instances where zero SNV-driver counts were predicted were included.
Supplementary Figure 1: The mean or median number of disease-driver mutations (y-axis) is plotted against the threshold on the p-score (x-axis), which is an estimate of the confidence in assigning positive (disease-driver) status to a variant identified in the cancer genome. The horizontal line is the estimate of the mean proposed by Martincorena et al. [7]. If we lower the threshold on the confidence (p-score) we allow through more positive predictions.
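The greedy sequential addition of data sources under LOCO-CV, described earlier in this section, can be illustrated with a minimal sketch. It is not the CScape implementation: it assumes a recent scikit-learn release (rather than version 0.17.1), it combines feature groups by simple column concatenation instead of constructing a combined kernel, and the function names, the gradient boosting settings and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneGroupOut


def loco_balanced_accuracy(X, y, chrom):
    """Mean balanced accuracy over leave-one-chromosome-out folds."""
    scores = []
    for train, test in LeaveOneGroupOut().split(X, y, groups=chrom):
        # Illustrative settings only; not the settings used for CScape.
        model = GradientBoostingClassifier(n_estimators=20, random_state=0)
        model.fit(X[train], y[train])
        scores.append(balanced_accuracy_score(y[test], model.predict(X[test])))
    return float(np.mean(scores))


def greedy_select(groups, y, chrom):
    """Greedy sequential addition of feature groups (needs at least two groups).

    groups maps a feature-group name to an (n_variants, n_features) array.
    Returns the selected group names and their LOCO-CV balanced accuracy.
    """
    # Rank each group on its own by LOCO-CV balanced accuracy, best first.
    ranked = sorted(
        groups,
        key=lambda name: loco_balanced_accuracy(groups[name], y, chrom),
        reverse=True,
    )
    # Start from the two top-ranked groups (concatenation is an assumption;
    # the text combines data sources into a single kernel).
    selected = ranked[:2]
    best = loco_balanced_accuracy(np.hstack([groups[g] for g in selected]), y, chrom)
    for name in ranked[2:]:
        candidate = selected + [name]
        score = loco_balanced_accuracy(
            np.hstack([groups[g] for g in candidate]), y, chrom
        )
        if score <= best:  # plateau or decline: stop adding data sources
            break
        selected, best = candidate, score
    return selected, best


# Synthetic data: labels, three pseudo-chromosomes and three feature groups.
rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, size=n)
chrom = rng.integers(1, 4, size=n)
groups = {
    "conservation": y[:, None] + rng.normal(0, 0.5, size=(n, 2)),  # informative
    "context": y[:, None] + rng.normal(0, 1.5, size=(n, 2)),       # weaker signal
    "noise": rng.normal(size=(n, 2)),                              # uninformative
}
selected, score = greedy_select(groups, y, chrom)
print(selected, round(score, 3))
```

Grouping the folds by chromosome keeps variants from the held-out chromosome entirely out of training, which is the point of LOCO-CV; the stopping rule mirrors the plateau-or-decline criterion stated in the text.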

Threshold dependency of the counts and comparison with other methods
In Supplementary Figure 1 we plot the medians and means for the predicted numbers of SNV-drivers in coding regions of the breast cancer (left) and thyroid cancer (right) genome. The horizontal line is the estimate from Martincorena et al. [7]. At a threshold on the p-score of 0.9 the estimates are in approximate agreement. As noted in the main text, a less stringent choice of threshold on the p-score (the confidence in the prediction) lets through more positive (disease-driver) predictions. Retaining the threshold of 0.9, we see from Supplementary Figure 2 that there is approximate agreement on the cancer types with the smallest mean SNV-driver counts (e.g. thyroid cancer) and the largest (e.g. bladder urothelial cancer). CScape consistently gives higher mean and median counts than Martincorena et al. [7]. Nevertheless, the concept that coding SNV-driver sets are relatively small is supported.
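The dependence of the counts on the p-score threshold can be made concrete with a small sketch. It assumes that each sample is represented by an array of coding-region p-scores, that hypermutators are identified at the 0.88 threshold (the FDR 5% choice) using the limit of 500 predicted SNV-drivers, and that a variant counts as a driver when its p-score exceeds the threshold. The function name and the toy scores are hypothetical.

```python
import numpy as np


def summarise_counts(sample_scores, threshold, exclusion_threshold=0.88, limit=500):
    """Mean and median predicted SNV-driver counts at a given p-score threshold.

    sample_scores: list of 1-D arrays, one array of coding p-scores per sample.
    Samples with more than `limit` predicted drivers at `exclusion_threshold`
    are treated as hypermutators and excluded; zero-count samples are kept.
    """
    kept = [s for s in sample_scores if np.sum(s > exclusion_threshold) <= limit]
    counts = np.array([np.sum(s > threshold) for s in kept])
    return {
        "n_samples": len(kept),
        "n_hypermutators": len(sample_scores) - len(kept),
        "n_zero_count": int(np.sum(counts == 0)),
        "mean": float(counts.mean()),
        "median": float(np.median(counts)),
    }


# Toy p-scores for four samples; the last one mimics a hypermutator.
samples = [
    np.array([0.95, 0.91, 0.60, 0.10]),
    np.array([0.89, 0.40]),
    np.array([0.30, 0.20, 0.05]),
    np.full(600, 0.99),
]
for t in (0.5, 0.88, 0.9):
    print(t, summarise_counts(samples, t))
```

Lowering the threshold can only keep or increase each sample's count, which is why the curves in Supplementary Figure 1 rise as the threshold is relaxed.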
Additional plots complementing Figure 3 of the main paper
We give two additional plots below complementing Figure 3 of the main text. In Supplementary Figure 3 we present the full set of curves for the mean counts. In Supplementary Figure 4 we present the full set of curves for the median counts of SNV-drivers across all cancers (since only a selection is presented in Figure 3 of the main paper).

Estimating the number of SNV-drivers by stage of disease
Data were extracted from the International Cancer Genome Consortium database [5]. We were able to extract the clinical annotation by stage for only a subset of instances, so the data used for constructing Supplementary Tables 2 and 3 differ from, and are a subset of, the data used for deriving the figures in the main paper. Not all cancer samples have been staged, since some cancer staging requires molecular characteristics to be taken into account (for example, breast cancer): these cancers were omitted from the staging analysis. In line with our discussion in the main paper we used a cutoff on the p-score of 0.88 in coding regions. A trend towards increasing numbers of SNV-drivers with increasing stage of disease is not well established for malignant lymphoma, oral cancer, pancreatic cancer, neuroblastoma, renal cancer and thyroid cancer. For neuroblastoma, thyroid and renal cancer the number of SNV-drivers is low at the initial stage of disease and remains fairly constant and low throughout. For other cancers there is a more pronounced trend towards increasing numbers of SNV-drivers with stage of disease. This is the case for colorectal cancer, where the mean number of SNV-drivers rises from 16.6 at Stage I to 42.6 at Stage IV, supported by large sample sizes at each stage. Liver cancer is another cancer with an increasing number of SNV-drivers with stage of disease. Finally, both early onset (EOPC) and late onset prostate cancer (PRAD) have a systematic trend of increasing numbers of SNV-drivers as we proceed from early stage to late stage disease.
[...] [3] data, this constitutes an independent test set. A variant was labelled as a driver if the associated p-score for the confidence in that status exceeded 0.88. This value for the p-score cutoff was selected because it gives a false discovery rate (FDR) of 5% (see main paper, Section 2).
If a gene had at least one such SNV-driver, we incremented the sum and divided the final total by the number of donor samples considered for that cancer type. Given that sample sizes (number of donors) are generally quite large, the differences between occurrence rates of such driver mutations by gene are highly statistically significant. The lists below cover the top five genes by type of cancer, as discussed in the main paper. At the CScape website (http://cscape.biocompute.org.uk/), under the Help/Documentation webpage, we give a downloadable file (driver-genes) which lists these ranked genes, down to genes with no SNV-driver predicted at a confidence greater than 0.88.
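The per-gene occurrence rate described above can be sketched as follows. This is one reading of the procedure, not the authors' code: it assumes that a gene is counted at most once per donor, however many SNV-drivers that donor carries in the gene, and that the total is divided by the number of donors for the cancer type. The function name, the tuple layout and the placeholder gene labels are hypothetical.

```python
from collections import Counter


def gene_driver_rates(calls, n_donors, threshold=0.88):
    """Rank genes by the fraction of donors with at least one predicted SNV-driver.

    calls: iterable of (donor_id, gene, p_score) tuples for one cancer type.
    n_donors: number of donor samples considered for that cancer type.
    """
    # A (donor, gene) pair is recorded once, even with several drivers in the gene.
    donor_gene_pairs = {(donor, gene) for donor, gene, p in calls if p > threshold}
    per_gene = Counter(gene for _, gene in donor_gene_pairs)
    rates = {gene: count / n_donors for gene, count in per_gene.items()}
    # Highest rate first; ties broken alphabetically for a stable ranking.
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


# Toy calls for four donors (donor d4 has no variant calls at all).
calls = [
    ("d1", "GENE_A", 0.95),
    ("d1", "GENE_A", 0.90),
    ("d2", "GENE_A", 0.93),
    ("d2", "GENE_B", 0.89),
    ("d3", "GENE_B", 0.50),
]
ranking = gene_driver_rates(calls, n_donors=4)
print(ranking[:5])  # top five genes for this toy cancer type
```

Dividing by all donors, including those with no predicted driver in any gene, keeps the rates comparable across genes within a cancer type.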

Prediction on non-coding disease-drivers
We pursued a study of non-coding SNV-drivers proposed in the Pan-Cancer Analysis of Whole Genomes (PCAWG) study of Rheinbay et al. [10]. The dataset used is derived from the International Cancer Genome Consortium [5] and The Cancer Genome Atlas [6] and is independent of the datasets used to train CScape (COSMIC [2] and 1000 Genomes [3]). The results are tabulated in Supplementary Table 9 and derive from the prospective non-coding drivers listed among their top 50 single point mutation drivers (Extended Data Figure 1 in [10]). A restriction has been made to prospective drivers located on autosomes and labelled as residing in non-coding regions of the cancer genome.
Supplementary Table 9: This table only gives single nucleotide variants located on autosomes and labelled by our classifier as residing in non-coding regions. The table presents the chromosome (Chr.), position and reference nucleotide (Ref.) based on the GRCh37 reference genome. The three prospective variants are presented (Mut.), with the confidence of driver status given in the next column, in the same relative order, and derived from our predictor CScape (http://cscape.biocompute.org.uk). Mutation at a position is labelled oncogenic if all three variants from reference are predicted as having disease-driver status. Mutation at a position is labelled possibly oncogenic if some variants from reference are predicted as having disease-driver status.
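The labelling rule for a position can be written as a short sketch. The threshold used for non-coding driver status is not stated in this section, so it is left as a required argument, and the value 0.5 in the usage lines is purely illustrative. The function name and the label returned when no variant is predicted as a driver are also assumptions.

```python
def label_position(p_scores, threshold):
    """Label a genomic position from the p-scores of its three possible variants.

    p_scores: confidences for the three alternative nucleotides at the position.
    threshold: p-score above which a variant is predicted as a disease-driver
               (not specified in the text; supplied by the caller).
    """
    if len(p_scores) != 3:
        raise ValueError("expected one p-score for each of the three variants")
    n_drivers = sum(p > threshold for p in p_scores)
    if n_drivers == 3:
        return "oncogenic"
    if n_drivers > 0:
        return "possibly oncogenic"
    return "no driver predicted"  # case not named in the text; assumed label


# Illustrative scores only; 0.5 is a hypothetical threshold.
print(label_position([0.91, 0.85, 0.77], threshold=0.5))
print(label_position([0.91, 0.40, 0.30], threshold=0.5))
```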