Estimating the predictive power of silent mutations on cancer classification and prognosis

In recent years it has been shown that silent mutations, both inside and outside the coding region, can affect gene expression and may be related to tumorigenesis and cancer cell fitness. However, the predictive ability of these mutations for cancer type diagnosis and prognosis has not yet been evaluated. In the current study, based on the analysis of 9,915 cancer genomes and approximately three million mutations, we provide a comprehensive quantitative evaluation of the predictive power of various types of silent and non-silent mutations for cancer classification and prognosis. The results indicate that silent-mutation models outperform the equivalent null models in classifying all examined cancer types and in estimating the probability of survival 10 years after the initial diagnosis. Additionally, combining non-silent and silent mutations achieved the best classification results for 68% of the cancer types and the best survival estimation results for up to nine years after diagnosis. Thus, silent mutations hold considerable predictive power for both cancer classification and prognosis, most likely due to their effect on gene expression. We therefore recommend integrating silent mutations into cancer research in order to unravel the full genomic landscape of cancer and its ramifications for cancer fitness.


Supplementary Figure 2:
The correlation between the increase in mutational burden and the F1 score improvement obtained by adding silent features to non-silent features. The x-axis depicts the percentage of additional mutational burden that was added on average per patient when adding silent features to non-silent features. The y-axis depicts the percent of improvement gained in F1 score by adding silent features to non-silent features. Every dot represents a single cancer type.
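The two axes described above can be illustrated with a minimal sketch. It assumes that both quantities are relative changes, in percent, from the non-silent baseline to the combined (non-silent plus silent) feature set; the exact averaging used in the study is not stated here, and all names and numbers below are hypothetical.

```python
def percent_change(baseline, updated):
    """Relative change from baseline to updated, expressed in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (updated - baseline) / baseline


# Hypothetical per-cancer-type values (illustrative only, not study data).
# burden_* = average number of mutated features per patient.
cancer_types = {
    "type_A": {"f1_non_silent": 0.50, "f1_combined": 0.55,
               "burden_non_silent": 40.0, "burden_combined": 100.0},
    "type_B": {"f1_non_silent": 0.80, "f1_combined": 0.82,
               "burden_non_silent": 50.0, "burden_combined": 75.0},
}

# One (x, y) point per cancer type, matching the axes of the scatter plot:
# x = percent of additional mutational burden, y = percent F1 improvement.
points = {
    name: (
        percent_change(v["burden_non_silent"], v["burden_combined"]),
        percent_change(v["f1_non_silent"], v["f1_combined"]),
    )
    for name, v in cancer_types.items()
}
```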

Supplementary Figure 3: Spearman correlation between Jaccard similarity scores and misclassification rates of pairs of cancer types.
Every dot represents a pair of cancer types. The x axis denotes the pair's Jaccard similarity score and the y axis denotes their misclassification rate. The Spearman coefficient (Rho) and respective p value are noted above each graph.
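As an illustration of the quantities plotted here, the following sketch computes a Jaccard similarity score for a pair of sets and a Spearman coefficient between two paired lists. It assumes that each cancer type is summarized as a set of mutated features; that representation and the helper names are hypothetical, and the p value is not computed.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman coefficient: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

Given one Jaccard score and one misclassification rate per pair of cancer types, `spearman_rho(jaccard_scores, misclassification_rates)` would yield the Rho reported above each graph. In practice a library routine such as `scipy.stats.spearmanr` also returns the p value.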

Supplementary Figure 4: Feature-type distribution of the balanced all-features dataset and of the top ranked features for the classification task.
Feature-type distribution of the all-features dataset (top row), top ranked 100 features (middle row) and top ranked 10 features (bottom row). The feature rankings were obtained from the all-features models and were averaged across cancer types. The legend indicates the enrichment in the amount of each feature-type in the top 10 features when compared to its original amount in the balanced all-features dataset (ratio between bottom and top row).

Supplementary Figure 5: Polymorphism type distributions in the initial datasets, top 100 features and top 10 features obtained from the OVA models.
Each sub-figure (a-b) denotes a model. Within a sub-figure, every three clustered columns represent the distribution of the initial dataset (left column), top 100 features (middle column) and top 10 features (right column) of a single cancer type. The analysis was conducted using the feature importance rankings that were obtained from the balanced datasets. The Synonymous models contain only SNPs and thus are excluded from this analysis.

Supplementary Figure 6: Spearman correlations between gene rankings of pairs of models per cancer type.
The all-features model was excluded from the analysis. Every subplot (a-s) represents a single cancer type. Within a subplot, every graph depicts the correlation between two models. A dot in the graph represents a gene. The x axis denotes the gene's rank given by the first model and the y axis denotes its rank given by the second model. The Spearman coefficient (Rho) and respective p value are noted above each graph.

Supplementary Figure 7: Spearman correlation between the number of mutations documented per gene in the TCGA database and the gene's ranking obtained from the all-features models of the 19 cancer types.
Each graph represents a single cancer type. A dot represents a single gene. The x axis denotes the number of mutations documented in TCGA for the gene and the y axis denotes the gene's ranking obtained from the all-features model. The Spearman coefficient (Rho) and respective p value are noted above each graph.

Supplementary Table 5: Mutations within genomic positions spanned by the silent features in the 10 top ranked features list of the all-features model that were found to have an impact on regulation
For each cancer type, we searched for mutations that are in the genomic positions spanned by the top 10 genomic elements (whether they are high, medium or low resolution features) that impact expression regulation. This table lists silent mutations that affected at least 0.5% of patients of a certain cancer type and were found to have an impact on regulation. Columns: Chr: the chromosome in which the mutation occurred. Gene: the gene in which the mutation occurred. Cancer: the patient cohort that was found affected by the mutation.
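The 0.5% patient threshold can be sketched as follows. This is a minimal illustration, assuming mutation calls are available as (patient, mutation) pairs for one cohort; the record layout, identifiers and function name are hypothetical and do not reflect the pipeline used in the study.

```python
from collections import defaultdict


def recurrent_mutations(records, cohort_size, min_fraction=0.005):
    """Return {mutation: fraction of cohort affected} for mutations seen in
    at least min_fraction of the patients (0.005 corresponds to 0.5%)."""
    patients_by_mutation = defaultdict(set)
    for patient_id, mutation in records:
        # A set ensures each patient is counted once per mutation.
        patients_by_mutation[mutation].add(patient_id)
    fractions = {
        mutation: len(patients) / cohort_size
        for mutation, patients in patients_by_mutation.items()
    }
    return {m: f for m, f in fractions.items() if f >= min_fraction}


# Hypothetical cohort of 400 patients (illustrative only, not study data).
records = [
    ("patient_1", "mut_A"),
    ("patient_2", "mut_A"),
    ("patient_2", "mut_A"),  # duplicate call, counted once
    ("patient_3", "mut_B"),
]
kept = recurrent_mutations(records, cohort_size=400)
```

Here `mut_A` affects 2 of 400 patients (0.5%) and is kept, while `mut_B` affects 1 of 400 (0.25%) and is dropped.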

Supplementary Table 6: Mutations within genomic positions spanned by the non-silent features in the 10 top ranked features list of the all-features model that were found to have an impact on regulation
For each cancer type, we searched for mutations that are in the genomic positions spanned by the top 10 genomic elements (whether they are high, medium or low resolution features) that impact expression regulation. This table lists non-silent mutations that affected at least 0.5% of patients of a certain cancer type and were found to have an impact on regulation. Columns are as defined for Supplementary