A computational approach for the discovery of significant cancer genes by weighted mutation and asymmetric spreading strength in networks

Identifying significantly mutated genes in cancer is essential for understanding the mechanisms of tumor initiation and progression. This task is a key challenge since large-scale genomic studies have reported an endless number of genes mutated at a shallow frequency. Towards uncovering infrequently mutated genes, gene interaction networks combined with mutation data have been explored. This work proposes Discovering Significant Cancer Genes (DiSCaGe), a computational method for discovering significant genes for cancer. DiSCaGe computes a mutation score for the genes based on the type of mutations they have. The influence received for their neighbors in the network is also considered and obtained through an asymmetric spreading strength applied to a consensus gene network. DiSCaGe produces a ranking of prioritized possible cancer genes. An experimental evaluation with six types of cancer revealed the potential of DiSCaGe for discovering known and possible novel significant cancer genes.

Cancer mutation data were subjected to a preprocessing routine, and networks underwent a link prediction process. Lists of prioritized genes were evaluated concerning precision and discounted cumulative gain, using recent cancer driver genes benchmarks. The results showed DiSCaGe's potential for discovering known and possible novel cancer genes, including very-low-frequency mutated genes.

Results
DiSCaGe overview. DiSCaGe uses cancer mutation data (SNVs and InDels in an MAF file format) and a set of N ≥ 1 undirected and unweighted gene interaction networks (in edge lists) as input, whereas the output is a ranking of prioritized mutated cancer genes. DiSCaGe is composed of 6 steps, as illustrated in Fig. 1. In Step 1, a weighted mutation matrix (WMM) is built and assigned a real value for each patient-gene pair, according to the weight defined for the variant classification of the mutation and the number of mutated patients.
Step 2 uses WMM to obtain a mutation score for each gene, called weighted mutation frequency. Next, in Step 3, a union operation is performed on the gene interactions networks, resulting in an undirected and weighted consensus gene interaction network. In Step 4, a gene spreading strength network (GSSN) is obtained, according to the spreading strength from a gene to its direct and indirect neighbors.
Step 5 extracts a mutation influence exerted on all genes by their neighbors, based on GSSN and gene mutation scores. Finally, in Step 6, each gene mutation score is enriched with the neighbors' influence, and a sorted list of prioritized genes is obtained. All the steps are described in the Methods section. The algorithms and a running example can be found in the supplementary material.
Experimental study. The  www.nature.com/scientificreports/ jected to an enrichment process towards inferences about possible new interactions among the genes (see Gene interaction networks for further details).
Evaluation metrics. The evaluation of computational methods that identify significant mutations in cancer remains a challenging task. The lack of gold-standard databases for driver and passenger genes hampers obtaining an optimal measure of the output. Nonetheless, some gene databases are widely used and continuously updated. Towards a well-defined set of known cancer genes, the following six reliable and recent available benchmarks were considered: sets of specific drivers for each type of cancer, based on the known specific driver from IntOGen, i.e., each set contains genes related to a specific type of cancer, whose specific benchmark is essential, since many genes are important in specific types of cancer, and probably irrelevant in others 29  With these sets of known driver gene benchmarks, the following metrics were used to evaluate the proposed approach: • Precision: the fraction of prioritized genes that are known to be related to cancer. The precision of the ranking can be computed and is obtained by Precision = |PG∩D| |PG| , where PG is the set of prioritized genes, and D is the set of known driver genes. A union of the four described benchmarks was performed to obtain set D, resulting in a single set of 951 known cancer drivers, i.e., D = D NCG ∪ D CGC ∪ D IntOGen ∪ D Bailey .
• Discounted Cumulative Gain (DCG): a measure that considers the relevance of the genes and their position in the ranking, being reduced logarithmically, proportional to this position 30 . DCG p of a ranking of genes up to position p is defined as DCG p = PG p i=1 rel ct j g i log 2 (i+1) , where PG p is the ranking list of p prioritized genes and rel ct j g i is the relevance of a gene g i in cancer type ct j . Such relevance is calculated by incrementing 1 whenever g i is contained in D NCG , D CGC , D IntOGen , or D Bailey . If g i is contained in specific driver benchmark SD ct j IntOGen , then rel ct j g i is incremented by 4 (thus, specific cancer drivers have the same weight if it appears in all four general benchmarks), if it is contained in FD NCG , then rel ct j g i is decreased by 1. Finally, if g i is not in any gene set, its value is zero.
Top 30 genes on benchmarks. The top 30 prioritized genes for each type of cancer were selected to be presented in a matrix for each type, that shows the presence of prioritized genes on the benchmarks, as seen in Fig. 2. Each driver gene benchmark is presented in a row with a specific color, and genes discovered by DiSCaGe are presented in the columns; for example, NCG is the first row and has the green color. If a gene is contained in the benchmark, the matrix cell is colored; otherwise, it is white. It can be noticed how the genes are arranged on the benchmarks and show that results are consistent. Some important genes appear in the results, which are reported on most benchmarks. Few genes do not appear in any benchmark, which can be subjected to further analysis.
Comparison with related methods. In this experiment, DiSCaGe is compared with related methods MutSigCV 31 , MUFFINN 10 , and nCOP 11 , which were employed with standard parameters and configurations towards avoiding possible running influence, described as follows: (1) MutSigCV: executed with only MAF file. A difference in the preprocessing routine was assigned to MAF files for MutSigCV, in which Silent mutations were kept in MAF, as they are mandatory data for the method, which uses them to extract the background mutation rate; (2) MUFFIN: executed with the number of mutated patients for each gene as the gene mutation score (mutation occurrence data). MUFFINN has four variations, which are a combination of the approach (DNmax or DNsum) with the gene network (STRING or HumanNet). DNmax + HumanNet was selected for this experiment because it provided the best results; and (3) nCOP: executed with its preprocessed HPRD gene network with no weights for the nodes in the network. The optimal value for alpha was obtained for nCOP itself.    www.nature.com/scientificreports/ a secondary study on those genes, an automated literature review was performed using CancerMine 29 , a recent tool based on text mining. CancerMine provides information on genes in the cancer context, classifying them as drivers, oncogenes, and tumor-suppressor genes, through mining research papers. Figure 6 displays, for each type of cancer, the genes that are not in the driver benchmarks and their respective number of citations found by CancerMine (Query performed on February 2021). Most genes were cited as cancer genes at least once in the research papers, which suggests even prioritized genes that are not in driver benchmarks can be related to cancer. The remaining non-classified genes should be further evaluated and suggested as possible novel genes for their respective cancer types. Although it is a secondary study, this experiment allows comparing genes known to be drivers with genes not yet known.
To address the biological impact of the findings, we also used the Cancer Genome Interpreter (CGI) datasets to identify drug response and known drivers 32 . The CGI knowledge bases for drug predictions are represented by several publications, the CIVIC database, ASCO and ESMO abstracts, FDA guidelines, among others. The known drivers from CGI datasets represent a complete reference from literature, also including the OncoKB precision oncology knowledge base and ClinVar information about genomic variation. Such analysis showed that genes found by DiSCaGe are associated with drugs and are known drivers, reinforcing the relevance of the findings. The gene sets are presented in the supplementary material.

Discussion
This work introduced DiSCaGe, a computational method for discovering significant genes for cancer, which takes into account weighted mutations in genes and how they can affect network neighborhood through an asymmetric spreading strength measure. DiSCaGe was built to be easy to use by the end-user, demanding mutation data and gene networks as input in standard formats. Additionally, the user must define mutation weights, with no need to define unclear hyper-parameters, thus facilitating the use. An experimental evaluation was conducted with a set of known cancer genes benchmarks and an automated literature review of genes prioritized by DiSCaGe. DiSCaGe was able to (1) prioritize genes known to be related to cancer, (2) prioritize genes related to cancer with low mutation frequency, (3) suggest genes that are not in benchmarks but are cited in research papers as cancer-related ones, and (4) suggest possible novel cancer genes.
DiSCaGe presents several differences in comparison with related methods. Most methods use the simple frequency or a binary mutation matrix to compute the mutation score, such as MUFFINN 10 , nCOP 11 , and MEMo 9 . DiSCaGe employs a simple way to get weighted frequency based on the definition of weights on the impact of each type o mutation. Such impacts are user-centric, i.e., the final user can define the weights based on the objective of the analysis. DriverML 33 automatizes the definition of the impact of mutation types through a machine-learning approach, based on previous information. Related to the use of gene interaction networks, nCOP seeks to identify connected subnetworks that are significant altered across the patients, using these finding to ranking the genes, while DiSCaGe does not consider the subnetworks, only the gene neighborhood. In this way, MUFFINN is closely related to DiSCaGe, but the neighborhood influence is obtained through the maximum of the direct neighbor mutation score or the sum of the direct neighbor, divided by its degree. Also, DiSCaGe differs on using the union network to infer the spread strength from a mutated gene, thus considering it on the neighbor influence. Gene interaction networks are the only previous knowledge information that DiSCaGe employs for cancer gene prioritization. None previous information about the gene cancer significance is used. The machine-learning methods use these known cancer genes in order to discover possible novel ones. MutSigCV 31 is a frequency-based method that estimates the background mutation, while other methods, such Figure 6. A secondary study for the DiSCaGe results: Number of papers (y-axis), reported by CancerMine, that mention specific genes (x-axis) as "cancer gene". DiSCaGe prioritized the presented genes, and they are not contained in any known driver gene benchmark. www.nature.com/scientificreports/ as Dendrix 34 and WExT 35 uses only mutation data to infer a group of mutually exclusive genes. DiSCaGe does not employ such features. A limitation of the work refers to a lack of a systematic biological evaluation of the findings. Despite being a complex task, further laboratory in-vitro and in-vivo investigations can be performed to confirm the results of experiments. In this study, it was used an in silico validation using datasets from knowledge bases to assess the relevance of the findings, showing the potential of DiSCaGe to find several genes associated with drugs. However, several other genes could be evaluated as potential novel information, which is beyond the scope of this study. Also, DiSCaGe can be subjected to a pan-cancer study to be evaluated in a large number of cancer types, thus providing subsidies for its characterization and understanding of cases in which it can be appropriately used. Furthermore, DiSCage could allow the analysis of synonymous mutation. A natural extension of DiSCaGe is to allow the method to suggest possible driver pathways with the use of the final network and gene mutation score for finding significantly related genes. Finally, the development of an online tool will facilitate the use of DiSCaGe by end-users.

Methods
Step 1: Building the weighted mutation matrix. In this first step, the preprocessed MAF file is used as a source for the construction of the Weighted Mutation Matrix (WMM), in which rows are patients and columns are genes. In WMM matrix wmm, entry wmm p i g j , for each pair of patient p i in the set of patients P and gene g j in the set of genes G, a score is obtained according to its type of mutation vc (Variant_Classification from MAF input file) and in a weight w(vc) assigned for each vc. Such weights can be defined by the user in the input of the method according to the purpose of the analysis and the user's knowledge about the mutation data. However, there is a list of predefined weights in the method, which was predefined based on the functional impact of each mutation related by experts in cancer genomics.
Considering a patient p i , and all mutations in a gene g j , entry wmm p i g j is defined as where VC p i g j is the list of mutations that patient p i undergoes in gene g j , and w(vc) is the weight defined for the type of specific mutation. Such a process is performed for all patient-gene pairs. Therefore, all pairs of a mutated gene g j in a patient p i have a score wmm p i g j , which represents the importance of that mutation in that patient. The weighted frequency of mutations is used to consider the mutations and the possible functional impact of them and because sometimes it is necessary a set of mutations to initiate the cell carcinogenesis. Furthermore, the use of average avoids possible errors and noise in the sequencing data.
Step 2: Generating a mutation score for each gene. A single score for each gene is extracted from wmm, and called weighted frequency wf (g i ) , which is the sum of the gene scores for all patients, divided by the number of patients, defined as where P is the set of patients. Such a frequency is extracted for all genes, generating a set wf with weighted frequencies for all genes. The single final score is normalized by the largest value in wf, thus yielding a normalized weighted frequency nwf (g i ) , defined as Step 3: Consensus of gene interaction networks. An important component of DiSCaGe is the gene network, which significantly impacts the method result. Towards reducing the bias of choice of a single network, DiSCaGe accepts multiple networks as input, i.e., the method can be executed with one or more networks.
Each gene network GN i of the input set GN 1 , ..., GN N was treated as undirected and unweighted networks. The union operation on these networks generates an undirected and weighted network UGN. Weights on UGN interactions are the average of times an interaction occurs in each network, i.e., the original interactions are maintained, and a bigger weight is assigned to the interactions that appear more times. With this, such interactions would be more likely to exist, once they are in more networks.
Step 4: Extraction of the Gene Spreading Strength Network (GSSN). According to the local hypothesis 16 , genes (and their associated proteins) involved in a specific disease tend to interact with each other, and some mutations can influence other genes in the same pathway 36 . In this context, if a gene is mutated, such a mutation can impact its neighbors and propagate to the network.
An adapted spreading strength measure proposed by 37 was defined for quantifying the spreading strength of a mutated gene through the neighborhood in UGN. Such a measure takes into account both direct and indirect neighbors, and quantifies the spread of a mutation from a node g i to a node g j , defined as www.nature.com/scientificreports/ where r i is the sum of the edge weights of g i ; r out j is the sum of the edge weights of g j that are not edges of g i , i.e, r out j = g∈(N(g j )\N(g i )) p(g j , g) , where N(g) is the set of neighbors of g and p(g i , g j ) is the weight of edge (g i , g j ) ∈ UGN ; and p ij = p(g i , g j ) . The spreading strength is an asymmetric measure, i.e, ss(g i , g j ) = ss(g j , g i ) .
Considering the term (1 + r i × r out j ) , value 1 represents a single spreading from g i to g j and r i × r out j denotes the impact of g i through g j , taking into account their indirect neighbors which are direct neighbors of g j . At the end, such a value is tuned by the weight of the edge (g i , g j ) . The final spreading strength measure is normalized by the largest value of ss, thus obtaining a normalized spreading strength nss(g i , g j ) , defined as After the extraction of the normalized spreading strength measure for all neighbor genes, a directed and weighted network, called Gene Spreading Strength Network (GSSN), is obtained. In GSSN, directed interaction weights represent the degree of spreading at which a mutation in a gene g i can pass through a gene g j . Finally, the mutation score of each gene is assigned to the GSSN network.
Step 5: Extraction of mutation neighbors influence. The spreading strength among genes represents how much a single gene can be affected by mutations of its neighborhood, and how much it can affect its neighbors. In this step, the received influence of a mutated gene is extracted by a function r(g i ) , which represents how much influence g i receives from its neighbors, defined as where N(g i ) are direct neighbors of g i on GSSN. After the calculation of r(g i ) for all genes of GSSN, a maximum value normalization is applied on r(g i ) , as follows: Step 6: Gene mutation score enrichment based on GSSN and gene prioritization. In this step, the final mutation score of each mutated gene g is obtained, taking into account the individual mutation score of gene nwf(g) and the influence nr(g) score from its neighbors. The final mutation score ms(g i ) of a gene g i is the sum of its mutation score and its neighbor's mutations influence, given by After ms(g) has been obtained for all genes, the final ranking of prioritized mutated genes is extracted through their sorting by ms. Genes with the highest ms values are likely to be significantly mutated and related with significantly mutated neighbors.

Data and preprocessing
Cancer data. The selected cancer mutation data sets were subjected to a preprocessing routine. This is a crucial activity in cancer data analyses, since such data contain a large amount of information that can be suppressed (intron region, for example) when exome mutation data are analyzed. Furthermore, outlier samples can also be removed from the original data sets. The preprocessing routine involves the following two steps: (1) Maintenance of specific somatic variants: only the following somatic variants were kept in MAF file: 3'UTR , 5'UTR , Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Nonstop_Mutation, Splice_Site, and Translation_Start_ Site; and (2) Removal of hypermutated samples: patients are considered hypermutated when they have a much greater number of mutations than most patients in the set. Hypermutated samples should be removed, since they are usually noisy or outliers, which can bias the analyses. Among the several strategies that identify such samples, we used the one proposed by 38 , according to which a sample is hypermutated when it contains more than (Q3 + 4.5 × IQR) somatic mutations, where Q3 is the third quartile, and IQR is the interquartile range of the distribution of mutations across all data samples. If a set of hypermutated samples is identified, it is removed from the MAF file.
No specific genes were removed from the data set. For example, FLAGS (frequently mutated genes) 39 were kept in the analyses, since the aim was the evaluation of the proposed approach with all genes of the preprocessed data set.
Gene interaction networks. The following two gene networks that use protein-protein interaction (PPIs) as the main source of interactions were selected for the development of the computational approach and the experiments: 1) Reactome Functional Interactions (Reactome); and 2) Human Protein Reference Database (HPRD). Each network was treated as undirected and unweighted network G = (V , E) , where set of vertices V = {g 1 , g 2 , ..., g n } are genes and (g i , g j ) ∈ E if gene g i interacts with gene g j . ss(g i , g j ) = (1 + r i × r out j ) × p ij nss(g i , g j ) = ss(g i , g j ) max (g,g ′ )∈G×G (ss(g, g ′ )) r(g i ) = g k ∈N(g i ) nwf (g k ) × nss(g k , g i ) nr(g i ) = r(g i ) max g∈G (r(g)) ms(g i ) = nwf (g i ) + nr(g i )