A genome-wide positioning systems network algorithm for in silico drug repurposing

Recent advances in DNA/RNA sequencing have made it possible to identify new targets rapidly and to repurpose approved drugs for treating heterogeneous diseases by the ‘precise’ targeting of individualized disease modules. In this study, we develop a Genome-wide Positioning Systems network (GPSnet) algorithm for drug repurposing by specifically targeting disease modules derived from individual patient’s DNA and RNA sequencing profiles mapped to the human protein-protein interactome network. We investigate whole-exome sequencing and transcriptome profiles from ~5,000 patients across 15 cancer types from The Cancer Genome Atlas. We show that GPSnet-predicted disease modules can predict drug responses and prioritize new indications for 140 approved drugs. Importantly, we experimentally validate that an approved cardiac arrhythmia and heart failure drug, ouabain, shows potential antitumor activities in lung adenocarcinoma by uniquely targeting a HIF1α/LEO1-mediated cell metabolism pathway. In summary, GPSnet offers a network-based, in silico drug repurposing framework for more efficacious therapeutic selections.

Disease proteins are not scattered randomly in the human protein-protein interactome, but form one or several connected subgraphs that define the disease module 1 . Previous studies have suggested a literature bias in the human protein-protein interactome, with well-studied proteins often having high connectivity (degree) in literature-derived data 2 . To examine this potential bias, we utilized the systematic human protein-protein interactome identified by unbiased yeast two-hybrid (Y2H) assays (see Methods). We found that significantly mutated genes and highly mutated genes also form significant modules in this unbiased interactome (Supplementary Fig. 1), suggesting that the influence of literature bias is low.
Supplementary Fig. 1. Proof-of-concept of disease module for mutant genes derived from patient-specific DNA sequencing data in the unbiased, comprehensive human protein-protein interactome. Both significantly mutated genes (SMGs, Supplementary Data 1) identified by statistical approaches and highly mutated genes ranked by mutation frequency have the closest network distance compared to random genes by degree-control randomization in this unbiased, comprehensive human protein-protein interactome (https://ccsb.dana-farber.org/interactome-data.html).

Supplementary Note 2: Genes are highly mutated in network modules from the co-expressed protein-protein interaction network
We define a network module based on the RNA-seq data and the PPI network. For each cancer type, we computed the Pearson Correlation Coefficient ($PCC(i,j)$) for each PPI coding gene pair between gene i and gene j, and we only retained the significantly co-expressed pairs (p-value less than 0.05, F-statistic) for both tumor samples ($PCC_T(i,j)$) and normal samples ($PCC_N(i,j)$) based on RNA-seq data. We used $|PCC_T(i,j) - PCC_N(i,j)| > 0.7$ as a cutoff to select the differentially co-expressed protein-protein interactions, and defined the largest connected component 1 as the network module for the corresponding cancer type. We found that genes in the network modules identified from the RNA-seq data-based co-expressed PPI network are more likely to be mutated, as shown in Supplementary Fig. 2. Furthermore, known significantly mutated genes (SMGs) are also significantly enriched in the network modules identified from RNA-seq data-based co-expressed PPI networks across four selected cancer types (Supplementary Fig. 3). These observations support the hypothesis that highly mutated genes are more differentially co-expressed in the human interactome.
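The selection of differentially co-expressed interactions and the largest connected component can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary-based expression format, and the toy data are assumptions made for illustration, and the significance filter (p-value less than 0.05) is omitted to keep the example within the standard library.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant profile: treat as uncorrelated (assumption)
    return cov / sqrt(vx * vy)


def differential_edges(ppi_edges, tumor_expr, normal_expr, cutoff=0.7):
    """Keep PPI edges with |PCC_T(i,j) - PCC_N(i,j)| > cutoff."""
    kept = []
    for i, j in ppi_edges:
        pcc_t = pearson(tumor_expr[i], tumor_expr[j])
        pcc_n = pearson(normal_expr[i], normal_expr[j])
        if abs(pcc_t - pcc_n) > cutoff:
            kept.append((i, j))
    return kept


def largest_connected_component(edges):
    """Return the node set of the largest connected component."""
    adj = {}
    for i, j in edges:
        adj.setdefault(i, set()).add(j)
        adj.setdefault(j, set()).add(i)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:  # breadth-first search from an unvisited node
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


# Toy example with hypothetical genes A-D and four samples per condition.
tumor = {"A": [1, 2, 3, 4], "B": [1, 2, 3, 4], "C": [1, 2, 3, 4], "D": [1, 3, 2, 4]}
normal = {"A": [1, 2, 3, 4], "B": [4, 3, 2, 1], "C": [1, 2, 3, 4], "D": [1, 3, 2, 4]}
ppi = [("A", "B"), ("B", "C"), ("A", "D")]
module = largest_connected_component(differential_edges(ppi, tumor, normal))
```

In the toy data, the A-B and B-C correlations flip sign between tumor and normal samples, so those two edges pass the 0.7 cutoff, whereas A-D is identical in both conditions and is dropped.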

Supplementary Note 3: Methodology and Detailed Description of GPSnet
Here we present GPSnet, an integrated, network-based methodology for patient-specific disease module identification and in silico drug repurposing. Supplementary Fig. 4 illustrates the pipeline of the GPSnet algorithm.
Supplementary Fig. 4. A diagram illustrating the GPSnet methodology as described below.
We aim to find the hyper-mutated module for each cancer type, in which the number of mutations of the genes in the module is significantly larger than in random modules. We set the initial score of each gene i in each cancer type as $s_0(i) = n(i)/l(i)$, where $n(i)$ is the number of mutations of gene i in the corresponding cancer type, and $l(i)$ is the cDNA length of gene i. To reduce the influence of sparse somatic mutations, a network smoothing method is used to propagate the scores across the whole human protein-protein interactome network.
The random walk with restart process (RWR) is applied to calculate the smoothing gene score. Consider a random walker starting from gene i, who will move to a random neighbor with probability (1 − α) or will return to gene i with probability α at each iterative time step, where α ∈ [0, 1] is the parameter that drives the restart probability of the random walk process. The RWR process is run until a steady state is reached. We denote $\vec{F}_t$ as the score vector at iterative step t, and the resulting propagation process can be described as
$$\vec{F}_{t+1} = (1 - \alpha) W \vec{F}_t + \alpha \vec{F}_0,$$
where $W$ is the column-normalized adjacency matrix of the interactome, $\vec{F}_0$ is the vector of initial scores, and the i-th element of the steady-state vector $\vec{F}$ is the smoothing score $s(i)$ of gene i. The module is defined as a sub-graph within the network of each cancer type, and the score of the module M is
$$Z_m = \frac{\sum_{i \in M} (s(i) - \mu)}{\sqrt{m}},$$
where m is the number of genes in module M and $\mu$ is the average score over the whole gene set for the corresponding cancer type. We denote L as the set of the genes that interact with module M.
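The scoring steps described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the GPSnet code: it assumes the standard RWR update (a fraction 1 − α of each gene's score is spread evenly over its neighbors and a fraction α is reset to the initial score) and a module score equal to the sum of mean-centered smoothed scores divided by the square root of the module size. The function names, the adjacency format (a dictionary mapping each gene to its set of neighbors), and the toy network are hypothetical.

```python
from math import sqrt


def initial_scores(mutation_counts, cdna_lengths):
    """Initial score per gene: mutation count divided by cDNA length."""
    return {g: mutation_counts.get(g, 0) / cdna_lengths[g] for g in cdna_lengths}


def rwr_smooth(adj, f0, alpha=0.5, tol=1e-12, max_iter=10000):
    """Iterate F_{t+1} = (1 - alpha) * W * F_t + alpha * F_0 to a steady state.

    adj maps each gene to the set of its neighbors (undirected network);
    W is the column-normalized adjacency matrix, applied implicitly.
    """
    f = dict(f0)
    for _ in range(max_iter):
        nxt = {}
        for g in adj:
            # Each neighbor h passes on its score split evenly over its degree.
            incoming = sum(f[h] / len(adj[h]) for h in adj[g])
            nxt[g] = (1 - alpha) * incoming + alpha * f0[g]
        delta = max(abs(nxt[g] - f[g]) for g in adj)
        f = nxt
        if delta < tol:
            break
    return f


def module_score(module, scores):
    """Sum of mean-centered scores over the module, divided by sqrt(m)."""
    mu = sum(scores.values()) / len(scores)
    return sum(scores[g] - mu for g in module) / sqrt(len(module))


# Toy path network A - B - C with all initial score placed on gene A.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
smoothed = rwr_smooth(adj, {"A": 1.0, "B": 0.0, "C": 0.0}, alpha=0.5)
```

On this toy network the smoothed scores decrease with distance from gene A, which is the intended effect of smoothing: genes near frequently mutated genes receive part of their score even when they carry few mutations themselves.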
The following steps are used for the random searching process needed to generate the module.
(1) Initially, a random gene is selected as the "seed" module.
(2) For each gene $i \in L$, we calculate the connectivity significance as follows (extended from the hypergeometric distribution):
$$P(i) = \sum_{k = k_m}^{k_i} \frac{\binom{m}{k} \binom{N - m}{k_i - k}}{\binom{N}{k_i}},$$
where $k_i$ is the degree of gene i, $m$ is the number of genes in the module, $k_m$ is the number of gene i's neighbors that belong to the module, and $N$ is the total number of genes in the gene set.
(5) Steps (2)-(4) are repeated until no more genes can be added.
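The connectivity significance in step (2) can be computed with exact integer binomials. The sketch below assumes the standard hypergeometric upper tail, that is, the probability that a gene of the given degree has at least the observed number of neighbors inside the module when its links are placed at random. The function name and argument names are illustrative, not taken from the original method.

```python
from math import comb


def connectivity_pvalue(k_i, k_m, m, n_total):
    """Hypergeometric upper-tail probability for module connectivity.

    k_i: degree of the candidate gene
    k_m: number of its neighbors that belong to the module
    m: number of genes in the module
    n_total: total number of genes in the gene set
    """
    denom = comb(n_total, k_i)
    # comb(m, k) is 0 when k > m, so terms beyond the module size vanish.
    tail = sum(comb(m, k) * comb(n_total - m, k_i - k) for k in range(k_m, k_i + 1))
    return tail / denom


# Example: a gene with degree 2 whose two neighbors both lie in a 2-gene
# module, in a 4-gene network: 1 of the 6 possible neighbor pairs qualifies.
p = connectivity_pvalue(k_i=2, k_m=2, m=2, n_total=4)
```

A small value indicates that the candidate gene is connected to the module more densely than expected by chance, which makes it a plausible candidate for addition during module expansion.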
Repeating the above steps, we obtain a set of modules. We rank the modules in descending order of their final score. The confidence of each gene is calculated as the number of times that the gene appears in the top 1 percent of modules. Genes are then sorted in descending order of the confidence score, and the top L genes are considered as the final consensus module (Supplementary Data 2).
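The consensus step can be sketched as follows. This is a minimal illustration under stated assumptions: each module is represented as a hypothetical (gene set, score) pair, the top 1 percent is taken by integer division with a minimum of one module, and ties in confidence are broken alphabetically. None of these details are specified in the text.

```python
from collections import Counter


def consensus_module(modules, top_percent=1, size=None):
    """Rank modules by score, count gene occurrences in the top fraction,
    and return (genes sorted by confidence, confidence counts).

    modules: list of (gene_set, score) pairs from repeated random searches
    top_percent: percentage of top-scoring modules to keep
    size: number of genes in the consensus module (None keeps all)
    """
    ranked = sorted(modules, key=lambda pair: pair[1], reverse=True)
    n_top = max(1, len(ranked) * top_percent // 100)
    counts = Counter(g for genes, _ in ranked[:n_top] for g in set(genes))
    # Sort by descending confidence; alphabetical tie-break is an assumption.
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    return ordered[:size], counts


# Toy example: 200 hypothetical modules, so the top 1 percent holds 2 modules.
modules = [({"A", "B"}, 10.0), ({"A", "C"}, 9.0)] + [({"D"}, 1.0)] * 198
genes, confidence = consensus_module(modules, top_percent=1, size=2)
```

Here gene A appears in both top-ranked modules and therefore receives the highest confidence, whereas gene D never enters the top 1 percent despite appearing in most modules.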
Supplementary Fig. 5. Network plot of the number of cancer-specific and shared genes between the disease modules across 15 cancer types. The numbers in the cyan circles denote the numbers of common (shared or overlapped) genes among the corresponding cancer types. The purple circles represent the number of unique genes for specific cancer types.
Supplementary Fig. 6. Known cancer driver genes (named significantly mutated genes, Supplementary Data 1) are appreciably enriched in cancer type-specific disease modules across 15 cancer types. Disease modules were identified by GPSnet when α = 0.5 to balance the degree bias. The significantly mutated genes were collected from TCGA projects as described in previous studies 3,4 .
Supplementary Fig. 7. Known cancer-associated genes (Supplementary Data 3) are appreciably enriched in cancer type-specific disease modules across 14 cancer types. Disease modules were identified by GPSnet when α = 0.5 to balance the degree bias. HNSC was excluded for validation owing to lack of known cancer-associated genes from publicly available databases. Known cancer-associated genes were collected from four public databases: the Online Mendelian Inheritance in Man (OMIM) database 5 , HuGE Navigator 6 , PharmGKB 7 , and Comparative Toxicogenomics Database (CTD) 8 , as described in our recent study 9 .
Supplementary … Pan-cancer … are more likely to be targeted by approved drugs compared to driver genes or CGC gene products alone. We identified significant neighbors of proteins from the human protein-protein interactome via the DIAMOnD algorithm 10 . Herein, we tested two types of drug targets: (a) drug-protein binding affinity (IC50) less than 1 µM, and (b) drug-protein binding affinity (IC50) less than 10 µM. Supplementary