Identification of disease treatment mechanisms through the multiscale interactome

Most diseases disrupt multiple proteins, and drugs treat such diseases by restoring the functions of the disrupted proteins. How drugs restore these functions, however, is often unknown as a drug’s therapeutic effects are not limited to the proteins that the drug directly targets. Here, we develop the multiscale interactome, a powerful approach to explain disease treatment. We integrate disease-perturbed proteins, drug targets, and biological functions into a multiscale interactome network. We then develop a random walk-based method that captures how drug effects propagate through a hierarchy of biological functions and physical protein-protein interactions. On three key pharmacological tasks, the multiscale interactome predicts drug-disease treatment, identifies proteins and biological functions related to treatment, and predicts genes that alter a treatment’s efficacy and adverse reactions. Our results indicate that physical interactions between proteins alone cannot explain treatment since many drugs treat diseases by affecting the biological functions disrupted by the disease rather than directly targeting disease proteins or their regulators. We provide a general framework for explaining treatment, even when drugs seem unrelated to the diseases they are recommended for.

biological functions. The diffusion profile is computed by biased random walks that start at the 108 drug or disease node. At every step, the walker can restart its walk or jump to an adjacent node 109 based on optimized edge weights. The diffusion profile r ∈ R |V | measures how often each node 110 in the multiscale interactome is visited, thus encoding the effect of the drug or disease on every 111 protein and biological function. 112 Diffusion profiles contribute three methodological advances. First, diffusion profiles provide 113 a general framework to adaptively integrate physical interactions between proteins and a hierarchy 114 of biological functions. When continuing its walk, the random walker jumps between proteins 115 and biological functions at different hierarchical levels based on optimized edge weights. These 116 edge weights encode the relative importance of different types of nodes: w drug , w disease , w protein , 117 w biological function , w higher-level biological function , w lower-level biological function . These weights are hyperparame-118 ters which we optimize when predicting the drugs that treat a given disease (Methods). For drug 119 and disease treatments, these optimized edge weights encode the knowledge that proteins and bi-  (Figure 2a, b, Methods). Note that drug-disease treatment relationships are 143 never directly encoded into our network. Instead, the multiscale interactome learns to effectively 144 predict drug-disease treatment relationships it has never previously seen. 145 Moreover, the multiscale interactome accurately models classes of drugs that rely on biolog- 146 ical functions and which molecular-scale interactome approaches thus cannot model effectively. 147 Indeed, the top overall performing drug classes (i.e., sex hormones, modulators of the genital sys-     201 We find that genetic variants alter drug efficacy and cause serious adverse reactions by tar-202 geting genes that are highly visited in the drug and disease diffusion profiles (Figure 3b). We 203 define the treatment importance of a gene according to the visitation frequency of the correspond-204 ing protein in the drug and disease diffusion profiles (Methods). Genes that alter drug efficacy and 205 cause adverse reactions exhibit substantially higher treatment importance scores than other genes 206 (median network importance = 0.912 vs. 0.513; p = 2.95 × 10 -107 , Mood's median test), indicating 207 that these treatment altering genes occur at highly visited nodes. We thus provide evidence that the 208 topological position of a gene influences its ability to alter drug efficacy or cause serious adverse 209 reactions. 210 We find that the network importance of a gene in the drug and disease diffusion profiles 211 predicts whether that gene alters drug efficacy and causes adverse reactions for that particular 212 treatment (AUROC = 0.79, Average Precision = 0.82) ( Figure 3c). Importantly, the knowledge 213 that a gene alters a given treatment is never directly encoded into our network. Instead, diffusion 214 profiles are able to predict treatment altering relationships that the multiscale interactome has never 215 previously seen. Our diffusion profiles thereby provide a systematic approach to identify genes 216 with the potential to alter treatment, with implications for patient stratification in clinical trials and 217 personalized therapeutic selection [62]. Our finding is complementary to high-resolution, temporal 218 approaches such as discrete dynamic models which model drug resistance and adverse reactions 219 by first curating genes and pathways deemed relevant to a particular treatment [26][27][28][29]. Diffusion 220 profiles may help provide candidate genes and pathways for inclusion in these detailed approaches, 221 including genes not previously expected to be relevant.

222
Finally, we find that when a gene in a diseased patient alters the efficacy of one indicated drug 223 but not another, that gene primarily targets the genes important to treatment for the resistant drug 224 (Figure 3d, e). Overall, 71.0% of the genes known to alter the efficacy of one indicated drug but not 225 another exhibit higher network importance in the altered treatments than in the unaltered treatment. 226 We thus provide a network formalism explaining how changes to genes can alter efficacy and cause 227 adverse reactions in only some drugs indicated to treat a disease.      Finally, our study shows that both physical interactions and biological functions can propa-280 gate the effects of drugs and diseases. We find that many drugs neither directly target the proteins 281 associated with the disease they treat nor target proximal proteins. Instead, these drugs affect the  Data availability. All data used in the paper, including the multiscale interactome, approved 290 drug-disease treatments, drug and disease classifications, gene expression signatures, and pharma-291 cogenomic relationships is available at github.com/snap-stanford/multiscale-interactome.

292
Code availability. Python implementation of our methodology is available at github.com/snap-293 stanford/multiscale-interactome. The code is written in Python. Please read the README for 294 information on downloading and running the code.       . By comparing these diffusion profiles, we thus predict that Rosuvastatin treats Hyperlipoproteinemia Type III, identify the HMGCR and APOE-driven cholesterol metabolic functions relevant to treatment, and predict that mutations in APOE and HMGCR may interfere with treatment and thus alter drug efficacy or cause dangerous adverse reactions.         Mood's median test; median and 95% CI shown). We define the network treatment importance of a gene according to its visitation frequency in the drug and disease diffusion profiles (Methods). (c) The treatment importance of a gene in the drug and disease diffusion profiles predicts whether that gene alters drug efficacy and causes serious adverse reactions for that particular treatment (AUROC = 0.79, Average Precision = 0.82). (d) Genes uniquely alter efficacy in one indicated drug but not another by primarily targeting the genes and biological functions used in treatment by the affected drug. In patients with Hypertensive Disease, a mutation in AGT alters the efficacy of Benazepril but not Diltiazem. Indeed, AGT exhibits a higher network importance in Benazepril treatment than in Diltiazem treatment, ranked as the 45 th most important gene rather than the 418 th most important gene. (e) Overall, 71.0% of genes known to alter efficacy in one indicated drug but not another exhibit higher network importance in treatment by the affected drug. (f) Diffusion profiles can identify biological functions that may help explain alterations in treatment. Shown are the proteins and biological functions relevant to treatment of Hypertensive Disease by Benazepril and Diltiazem. AGT, which uniquely causes resistance in Benazepril, is a key regulator of the renin-angiotensin system, a biological function harnessed by Benazepril in treatment but not by Diltiazem [70][71][72].

300
The multiscale interactome. The multiscale interactome captures how drugs use both a net-301 work of physical interactions and a rich hierarchy of biological functions to treat diseases. In    To avoid circularity in the analysis, we remove disease-gene associations marked as therapeutic. 333 Finally, we filter disease-gene relationships to only consider genes whose protein products were    result from a yeast two-hybrid system, all protein-protein interactions are physical and ex-362 perimentally verified. We thus include all protein-protein interactions across these networks.

363
Proteins are already provided with their Entrez ID so no mapping is required.    Database to only include approved treatment relationships between drugs and diseases. 441 We map drugs to DrugBank IDs by using the provided CAS and ChEBI IDs as well as Drug-  To ensure that drug-disease relationships specifically represent treatment relationships, we 448 filter drug-disease pairs based on the "indication subtype." We remove drug-indication pairs 449 where the indication subtype described is not treatment (i.e. preventative/prophylaxis, di-450 agnosis, adjunct, palliative, reduction, causes/inducing/associated, and mechanism). We ad- drug-disease relationships that represent FDA-approved treatments. 455 Finally, we remove overly broad diseases from the Drug Indication Database. We remove 456 disease categories (i.e. diseases with "Diseases" in their name such as "Cardiovascular Dis-457 eases" and "Metabolic Diseases). We also remove diseases with more than 130 approved 458 drugs (i.e. Disorder of Eye -290 approved drugs).

459
After compiling approved drug-disease treatment pairs, we remove treatments for which 460 drugs rely on binding to non-human proteins (i.e. viral or bacterial proteins) to induce their effect.     Next, we encode G and the set of scalar weights W into a biased transition matrix M ∈ 542 R |V |×|V | . Each entry M ij denotes the probability p i→j a random walker jumps from node i to node 543 j when continuing its walk. Consider a random walker at node i jumping to neighbor j of type t.

544
Let T be the set of all node types adjacent to node i. We compute p i→j in two steps. 545 1. First, we compute the probability of the random walker jumping to a node of type t rather than a node of a different type. w t is the weight of node type t as specified in W : 2. Second, we compute the probability that the random walker jumps to node j rather than to another adjacent node of type t. Let n t be the number of adjacent nodes of type t: After constructing M, we finally compute the diffusion profile through power iteration as shown in Algorithm 1. The key equation is: from node without out-edges .
At each step, the random walker can restart its walk at the drug or disease node according to 546 (1 − α)s or continue its walk. If the random walker continues its walk from a node with out-edges, 547 then it jumps to an adjacent node according to α(r (k) M). If the random walker continues its walk 548 from a node without out-edges (i.e. a sink node), then it restarts its walk according to α(s where J is the set of sink nodes in the graph. At every iteration, i r i = 1.

552
Algorithm 1 Diffusion profiles through power iteration % Initialize diffusion profile 1 > do % Start new walk at drug or disease node or continue walk.
Predicting what drugs will treat a given disease with diffusion profiles. For a drug to treat a 553 disease, it must affect proteins and biological functions similar to those disrupted by the disease.

554
The diffusion profiles of the drug r (c) and the disease r (d) encode the effect of the drug and the 555 disease on proteins and biological functions. Therefore, comparing r (c) and r (d) allows us to predict 556 what drugs treat a given disease.

557
For each drug and each disease, we compute the diffusion profile as described above. For 558 each disease, we then rank-order the drugs most likely to treat the disease based on the similarity 559 of the drug and disease diffusion profiles SIM(r (c) , r (d) ) and a series of baseline methods. 560 We test five metrics of vector similarity: . 566 We additionally test two proximity metrics. In particular, we consider the visitation fre- .

588
Additional baseline metrics of functional overlap between drug targets and disease proteins. 589 We tested 10 baseline methods that compare the functional overlap between drug targets and dis- • The Z-scored Jaccard Similarity or Intersection between the multisets of GO terms asso-609 ciated with the drug targets and the set of GO terms associated with the disease proteins: We compute reference distributions for z-scored metrics by following the approach in [10, 612 13]. Specifically, we randomly permute the set of disease proteins S and the set of drug targets T 613 to sets of proteins that match the size and degrees of the original disease proteins and drug targets 614 in the network. We then generate the GO sets and multisets that correspond to the permuted S and 615 T , compute the relevant baseline metric, and repeat this for random permutations of S and T to 616 generate a reference distribution. Finally, we compute a z-score by comparing the baseline metric 617 for the true S and T to the reference distribution.

618
Evaluating predictions of what drugs will treat a disease. We evaluate how effectively a model 619 ranks the drugs that will treat a disease by using AUROC, Average Precision, and Recall@50.

620
For each disease, a model produces a ranked list of drugs. We identify the drugs approved to  and r (d) (Figure 2b, c). We utilize these optimal weights for the multiscale interactome for all We use rank transformed gene expression signatures and diffusion profiles. We only allow the Computing treatment importance of a gene based on diffusion profiles. We define the treatment importance of gene i as the product of the visitation frequency of the corresponding protein in the drug and disease diffusion profiles. For a treatment composed of drug compound c and disease d, the treatment importance of gene i is: We define the treatment importance percentile as the percentile rank of TI(i|c, d) compared 729 to all other genes for the same drug and disease. Intuitively, gene i is important to a treatment if 730 the corresponding protein is frequently visited in both the drug and disease diffusion profiles.

731
Comparing treatment importance of treatment altering genetic mutations vs other genetic 732 mutations. We compare the treatment importance of genes known to alter a treatment with the 733 treatment importance of other genes (Figure 3b). In particular, we compare the set of (drug, disease, 734 gene) triplets where the gene is known to alter the drug-disease treatment with an equivalently sized 735 set of (drug, disease, gene) triplets where the gene is not known to alter treatment. We construct 736 the latter set by sampling drugs, diseases, and genes uniformly at random that are not known to 737 alter treatment from PharmGKB [65]. The drugs and diseases in all triplets correspond to approved 738 drug-disease pairs. Thereby, we construct a distribution of the treatment importance for "treatment 739 altering genes" and a distribution of the treatment importance for "other genes" (Figure 3b).

740
Predicting genes that alter a treatment based on treatment importance. We evaluate the abil-741 ity of treatment importance to predict the genes that will alter a given treatment (Figure 3c). For 742 each (drug, disease, gene) triplet, we use the treatment importance of the gene TI(i|c, d) to predict 743 whether the gene alters treatment or not for that drug-disease pair (i.e. binary classification). We . We assess performance using AUROC and Average Precision (Figure 3c).

747
Comparing treatment importance of genes that alter one drug indicated to treat a disease but 748 not another. We analyze how often a gene has a higher treatment importance in the treatments it 749 alters than in those it does not alter (Figure 3e).

750
Formally, let i be a gene. Consider a triplet (d, c altered , c unaltered ) of a disease d, a drug c altered approved to treat the disease whose treatment is altered due to a mutation in i, and a drug c unaltered approved to treat the disease whose treatment is not altered due to a mutation in i. Let n triplets be the total number of such triplets for gene i. For each gene i, we measure the fraction f of triplets (d, c altered , c unaltered ) for which the treatment importance of i is higher in the (c altered , d) treatment than in the (c unaltered , d) treatment, as shown below. We only consider genes for which n triplets ≥ 100.

759
For each protein pair with a common biological function, we then: 760 1. Compute the shortest path distance in the PPI network between these two proteins.  3. Using the true shortest path distance between the proteins and the random reference distribu-766 tion, we compute a z-score. The z-score captures whether the proteins with a shared function 767 are closer or further away than expected by random chance in the PPI network.