Network-based in silico drug efficacy screening

The increasing cost of drug development, together with a significant drop in the number of new drug approvals, raises the need for innovative approaches for target identification and efficacy prediction. Here, we take advantage of our increasing understanding of the network-based origins of diseases to introduce a drug-disease proximity measure that quantifies the interplay between drug targets and diseases. By correcting for the known biases of the interactome, proximity helps us uncover the therapeutic effect of drugs and distinguish palliative from effective treatments. Our analysis of 238 drugs used in 78 diseases indicates that the therapeutic effect of drugs is localized in a small network neighborhood of the disease genes and highlights efficacy issues for drugs used in Parkinson's disease and several inflammatory disorders. Finally, network-based proximity allows us to predict novel drug-disease associations that offer unprecedented opportunities for drug repurposing and the detection of adverse effects.

The low F-score is due to the positives constituting a small portion of all the drug-disease associations and the negatives including potential "positives" (repurposing opportunities or drugs worsening the disease condition), giving rise to low precision. (c) F-score versus proximity using 100 groups of randomly sampled unknown drug-disease associations as negatives. Each group contains the same number of negative instances as positive instances (known drug-disease pairs). The blue line shows the average F-score over the 100 random groupings. The balanced number of positive and negative instances yields better F-scores.
To pinpoint drug-disease associations even when the target is not a disease protein, we defined the drug-disease proximity using several network-based distance measures. We observe that the closest measure captures the drug-disease proximity better than the remaining measures, suggesting that drug targets do not necessarily have to be close to all the proteins in the disease module. Motivated by this observation, we test the performance of the network-based proximity using only (i) the disease proteins at most l steps away from a drug target (seed subset), (ii) the drug targets at most l steps away from a disease protein (target subset), or (iii) the drug target and disease protein pairs that are at most l steps away from each other (target-seed subset). Note that the seed and target subset approaches are not symmetric: given a set of drug targets T = {t_1, t_2} and a set of disease proteins S = {s_1, s_2}, the closest disease protein to the drug target t_1 may be s_1, while the closest drug target to s_1 is t_2 rather than t_1.
To restrict the distance calculation to a given distance l, we first calculate the shortest path distances between each pair of drug target (t_i) and disease protein (s_j), sort these distances and then consider only the pairs that are at most l steps apart. Through an exhaustive search of the parameter space (l ∈ {0, 1, 2, 3, 4}), we find that the AUC does not change significantly after l = 2 (Supplementary Fig. 2a). Furthermore, the AUC at l = 2 is comparable to the AUCs obtained when all disease genes or all drug targets are considered.
Indeed, the distribution of distances between drug targets and disease proteins among known drug-disease pairs shows that 90% of the drugs have a known disease protein within two steps ( Supplementary Fig. 2b). This suggests that most drugs exert their therapeutic effect on the disease proteins that are at most two steps away.
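The subset restriction described above can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study: it assumes that the closest measure is the mean, over drug targets, of the shortest path distance to the nearest disease protein, and the toy graph, the node names and the helper functions (build_graph, bfs_distances, closest_distance, within) are all hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def bfs_distances(graph, source):
    """Shortest path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closest_distance(graph, targets, seeds):
    """Assumed closest measure: mean distance from each target to its nearest seed."""
    total = 0.0
    for t in targets:
        dist = bfs_distances(graph, t)
        total += min(dist.get(s, float("inf")) for s in seeds)
    return total / len(targets)


def within(graph, nodes, others, max_steps):
    """Keep the nodes that have at least one member of `others` within max_steps."""
    kept = []
    for n in nodes:
        dist = bfs_distances(graph, n)
        if any(dist.get(o, float("inf")) <= max_steps for o in others):
            kept.append(n)
    return kept


# Toy interactome (hypothetical): t1 - a - s1 - t2 - b - s2
graph = build_graph([("t1", "a"), ("a", "s1"), ("s1", "t2"), ("t2", "b"), ("b", "s2")])
targets, seeds = ["t1", "t2"], ["s1", "s2"]

d_all = closest_distance(graph, targets, seeds)          # 1.5
seed_subset = within(graph, seeds, targets, 1)           # ['s1']
target_subset = within(graph, targets, seeds, 1)         # ['t2']
d_seed = closest_distance(graph, targets, seed_subset)   # 1.5
d_target = closest_distance(graph, target_subset, seeds) # 1.0
```

In this toy graph the closest disease protein to t1 is s1, whereas the closest drug target to s1 is t2, so the seed subset and the target subset at l = 1 give different distances (1.5 versus 1.0).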

Supplementary Note 3: Proximity and drug similarity based repurposing
Drug-drug similarity is often used to predict a novel use for a given drug. The similarity between two drugs is usually defined based on shared chemical structure [1], targets [1,2,3], functional annotations (of the targets) [1] or side effects [1,4], as well as the shortest path distance between targets in the interactome [1]. Accordingly, given two drugs X and Y with targets T_X and T_Y, we calculate:
(i) the interactome-based distance between the targets of X and Y, l(X, Y), with d(u, v) denoting the shortest path distance between proteins u and v in the interactome. Accordingly, two drugs X and Y are similar if their targets are close to each other in the interactome. For defining proximity-based similarity, we use z_c(X, Y) instead of l(X, Y).
(ii) the ratio of common drug targets of X and Y, in which each shared target t is weighted by w_t, its disease-specificity. Here w_t is the number of diseases for which a drug with target t is used; it is obtained by summing, over all the diseases D analyzed in this study, an indicator variable I_{t,i} that marks whether a drug with target t is used for disease i. That is, the similarity between drugs X and Y is based on the number and disease-specificity of their shared targets. Note that if w_t = 1 for all targets, the similarity reduces to the Jaccard index of the targets of X and Y, ignoring whether the targets are disease-specific or not.
(iii) chemical similarity between X and Y.
(iv) the ratio of GO terms shared among the targets of X and Y, where M_X and M_Y are the sets of GO molecular function terms annotated for T_X and T_Y, respectively, and w_m is the disease-specificity of each common GO term m, calculated based on the number of diseases for which m appears among the targets of the drugs used for that disease. Thus, δ_GO(X, Y) gives the functional similarity of drugs X and Y in terms of their common disease-specific molecular function GO terms. Gene annotations were downloaded from the GO web page (geneontology.org/page/downloads) in July 2013.
(v) the ratio of common side effects of X and Y, where E_X and E_Y are the known side effects of drugs X and Y, respectively, and w_e is the disease-specificity of each common side effect e, calculated based on the number of diseases for which a drug with e exists. The side effects of drugs are retrieved from the SIDER database [5]. The drugs are mapped to each other via the PubChem identifiers provided in the DrugBank and SIDER databases.
(vi) the perturbation profile similarity of X and Y, δ_LINCS(X, Y), corresponding to the ratio of common differentially regulated genes in the perturbation profiles of X and Y in the LINCS database (lincsproject.org), where P_X and P_Y are the gene sets that are differentially expressed upon perturbation by drugs X and Y, respectively. The differentially expressed 100 landmark genes (lm100) upon drug perturbations were retrieved using the LINCS API (api.lincscloud.org) in June 2014. In case of multiple perturbations for the same drug (i.e. multiple cell lines, perturbation times or dosages), the perturbations resulting in the highest similarity δ_LINCS(X, Y) are used.
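The set-based similarities in (ii), (iv), (v) and (vi) share one pattern, which the following Python sketch illustrates. It is an assumption-laden example, not the exact formula of the study: the normalization by the weighted union is assumed because it reduces to the Jaccard index when all weights equal 1, and the function names, target names and weights are hypothetical.

```python
def weighted_overlap(items_x, items_y, weight=None):
    """Weighted ratio of shared items (assumed form: weighted Jaccard index).

    `weight` maps an item to its disease-specificity; missing items get weight 1.
    """
    x, y = set(items_x), set(items_y)
    weight = weight or {}
    union = x | y
    if not union:
        return 0.0

    def w(item):
        return weight.get(item, 1.0)

    return sum(w(i) for i in x & y) / sum(w(i) for i in union)


def best_profile_similarity(profiles_x, profiles_y):
    """Highest similarity over all pairs of perturbation profiles of two drugs."""
    return max(weighted_overlap(px, py) for px in profiles_x for py in profiles_y)


# Hypothetical targets of two drugs.
targets_x = {"A", "B", "C"}
targets_y = {"B", "C", "D"}

plain = weighted_overlap(targets_x, targets_y)                # Jaccard index: 2 / 4
weighted = weighted_overlap(targets_x, targets_y, {"B": 3})   # (3 + 1) / (1 + 3 + 1 + 1)

# Hypothetical perturbation profiles (one gene set per cell line, time or dosage).
best = best_profile_similarity([{"g1", "g2"}, {"g3"}], [{"g3", "g4"}])
```

With unit weights the first call returns the plain Jaccard index (0.5), and the last call keeps the best-matching pair of profiles, mirroring the rule used for drugs with multiple perturbations.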
Although predicted side effects, drug targets or disease-disease similarity information can increase the coverage of these methods, their use is likely to have a significant impact on the prediction performance due to the limited reliability of available prediction methods. Furthermore, it is not possible to discover novel drugs whose targets have not been explored for a particular disease or to find drugs that do not have a certain (e.g., undesired) side effect because of the dependence on the existing drug and disease information. Drug-disease proximity overcomes these limitations, as it does not depend on the existing knowledge of drug-disease associations.

Supplementary Note 4: Comparing proximity to gene expression based repurposing
To identify drugs that can potentially account for the gene expression changes induced by diseases, recent studies proposed using the correlation between gene expression in the disease state and after treatment with a drug [6,7]. The premise of these studies is to find drugs whose perturbation profiles are anti-correlated with the genes perturbed in the disease, such that treatment with the drug can revert the expression changes in the disease state. That is, for instance, if a gene is over-expressed in the disease condition, the goal is to find a drug that yields the under-expression of that gene. We test this hypothesis using the Drug versus Disease (DvD) R package [8].
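The anti-correlation premise can be illustrated with a short Python sketch. It does not reproduce the scoring used by the DvD package; it simply assumes a Pearson correlation over the genes shared by a disease signature and a drug perturbation profile, and the gene names and expression values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def expression_correlation(disease_profile, drug_profile):
    """Correlate log fold changes over the genes present in both profiles."""
    genes = sorted(set(disease_profile) & set(drug_profile))
    return pearson([disease_profile[g] for g in genes],
                   [drug_profile[g] for g in genes])


# Hypothetical log fold changes in the disease state.
disease = {"G1": 2.0, "G2": 0.5, "G3": -1.5, "G4": -0.2}
# A drug that reverts every change, and one that mimics the disease.
reverting_drug = {gene: -value for gene, value in disease.items()}
mimicking_drug = dict(disease)

r_revert = expression_correlation(disease, reverting_drug)
r_mimic = expression_correlation(disease, mimicking_drug)
```

Under this assumption a strongly negative correlation marks a candidate treatment, whereas a strongly positive one marks a drug that may worsen the disease state.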

Supplementary Note 5: Robustness of drug-disease proximity threshold
To define proximal and distant drug-disease pairs, we examine the coverage of known and unknown drug-disease associations at various thresholds and choose the threshold, z_threshold, that gives both high coverage (Sensitivity) and a low false positive rate (1-Specificity), that is, the threshold at which both Sensitivity and Specificity are high. We use the ROCR package [11] to calculate the Sensitivity and Specificity values and then find the cutoff at which these values are equally high (i.e. the difference between the two values is within |∆| ≤ 1%). For the original data set used in the analysis, z_threshold = −0.15, with a Sensitivity of 59% and a Specificity of 60%.
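The threshold selection can be sketched in plain Python as follows. The study uses the ROCR R package; this example is only an approximation of the idea, assuming that a pair is called proximal when its proximity z-score is at or below the threshold and choosing the candidate with the smallest gap between Sensitivity and Specificity. The z-scores and function names are hypothetical.

```python
def sensitivity_specificity(known_z, unknown_z, threshold):
    """Sensitivity and Specificity when pairs with z <= threshold are called proximal."""
    sensitivity = sum(z <= threshold for z in known_z) / len(known_z)
    specificity = sum(z > threshold for z in unknown_z) / len(unknown_z)
    return sensitivity, specificity


def balanced_threshold(known_z, unknown_z):
    """Candidate threshold at which Sensitivity and Specificity are closest."""
    candidates = sorted(set(known_z) | set(unknown_z))

    def gap(threshold):
        sens, spec = sensitivity_specificity(known_z, unknown_z, threshold)
        return abs(sens - spec)

    return min(candidates, key=gap)


# Hypothetical proximity z-scores of known and unknown drug-disease pairs.
known_z = [-2.0, -1.0, -0.5, 0.5]
unknown_z = [-1.5, 0.0, 1.0, 2.0]

z_threshold = balanced_threshold(known_z, unknown_z)
sens, spec = sensitivity_specificity(known_z, unknown_z, z_threshold)
```

On this toy input the selected threshold is −0.5, at which Sensitivity and Specificity are both 0.75.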
We confirm that the selected interactome-based proximity threshold does not change significantly by repeating our analyses using drug-disease associations from (i) NDF-RT and (ii) KEGG. On both data sets, we find that the threshold is similar to that of the original data set (z_threshold = −0.10 for NDF-RT and z_threshold = −0.07 for KEGG). We also check the enrichment of known drug-disease pairs among proximal and distant drug-disease pairs to ensure that our findings on the relationship between proximity and a drug's therapeutic effect generalize over different data sets. Consistent with the original analysis, we find that drugs proximal to a disease are at least 2 times more likely to be effective on that disease in both data sets (Fisher's exact test, OR = 2.2, P = 4.8 × 10^−9 using NDF-RT and OR = 3.0, P = 4.8 × 10^−6 using KEGG).
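The enrichment test above operates on a 2 × 2 table of proximal or distant pairs versus known or unknown associations. The following Python sketch shows how an odds ratio and a one-sided Fisher's exact P value can be computed from such a table. The counts are invented for illustration, and the one-sided alternative is an assumption, since the text does not state which alternative was used.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio of the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact P value: probability of observing at least `a`
    known pairs among the proximal pairs under the hypergeometric null."""
    row1, col1, col2 = a + b, a + c, b + d
    total = comb(col1 + col2, row1)
    upper = min(row1, col1)
    return sum(comb(col1, k) * comb(col2, row1 - k)
               for k in range(a, upper + 1)) / total


# Hypothetical counts: rows are proximal / distant, columns are known / unknown.
proximal_known, proximal_unknown = 30, 70
distant_known, distant_unknown = 10, 90

or_value = odds_ratio(proximal_known, proximal_unknown, distant_known, distant_unknown)
p_value = fisher_enrichment_p(proximal_known, proximal_unknown,
                              distant_known, distant_unknown)
```

An odds ratio above 1 together with a small P value indicates that known drug-disease pairs are over-represented among the proximal pairs.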

Supplementary Note 6: Controlling for data quality
Data incompleteness and study bias pose substantial challenges in the systematic analysis and interpretation of biological data. Current literature provides a snapshot of drugs known to be effective in several diseases, known drug targets, disease genes and protein-protein interactions. To make sure that the drug, disease and interaction data sets used in our analysis constitute an accurate representation of the state-of-the-art, we test the performance of the drug-disease proximity measure across different data sets (Supplementary Table 2).
To evaluate the effect of the underlying network on proximity, in addition to the integrated human interactome (PPI), we use the binary human interactome compiled from high-quality yeast two-hybrid interaction detection screens and literature [12] (Lit-BM-13 and HI-II-14 at interactome.dfci.harvard.edu/H sapiens/host.php). The binary interactome covers 7,544 proteins and 24,202 interactions between them; it is thus much smaller than PPI. The AUC corresponding to the discrimination of known and unknown drug-disease pairs drops significantly, indicating that the coverage of the interactome has a significant effect on the drug-disease proximity. Though binary assays provide systematic high-quality data, their coverage is limited [13]. To counterbalance this limitation, we use a functional association network from the STRING database [14] containing interactions with a confidence score of 700 or higher. The STRING network has 16,086 proteins and 314,656 interactions, more than double the number of interactions in the PPI network. Yet, the AUC is only slightly higher than that of the binary interactome, suggesting that both the quality and the coverage of the protein interaction data have a significant impact on the proximity between drugs and diseases.
Next, we assess the effect of disease annotations on drug-disease proximity by using only disease gene information from either the OMIM database or the GWAS Catalogue. The AUC using only OMIM data is higher than the original AUC (using both OMIM and GWAS genes), whereas the AUC using only GWAS data is substantially lower. However, among the 78 diseases in the original data set, there are 43 diseases that have no associated genes in OMIM; including the GWAS data therefore increases the coverage of the diseases.
To account for the limitations of drug-target association data [15], we also use drug target information from the STITCH database [16], which integrates known and predicted drug-target associations based on evidence in the literature. For each drug, the proteins with a confidence score greater than 700 are considered to be targeted by the drug, in addition to the targets provided in DrugBank. This data set contains 2,244 distinct targets for 212 drugs. The median number of targets per drug using STITCH is significantly higher (15 targets per drug vs. 2 targets per drug using DrugBank). Nonetheless, the AUC is slightly lower, suggesting that the quality of drug-target information is at least as important as the coverage.
To make sure that the drug-disease annotations used in our analysis are of high confidence, in addition to MEDI-HPS, we collect drug-disease associations from the National Drug File - Reference Terminology (NDF-RT) [17] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) [18]. We retrieve the drug-disease associations using the NDF-RT (rxnav.nlm.nih.gov/NdfrtAPIs.html) and KEGG (rest.kegg.jp) REST APIs, respectively. In NDF-RT, a drug is considered to be indicated for a disease if and only if the drug's NDF-RT entry contains a "may treat" relationship with the disease. Similar to the drug-disease associations used in the original analysis, we filter these drug-disease associations using Metab2Mesh [19] (q-value < 1 × 10^−8). The AUC is considerably higher using drug-disease associations from KEGG, suggesting that the annotations in KEGG tend to be more reliable. Nonetheless, the number of drugs and diseases included in the analysis is significantly lower compared to the annotations from MEDI-HPS. Hence, MEDI-HPS offers a good compromise between accuracy and coverage of drug-disease associations, allowing us to analyze the largest number of drugs and diseases.
We also examine the AUC value for all diseases with one or more corresponding genes, as opposed to restricting the analysis to the diseases with at least 20 genes. As expected, the inclusion of diseases for which fewer genes are known lowers the prediction performance, yet it remains significantly higher than the random expectation. Given that the drug-disease proximity is not biased with respect to the number of disease genes, the drop in the AUC can be attributed to the diseases with fewer genes being genetically less understood. On the other hand, as several diseases used in the original analysis are broader categories involving more specific conditions, we assess the effect of excluding the broader MeSH disease categories from the analysis (e.g., liver cirrhosis is removed and liver cirrhosis biliary is kept). To do this, we identify the disease pairs that have a substantial portion of their genes in common (i.e. that have a Jaccard index higher than 0.5) and keep only the more specific MeSH term (the one lower in the MeSH hierarchy). We observe that the resulting prediction accuracy is comparable to the AUC using all the diseases.
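The exclusion of broader disease categories can be sketched in Python as follows. The gene sets and the hierarchy depths are invented for illustration, and representing the position of a term in the MeSH hierarchy by a single depth value is a simplifying assumption; the function names are hypothetical.

```python
from itertools import combinations


def jaccard(genes_a, genes_b):
    """Jaccard index of two gene sets."""
    return len(genes_a & genes_b) / len(genes_a | genes_b)


def drop_broader_terms(disease_genes, depth, cutoff=0.5):
    """Among disease pairs sharing most of their genes, drop the broader term.

    `depth` gives the (assumed) level of each disease in the hierarchy; a
    smaller value means a broader category.
    """
    dropped = set()
    for d1, d2 in combinations(disease_genes, 2):
        if jaccard(disease_genes[d1], disease_genes[d2]) > cutoff:
            dropped.add(d1 if depth[d1] < depth[d2] else d2)
    return [d for d in disease_genes if d not in dropped]


# Hypothetical gene sets and hierarchy depths.
disease_genes = {
    "liver cirrhosis": {"A", "B", "C", "D"},
    "liver cirrhosis biliary": {"A", "B", "C"},
    "asthma": {"X", "Y"},
}
depth = {"liver cirrhosis": 2, "liver cirrhosis biliary": 3, "asthma": 2}

kept = drop_broader_terms(disease_genes, depth)
```

Here the two liver cirrhosis terms have a Jaccard index of 0.75, so the broader term is removed and the more specific one is kept, while asthma is unaffected.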
In the original analysis, we assume that the known drug targets are typically the therapeutic targets (those for which the drug is intended). To check whether the analysis depends on the number of targets a drug has, we limit the analysis to those drugs that have at least three targets. In line with our expectation, the AUC does not change substantially compared to using all drugs. Similarly, to confirm that proximity can pick up drug-disease associations for drugs whose targets are not disease genes, we repeat the analysis excluding the drug-disease pairs in which all drug targets are also disease genes (d_c = 0). The AUC values are only slightly lower, suggesting that relative proximity can successfully identify indirect relationships between drugs and diseases.