Network neighbors of viral targets and differentially expressed genes in COVID-19 are drug target candidates

The COVID-19 pandemic is raging. It revealed the importance of rapid scientific advancement towards understanding and treating new diseases. To address this challenge, we adapt an explainable artificial intelligence algorithm for data fusion and utilize it on new omics data on viral–host interactions, human protein interactions, and drugs to better understand SARS-CoV-2 infection mechanisms and predict new drug–target interactions for COVID-19. We discover that in the human interactome, the human proteins targeted by SARS-CoV-2 proteins and the genes that are differentially expressed after the infection have common neighbors central in the interactome that may be key to the disease mechanisms. We uncover 185 new drug–target interactions targeting 49 of these key genes and suggest re-purposing of 149 FDA-approved drugs, including drugs targeting VEGF and nitric oxide signaling, whose pathways coincide with the observed COVID-19 symptoms. Our integrative methodology is universal and can enable insight into this and other serious diseases.

randomly generated clusters (p-value ≤ 0.01), confirming that the joint decomposition of VHIs and DTIs successfully extracts meaningful information from these data and is capable of predicting novel drug-target relations.

The data fusion framework can predict unseen DTIs
To validate that the s_A scores in the reconstructed matrix can predict unseen DTIs, we perform a 10-fold cross-validation with stratified folds (i.e., folds that preserve the percentage of samples of each class). We use the input DTIs (i.e., those present in DrugBank) as ground truth. As shown in Figure S3, the PR-AUC and the ROC-AUC are lower for the validation set than for the training set, as expected (Training: PR-AUC = 0.694 ± 0.004 and ROC-AUC = 0.996 ± 0.001; Validation: PR-AUC = 0.332 ± 0.014 and ROC-AUC = 0.847 ± 0.015; AUCs are reported as the mean and standard deviation over the 10 folds). However, as shown in Figure S3b, the precision of the method remains high in the validation set at high thresholds. In fact, the chosen threshold (s_A = 0.296) yields a precision above 0.8 on the validation set in every fold.
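As a minimal sketch of the stratified splitting step only, the following Python code assigns drug-gene pairs to folds so that each fold keeps the same proportion of known DTIs. The function name `stratified_folds`, the toy labels and the random seed are illustrative assumptions. How the held-out entries are masked and the factorization re-run is not shown here.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into n_folds lists so that every fold keeps
    roughly the same class proportions as the full data set."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices to folds round-robin.
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


# Toy labels (assumption): 1 = known DTI, 0 = pair with no known interaction.
labels = [1] * 50 + [0] * 950
folds = stratified_folds(labels)

for fold_id, fold in enumerate(folds):
    positives = sum(labels[i] for i in fold)
    print(fold_id, len(fold), positives)
```

Each fold can then serve once as the validation set while the remaining nine form the training set.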

The holistic view of the human interactome uncovers additional DTIs
To assess the improvement gained by using the holistic view of the relationships between genes (i.e., the MIN) instead of the PPI network alone, we applied the same framework using only the PPI network as the relation between genes that must be conserved during the data fusion process.
After obtaining the factor matrices, we cluster the genes and drugs by applying hard clustering to the corresponding matrix factors, G_2 and G_3, respectively (for more details, see "Extracting clusters of genes and drug" in Methods). As a validation step, we verify that the framework captures the functional relationships between genes (as captured by Gene Ontology (GO) annotations) and between drugs (as captured by DrugBank "Drug Category" (DC) annotations) by performing an enrichment analysis on the gene and drug clusters obtained by the framework (for more details, see "Enrichment analysis of gene and drug clusters" in Methods). As shown in Figure S4, more than 80% of the gene clusters have GO term enrichments for the three GO domains (i.e., Biological Process, Cellular Component and Molecular Function), while, out of all annotated genes, at least 15% have at least one of their annotations enriched in their cluster. Similarly, 90% of the drug clusters have DC enrichments (Figure S4). These results are very similar to those obtained when using the MIN (Figure 2 in the main document). Thus, the fusion framework successfully captures meaningful information encoded in the network whether the MIN or the PPI network is used as the gene relation that must be conserved.
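The enrichment step can be illustrated with a small sketch of a one-sided hypergeometric test, which asks how likely it is to see at least the observed number of annotated genes (or drugs) in a cluster by chance. The function name, argument names and toy numbers are assumptions made for illustration. Any multiple-testing correction or significance cut-off used in the actual analysis is not reproduced here.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, cluster_size, hits):
    """Return P(X >= hits), where X counts annotated items in a cluster of
    size cluster_size drawn without replacement from a population in which
    `annotated` items carry the annotation."""
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy example (assumed numbers): 1000 annotated genes overall, 20 of them
# carry a given GO term, and a cluster of 50 genes contains 5 of those 20.
p_value = hypergeom_enrichment_pvalue(
    population=1000, annotated=20, cluster_size=50, hits=5
)
print(p_value)
```

An annotation would be called enriched in the cluster when this probability falls below the chosen significance level.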
To predict new DTIs, we use the matrix completion property to reconstruct the DTI matrix. Each entry of the reconstructed matrix contains an association score, s_A, corresponding to a drug-gene pair. This score can be interpreted as a relative measure of confidence in each drug-gene association (for more details, see section "Prediction of new drug-target interactions for drug re-purposing" in Methods). We then verify that the score s_A separates DTIs from non-interacting pairs by performing precision-recall (PR) and receiver operating characteristic (ROC) curve analyses using all the input DTIs as ground truth. As shown in Supplementary Figure S5, these PR and ROC curves are very similar when using the MIN or the PPI, with the same ROC-AUC (ROC-AUC = 0.997) and almost identical PR-AUC (PR-AUC = 0.704 for the PPI; PR-AUC = 0.696 for the MIN). In addition, we show that the s_A score can predict unseen DTIs by using 10-fold cross-validation. As shown in Figure S6, the PR-AUC and the ROC-AUC are lower for the validation set than for the training set, as expected (Training: PR-AUC = 0.698 ± 0.006 and ROC-AUC = 0.996 ± 0.001; Validation: PR-AUC = 0.326 ± 0.014 and ROC-AUC = 0.845 ± 0.009; AUCs are reported as the mean and standard deviation over the 10 folds). Finally, to predict new DTIs, we define an optimal threshold on s_A using the F1-score and then consider the false positives (i.e., drug-gene pairs that score above the threshold but are not among the input DTIs) as predicted DTIs. The best F1-score (F1 = 0.733) is associated with a threshold of s_A = 0.340, yielding 533 newly predicted DTIs with 399 drugs targeting 131 genes (Supplementary Table S3).
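The threshold selection described above can be sketched as follows. The code scans candidate thresholds on the association score, keeps the one with the best F1-score against the known DTIs, and reports the above-threshold pairs that are not known DTIs as new predictions. The function names, the toy scores and the use of "greater than or equal to" at the threshold are assumptions for illustration, not the exact procedure of the framework.

```python
def best_f1_threshold(scores, is_known):
    """Try every observed score as a threshold and return the
    (threshold, f1) pair that maximizes F1, with known DTIs as positives."""
    total_pos = sum(is_known)
    best_thr, best_f1 = None, -1.0
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, is_known) if s >= thr and y)
        fp = sum(1 for s, y in zip(scores, is_known) if s >= thr and not y)
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / total_pos
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1


def predicted_new_pairs(pairs, scores, is_known, threshold):
    """Return pairs scoring at or above the threshold that are not known DTIs."""
    return [p for p, s, y in zip(pairs, scores, is_known)
            if s >= threshold and not y]


# Toy data (assumed): six drug-gene pairs with reconstructed scores.
pairs = [("d1", "g1"), ("d1", "g2"), ("d2", "g1"),
         ("d2", "g3"), ("d3", "g2"), ("d3", "g3")]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
is_known = [True, True, False, True, False, False]

threshold, f1 = best_f1_threshold(scores, is_known)
new_dtis = predicted_new_pairs(pairs, scores, is_known, threshold)
print(threshold, f1, new_dtis)
```

In this toy case the only new prediction is the unknown pair that scores above the selected threshold.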
The lists of DTIs predicted using the MIN and using the PPI overlap in 500 DTIs (see Supplementary Figure S5), meaning that only 61.43% of the DTIs predicted using the MIN were also predicted when using the PPI alone. Moreover, only 33 of the 533 DTIs predicted using the PPI were not predicted using the MIN (23 involving FDA-approved drugs and 10 involving experimental ones). These 33 DTIs also have small association scores (i.e., they are at the bottom of the list). Therefore, we obtain more putative DTIs by enforcing that the framework preserves not only the PPIs between genes but also the GIs and MIs.
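The comparison of the two prediction lists amounts to simple set operations, sketched below. The function name `compare_predictions` and the toy drug-gene pairs are hypothetical and stand in for the actual prediction lists.

```python
def compare_predictions(min_dtis, ppi_dtis):
    """Summarize the overlap between two sets of predicted (drug, gene) pairs."""
    shared = min_dtis & ppi_dtis
    pct = 100.0 * len(shared) / len(min_dtis) if min_dtis else 0.0
    return {
        "shared": len(shared),
        "min_only": len(min_dtis - ppi_dtis),
        "ppi_only": len(ppi_dtis - min_dtis),
        "pct_of_min_recovered_by_ppi": pct,
    }


# Toy sets (assumed): predictions obtained with the MIN and with the PPI only.
min_dtis = {("d1", "g1"), ("d1", "g2"), ("d2", "g3"), ("d3", "g4")}
ppi_dtis = {("d1", "g1"), ("d2", "g3"), ("d4", "g5")}

summary = compare_predictions(min_dtis, ppi_dtis)
print(summary)
```

With the counts reported above, the 500 shared and 33 PPI-only DTIs together account for all 533 PPI-based predictions.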

Supplementary Figures
Supplementary Figure S1. Comparison between the molecular interaction network (MIN) and its constituent networks: protein-protein interaction (PPI), genetic interaction (GI) and metabolic interaction (MI) networks. (a-b) Overlap of the genes and interactions of the constituent networks, respectively. (c) GDV signature for the constituent networks and the MIN; counts (on the vertical axis) of the orbits (denoted by 0 to 14 on the horizontal axis).

Supplementary Figure S2. Enrichment analysis for assessing the functional relevance of the gene and drug clusters obtained by the framework. The gene clusters are analyzed by using GO term annotations for the three domains: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC); and the drug clusters are analyzed by using "Drug Categories" (DC) from DrugBank (horizontal axis). The probability that an annotation is enriched in a cluster was computed using a hypergeometric test. Then, we computed three percentages: out of the total number of clusters of genes (drugs), the percentage that have GO terms (Drug Categories) enrichments (in blue); in all clusters of genes (drugs) taken together, the percentage of all leaf GO terms (Drug Categories) in them that are enriched in at least one cluster (in red); and in all clusters of genes (drugs) taken together, the percentage of all genes (drugs) in them out of all human genes (drugs) in the network that have at least one of their annotations enriched in their clusters (in purple).

Supplementary Figure S4. Enrichment analysis for assessing the functional relevance of the gene and drug clusters obtained by the framework when using only the PPI. The gene clusters are analyzed by using GO term annotations for the three domains: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC); and the drug clusters are analyzed by using "Drug Categories" (DC) from DrugBank (horizontal axis). The probability that an annotation is enriched in a cluster was computed using a hypergeometric test. Then, we computed three percentages: out of the total number of clusters of genes (drugs), the percentage that have GO terms (Drug Categories) enrichments (in blue); in all clusters of genes (drugs) taken together, the percentage of all leaf GO terms (Drug Categories) in them that are enriched in at least one cluster (in red); and in all clusters of genes (drugs) taken together, the percentage of all genes (drugs) in them out of all human genes (drugs) in the network that have at least one of their annotations enriched in their clusters (in purple).

Supplementary Figure S5. Prediction of new DTIs using only the PPI as the relation between genes. (a-b) Comparison of the Precision-Recall (PR) and Receiver Operating Characteristic (ROC) curves for the framework using the MIN or only the PPI as the relation between genes. AUC: area under the curve. (c) Distribution of the association scores of the reconstructed matrix when only using the PPI as the relation between the genes, for the original DTIs (orange) and for new drug-gene pairs obtained due to the matrix completion property of GNMTF (blue). New drug-gene pairs on the right side of the threshold (dashed line) were considered to be newly predicted DTIs.

Supplementary Figure S7. Mean of the three dispersion coefficients (ρ_1, ρ_2, ρ_3) for all the values explored for choosing the parameters k_1, k_2 and k_3. Coefficients for k_1 = 3 are on the left and for k_1 = 5 on the right. The most stable clustering was achieved with k_1 = 3, k_2 = 120 and k_3 = 80 (mean of ρ_1, ρ_2, ρ_3 = 0.661).

Supplementary Figure S8. Illustration of graphlets with up to 4 nodes and their 15 automorphism orbits. The ten non-redundant orbits, whose counts cannot be derived from the counts of the other orbits, are highlighted in red.

Tables S2, S3, S4, S5, S6, S7, S8, S9, S10 and S11 are provided as comma-separated values (csv) files due to their length.

Table S1. Network properties of the molecular interaction network (MIN) and its constituent networks: protein-protein interaction (PPI) network, genetic interaction (GI) network and metabolic interaction (MI) network. The four networks are compared by the following commonly used network properties: four centrality measures (degree, eigenvector, betweenness and closeness centrality) and clustering coefficient. Note: this table is also provided as a csv file.