A novel graph mining approach to predict and evaluate food-drug interactions

Food-drug interactions (FDIs) arise when nutritional dietary consumption regulates biochemical mechanisms involved in drug metabolism. This study proposes FDMine, a novel systematic framework that models the FDI problem as a homogenous graph. Our dataset consists of 788 unique approved small molecule drugs with metabolism-related drug-drug interactions and 320 unique food items, composed of 563 unique compounds. The potential number of interactions is 87,192 and 92,143 for disjoint and joint versions of the graph. We defined several similarity subnetworks comprising food-drug similarity, drug-drug similarity, and food-food similarity networks. A unique part of the graph involves encoding the food composition as a set of nodes and calculating a content contribution score. To predict new FDIs, we considered several link prediction algorithms and various performance metrics, including the precision@top (top 1%, 2%, and 5%) of the newly predicted links. The shortest path-based method has achieved a precision of 84%, 60% and 40% for the top 1%, 2% and 5% of FDIs identified, respectively. We validated the top FDIs predicted using FDMine to demonstrate its applicability, and we relate therapeutic anti-inflammatory effects of food items informed by FDIs. FDMine is publicly available to support clinicians and researchers.


Methods
Databases and datasets preparation. DrugBank. We used the DrugBank (v 5.1.7) database, which contains substantial drug target information (i.e., sequence, pathway, and structure) as well as specific information for each drug (i.e., chemical, medicinal, and pharmacological data) [32][33][34] . A total of 13,680 distinct drugs are represented in the database. Approved, experimental, investigational, nutraceutical, and withdrawn are the five categories of drugs in DrugBank. Drugs can be differentiated as small molecules or biotechnology-driven. The database provides access to the SMILE strings of the drug molecules and reports drug-drug interactions 33 .
In this study, we considered the drugs assigned to the approved drug group and have small molecules. This resulted in 1683 drugs. We further reduced this set of molecules by considering only those having "metabolism (increase or decrease)" related interactions, resulting in 788 unique approved small molecule drugs. FDI interactions are mainly detected with metabolic mechanisms 14 . The details of the drug extraction procedure from the DrugBank dataset can be found in Supplementary file 1: Fig. S1.
FooDB. We used the JSON formatted 30,31 FooDB version 1.0 dataset, which includes numerous records on foods, compounds, content, nutrients, and health effects. We used the FooDB content dataset, which directly matched foods to food compound composition. Then, we cleaned the retrieved data by discarding the list of unknown data elements using the criteria "DATABASE" as the "citation type" and "COMPOUND" as "source type". This provides a more accurate source of information. Finally, we only considered the food items mapped to a compound, resulting in 16,230 objects for further analysis.
After the parsing step, we mapped the resulting dataset with the "Compound" information to collect the required details for each compound, including the SMILE description and the content contribution. In FoodDB, the content range of each compound within a food item is presented (e.g., Strawberry has a content range of Potassium of 0.000-187.000 mg/100 g). Finally, we have the SMILE description of the corresponding compounds along with the contribution of each food compound.
We filtered data from the health effects dataset to find food compounds with reported health impacts on the human body. The resulting dataset contains 8846 objects, including 320 unique foods, and 563 unique food compounds having 179 unique health effects. Benzoic acid from American cranberry having an allergenic health effect is one example food compound with reported health impacts on the human body.
Since the same compounds can be found in different foods, it is necessary to store these data with a naming convention that allows us to correctly differentiate each food with its composition. In this study, we used the following naming convention: FOODXXXX_FDBXXXXX_CompoundName. For example, the data entries "FOOD00005_ FDB000633_ Kaempferol" and "FOOD00008_ FDB000633_ Kaempferol" refer to the same compound Kaempferol with the compound identifier FDB000633 from two different foods (FOOD00005 and FOOD00008). Each compound can be treated differently based on the reported content range in the food item.
A diagram illustrating the data preparation steps from FooDB database can be found in Supplementary file 1: Fig. S2. DrugBank (v 5.1.7) database and FooDB version 1.0 dataset are publicly available. All the methods were performed in accordance with the relevant guidelines and regulations.
Food composition and compound contribution. Each food item is composed of a set of chemical compounds. Clearly, the "amount of the original content" of any compound is not the same for each food. For example, the amount of the phytic acid in carrot is 5270.000 ml/100 g, and buckwheat is 1800.000 ml/100 g. Carrot contains approximately three times more phytic acid than buckwheat by mass. Therefore, the contribution of the phytic acid is different for carrot and buckwheat. Consequently, we used the following equation to calculate the contribution of each compound for each food based on the amount contained in the food: The range of the normalized contribution is from 0 to 1. Where a 0 and 1 contribution refer to a food compound with no contribution or full contribution, respectively. Assuming the amount of a compound in a food is given in mg/100 g and all compounds that a food is comprised of are considered, the original content of all compounds should always be 100 g/100 g = 1. Yet, some of the compounds might be excluded if some relevant information is missing from FoodDB (e.g., missing content value or missing health effect information or missing reference database). Hence, normalization is necessary for computing the contribution score.
In the graph, the food item and its compound composition are represented as separate nodes. The normalized contribution score scales edge weights of links connecting compounds to the food item. More details and an example on the contribution score of food compounds is provided in Supplementary file 1: Table S1.
Homogenous graph representation. We consider a set of food compounds, F = f 1 , f 2 , . . . , f m and a set of drugs, D = {d 1 , d 2 , . . . , d n } with m representing the number of food compounds and n representing the number of drugs included. We merged all drugs and food compounds into a single graph. So, in our representation, we have a set of drug and food compounds FD = f 1 , f 2 , . . . , f m , d 1 , d 2 , . . . , d n . Then, we considered the set of m * n dimensional structure similarity matrices between drugs, food compounds, and food-drug pairings. The strength of similarity is measured by a score between [0, 1]. A similarity score close to 0 implies that two items are not identical to each other, where the most similar items are represented by a similarity score close to 1. Using this similarity assessment, we derived a homogenous graph. For this homogenous graph, we will apply different path categories and neighborhood-based similarity-based algorithms to predict the novel FDIs.
(1) Contribution score (normalized) = Compound original content ∈ a food item Total original content of all compounds ∈ a food  35,36 . The Tanimoto coefficient is defined as the number of common chemical fingerprints compared to all chemical fingerprints of the two drugs. Chemical fingerprints for each drug were calculated using Morgan/Circular fingerprints 37 (also known as extendedconnectivity fingerprint ECFP4 38 ) that is widely used. ECFP4 demonstrated the best performing fingerprints in the target prediction benchmarks 39,40 and in small molecule virtual screening 41 . The calculating procedure of the SSP can be found in Supplementary file 1: Fig S3. Sparse matrix representation. We used the similarity profile to derive a sparse matrix representation, which is used for plotting the graphs. In this matrix, we eliminated the zero entries and applied a threshold since some similarity scores contain trivial values and thus may not be indicative of significant changes. For determining the threshold, we have considered the similarity score distribution. Most similarity values lie between 0.3 and 0.6, hence, selecting a high similarity value may drastically change the dataset size. Also, a high threshold will always lead to potential pairs having increased probability of interaction. Several studies have referred to different values in the range of 0.5-0.85 for applying a similarity threshold for the Tanimoto coefficient [42][43][44] . While a higher threshold reduces the probability of reporting spurious findings, it can limit the number of genuinely novel predictions. Table 1 highlights the number of links of each subnetwork after applying a range of similarity thresholds. Compared to a threshold of 0.6, a value of 0.7 would result in 75% fewer possible FDIs, thus, we selected 0.6 at this step. It should be noted that this parameter is provided as an input argument for the user of FDMine.
Updating similarity scores using food-compound contribution. We obtained a total of 4,177,383 similarities using the SSP. Then, we multiplied the similarity score by the normalized contribution of the food compound (Eq. 1). As illustrated in Supplementary file 1: Table S2, when we have a food-drug pair (see row 1), we multiply the similarity score by the contribution of the food compound. Similarly, we multiplied the similarity score by the higher contribution of the food compound. For example, the contribution of the FOOD00006_ FDB000474_ L-Lysine is 0.007301117, and the FOOD00006_ FDB000556_ L-Alanine is 0.009780473. So, we have considered the maximum value of 0.009780473 to update the similarity score. For drug pairs, similarity scores were preserved.
After updating the similarity scores in the graph, we consider another threshold using the contribution score. Here, we consider a more relaxed range of thresholds (0.3, 0.4, 0.5 and 0.6) applied to the Tanimoto coefficient. In our literature validation, we prepare and discuss another batch of results using a similarity score of 0.3, though a value of 0.5 has been employed to generate our primary findings. For a threshold of 0.5, we ended up with 87,192 interactions and 92,143 for the disjoint and joint datasets respectively. In Supplementary file 1: Table 3lists the number of interactions for the considered range.

Link prediction algorithms.
After applying the similarity thresholds, the generated graph had several disjoint subgraphs. We call this the disjoint version. Some link prediction algorithms like Adamic and Adar Coefficient (AA), as well as Common Neighbour (CN), cannot handle the disjoint version, therefore, we prepared a joint graph. We chose any node (randomly) from each subgraph and added an edge to link all subgraphs to make the joint graph network. Then, a very small edge weight of 1e-5 was assigned to the newly added links (see Table S22 for results using an edge weight of 1), limiting their effect on generating biased hypotheses. We generated results for both versions. A detailed description is available in Supplementary file 1: Fig S4. Path category-based algorithm. Our goal is to use the created homogenous graph to predict novel (i.e., unknown) FDIs. A homogenous graph is one where all nodes are of the same type. Nodes in our graph are chemicals, which discerns this approach from DTI heterogenous graphs (e.g., drug-protein). One class of algorithms runs the shortest path to find candidate interactions for the considered food and drug pair. Here, we employed 2-length and 3-length pathways. For example, a 2-length path is "Drug1-Food1-Food2" (see Supplementary file 1: Fig S5) connects the Drug1 node with the Food2 node through the similarity between "Drug1 and Food1" and "Food1 and Food2". This is defined as a D-F-F path. As illustrated in Supplementary file 1: Fig S5,  Any path can be taken to predict novel interactions. The same applies for 3-length pathway prediction. For example, we can get another new link using 3-path length (Food-Food-Drug-Food). The score for the newly predicted link is calculated according to Eq. (3), where P is the path, n is the total number of paths, w is the weight of the path and P w is the path weight: Dijkstra's algorithm was used for finding the shortest path between all pairs of nodes in the graph. For each pair of FDI, the Dijkstra algorithm will find the path with the smallest values (i.e., lower similarity scores). In order to compensate for this issue, we rank the final list in descending order, giving preference for interactions with higher similarity. For example, given Dijkstra score(DrugA-FoodX) = 1.12 and Dijkstra score(DrugB-FoodY) = 2, the preference will be assigned to DrugB-FoodY. Another approach to handle the challenge of Dijkstra's algorithm reporting the shortest path, would have been to invert the similarity scores (so that they represent dissimilarities) and then use a standard ascending rank. After applying Dijkstra's algorithm, we performed filtering with a path length of 2 or 3, and considered the results as possible interactions.
Neighbourhood-based similarity-based link prediction. In the link prediction, given a graph G , the main aim is to predict new edges (drug-food) from the existing graph. Predictions are useful to suggest unknown relations (or interactions) based on edges in the observed graph. In the link prediction, we build a similarity measure between pairs of nodes and link the most similar nodes. Link prediction algorithms are widespread in many application domains such as, identifying protein-protein interactions 45 , drug-drug interactions 29 , DTIs 28 , social networks 46 , reconstructing networks 47 , document recommendation, recommendation systems 48 , biological networks 49 , disease prediction 50 , bipartite networks 51 , etc.
Adamic and adar coefficient (AA). The Adamic and Adar Coefficient (AA) gives preference to node pairs with more common neighbors but with a lower degree. If there are no common neighbors for a node pair, then the AA score is 0. The AA measure is formulated to connect node pairs that have common neighbors.
Here, a and b are two nodes, and z denotes a common neighbor to both a and b . k is the degree of node z.

Common neighbor (CN).
In the Common Neighbor (CN) algorithm, the score for link prediction is computed by finding the number of common neighbors between two distinct nodes. Where, a and b are two nodes. Γ(a ) and Γ(b ) denote the set of neighbors of nodes a and b , respectively.
Jaccard coefficient (JAC). The JAC measure considers only node pairs that have at least one common neighbor. The JAC measure gives equal weight to all common neighbors and does not consider the degree of the common neighbors. The JAC measure gives preferences to node pairs that share a larger fraction of their neighbor. The JAC measure always ranges from 0 to 1 irrespective of the size of the neighborhoods of the vertices. The formula is given below to calculate the JAC. Γ(a ) and Γ(b ) denote the set of neighbors of nodes a and b , respectively.
Resource allocation (RA). Resource Allocation (RA) calculates the score based on irregular nodes connecting node a and b . The number of resources node a receives from node b through indirect links is called the similarity index. In the RA, each intermediate link contributes a unit of the resource. The RA is also symmetric. z denotes a common neighbor of both a and b nodes and k denotes the degree of node z. www.nature.com/scientificreports/ Multiple paths of length L = 3 (L3). Links of high degree nodes prompt multiple and unspecific shortcuts in the network, resulting in biased predictions. This can be avoided by using a proper degree of normalization. Such a degree of normalization is significant for L3. To eliminate potential degree biases caused by lower degree nodes, we assign a degree normalized L3 score to each node pair a and b . Here, u and v are intermediate nodes in the 3-length path, A au , A uv , A vb are the link weight, K u , K v are the degree of node u and v respectively.
Dice coefficient. Dice coefficient is similar to the Jaccard Coefficient (JAC). The Dice coefficient is calculated using Eq. (9), where, a and b are two nodes and Γ(a ) and Γ(b ) denote the set of neighbors of nodes a and b , respectively.
Performance evaluation. To measure the performance of applied link prediction approaches, we adopted the idea of precision@k 57,58 or top k predictive rate 53,59 . This metric is also known as r-precision [60][61][62][63] . precision@k is the recommended measure for link prediction algorithms 64 . It refers to the percentage of true positives among only the top k ranked predicted links. Given the ranked output of the graph, we need to evaluate the ranking precision of the methods. Following 26 , we chose the top 1%, 2%, and 5% as the value of k . In general, the area under the receiver operating characteristic curve (AUROC) or (AUC) is used to evaluate the performance of classification models. Nevertheless, recent studies have shown that AUROC is unsuitable for evaluating the performance of link prediction algorithms 55,[65][66][67] . Another statistical measure is the area under the precision-recall curve (PRC), which potentially provides a more robust assessment, especially when dealing with imbalanced datasets 68 . In this study, we used precision@top, AUC, and PRC as performance metrics.
In order to compute some of the measures, we had to derive true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). To perform this, we ranked the predicted links in descending order based on the rank score given by the link prediction methods. Then, we considered several thresholds as cutoff values. The starting threshold is the minimum score given by the link prediction methods. Then we increase by a step size of 0.1, which was selected to ensure sufficient granularity in computing the area under the curve. We repeated this step until the threshold value was the same as the maximum score given by the link prediction algorithm. For each specific threshold score, if we found that the known link in the test dataset matched with the newly predicted link, and the score is greater than the threshold, we considered this matching as a true positive (TP) for evaluative purposes. Given an unknown link, which does not match the test dataset, but was predicted by the link prediction algorithm, and the score is greater than the threshold, we consider the case a false positive (FP). Similarly, when we found a known link (same as the test dataset and in the newly predicted links), but the score was below the threshold, we consider this a false negative (FN). Lastly, when we found any unknown link with a score below the threshold, we assigned the sample as a true negative (TN). Using the TP, FP, TN, and FN we calculated the "precision@top-1%", "precision@top-2%", "precision@top-5%", AUC, and PRC.
Data splitting for testing. To evaluating the performance of link prediction algorithms, the test data is generated by excluding a collection of links from the full homogenous network. Our homogenous network contains drug-drug, food-drug, and food-food similarities. We split 30% of links randomly to make the test data set, while the rest of the 70% of links are used for the training dataset. For stability, we repeat this evaluation ten times and report average performance.
Ground-truth evaluation using DDS. Contrary to food-protein interactions 26 , there is no accessible gold standard for widely confirmed food-drug interactions. Therefore, we resorted to the extracted drug-drug interactions from DrugBank for ground truth evaluation. Since the graph representation in FDMine is homogenous (i.e., all nodes are chemicals), we can consider any part of the graph as a representative set of evaluation. Here, we remove 30% of the drug-drug links in the graph. Then, we execute the framework and report top-ranked cases for the precision evaluation. We split 30% DDS links (randomly) for making the test data set, while the rest of the 70% DDS, and all FDS, FFS links are used in the training dataset. Here, we measured the precision in terms of recovering the original links in the DDS subgraph. It should be noted that we also performed an evaluation using a random subset of any type of links (see Results).
We have performed three types of evaluations to benchmark the results. In the first evaluation, a drug can have a link with another drug based on similarity scores. In the second evaluation, a drug will have a correct link with another drug only if it is reported in the DrugBank database. The difference between the second and third evaluations is that the original links in the second evaluation are assumed based on the established similarity measures. Both evaluations will help us establish a comprehensive overview of link recovery in general and the validity of these recovered links using DrugBank. Although drug-drug interactions are examined in these two evaluations, they both provide estimates for the accuracy of food-drug predictions since the graph is homogenous in nature. The following Table 2 lists all the evaluative approaches we have performed in this study. Our proposed FDMine framework. The FDMine framework (see Fig. 1) is composed of several phases.
In Phase 1, raw data is parsed from DrugBank and FooDB databases. In Phase 2, we execute two steps including a) building a homogenous network based on the structure similarity profile, and b) updating the weights of the homogenous network using food compound contributions. Next, the graph is prepared with nodes representing drugs, food and food compounds' composition. In the graph, links are weighted by similarity and contribution scores (see Phase 3 in Fig. 1). When applying the similarity thresholds, the homogenous network produces multiple subgraphs (disjoint graph). We build another version called the joint homogenous graph network and consider executing several link prediction algorithms, including applied path category-based and neighborhoodbased similarity-based approaches. In the final Phase 4, we rank the newly predicted link (based on the score given by our methods), test the performance of the applied methods with the test dataset, and finally, consult the literature to validate the top FDIs found using the different methods. For testing, we perform a comparison using ground-truth and report literature validation for our leading findings (see "Results and Discussion" section).

Results and discussion
The next subsections describe in detail the performance evaluation of FDMine, and the analysis of our novel FDI predictions.
Prediction results of FDMine. Evaluation 1: comprehensive evaluation for the recovery of DDS similarity links. As explained earlier, DDS similarity links are a priority in our evaluation setup as it establishes a ground truth evaluation (see Evaluation 2 results). Here, drug-drug links are based on the similarity scorings we computed. We have applied two different link prediction approaches over two different types of homogenous graph networks. One is the disjoint graph network, and the other is the joint graph network. The applied methods are the path category-based and neighborhood-based similarity-based link prediction algorithms. We used path  Table S4 provides a summary of different models over the disjoint graph network. For the disjoint graph, the SP_2 outperformed other methods. The precision rate for the top 1% (i.e., precision@ top-1) is 84% for SP_2 while RA, the second-best has achieved 64%. For precision@top-2, SP_2 achieved the best results with 60% and L3, the second-best 42%. The highest value for the precision@top-5 was achieved by the SP_2 (40%). In the disjoint version of the graph, neighborhood-based similarity-based methods achieved, on average 17% with varying standard deviation each. However, SP_3 consistently demonstrated poor performance (5%, 3%, 2% for precision@top-1, precision@top-2, and precision@top-5, respectively) compared to all other methods. SP_2 achieved 52% and 26% AUC and PRC, respectively. All neighborhood-based similarity-based methods achieved more than 80% (AUC) except L3 which had a reported precision of 60%. The PRC scores of the RA, AA, and CN were 70%, 65%, and 60% respectively.
When considering the joint version of the graph, different results were attained. The neighborhood-based similarity-based methods showed best results for the top precision@top-1, precision@top-2, and precision@top-5. For the precision@top-1, the RA achieved the best result (71%), followed by AA (67%). For the precision@top-2, L3 and RA both yielded similar performance (39%). Additionally, all neighborhood-based similarity-based methods produced the same result (16%) for precision@top-5. Contrary to the case of the disjoint version of the www.nature.com/scientificreports/ graph, the performance of SP_2 was weak. The SP_2 achieved, 23%, 15%, and 9% for the precision@top-1, preci-sion@top-2, and precision@top-5, respectively (see Tables S18-S20 for detailed results using 10 different random seeds). For the joint graph, the neighborhood-based similarity-based algorithms achieved AUC of more than 90% except L3 (65%). The value of the PRC is also high for the neighborhood-based similarity-based methods. The PRC scores for the RA, AA, CN were 87%, 86%, and 84% respectively. However, SP_3 always (disjoint and joint graphs) showed the weakest results in terms of all performance metrics (precision@top, AUC, and PRC). In Supplementary file 1: Table S5 summarizes the different models over the joint graph network. The comparison graph for the precision@top-1%, precision@top-2%, and precision@top-5% are provided in Fig. 2. For more details, see Supplementary file 1 Figures S7 and S8.
Evaluation 2: ground truth evaluation using DrugBank. The dataset we constructed using DrugBank and FooDB contains drug-drug links. The disjoint and joint dataset contains 2926, and 6581 drug-drug links, respectively. From evaluation 1, out of 2926, and 6581, our method managed to discover 1706, and 4178 of those links respectively, reported as DDIs in the DrugBank. We have considered these 1706 and 4178 as known DDIs and as ground truth. To cross-validate the performance of FDMine, we excluded a portion of known DDIs (or ground truth) as a test dataset from the main dataset and the rest of the dataset was used to train the models. Then, we calculate the precision@top-1%, precision@top-2%, and precision@top-5% and found approximately the same performance of FDMine with the disjoint dataset, and slightly better results for the joint dataset. Here, we have implementing a set of path category-based (path length 2 and 3) and different neighborhood-based similaritybased algorithms to discover new FDIs from two different homogenous (disjoint and joint) graph networks, and IV) used the precision@k metric and calculated the precision@top (top 1%, 2%, and top 5%) for drug-drug links to verify the accuracy of the algorithms with the test dataset. (b) illustrates a zoom-in view of food-drug interactions such that food items are represented as nodes that are then linked to their composition nodes. The structural similarity is between the small molecule drugs and the food composition nodes. An aggregation step is applied to compute the similarity of food-drug based on the composition and contribution. This figure was generated using MS PowerPoint v16.54.   Table 3. Performance evaluation of ground truth using disjoint dataset and path category-based (path length-2) method.  Table 4. Performance evaluation of ground truth using joint dataset and Neighborhood-based Similaritybased (RA) Method.  (DDS, FFS, FDS). Here we randomly assigned 30% of all (DD, FF, FD) links from the whole dataset to make the test dataset, and the rest of the 70% was used to train the model. We applied 'shortest path length 2' over the disjoint and 'RA' over joint graph. The 30% test dataset from the disjoint and joint dataset contains 26,157 and 27,612 links respectively. FDMine was able to recover an average of 9612.6 (± 5723.06) and 27,448.4 (± 14.20) links from the disjoint and joint datasets using 'shortest path length 2' and 'RA' methods, respectively.

RA
Evaluation 4: evaluation with gold standard dataset. Although Evaluations 1-3 reveal different aspects of the validity of FDMine, we considered reviewing the literature and built a list of well-reported food-drug interactions. We refer to this list of known food-drug interactions as the Gold Standard dataset. These interactions are fully hidden from the inference steps in FDMine and will only be used to evaluate the accuracy of the model. Details on the steps used and cases prepared for the dataset can be found in Supplementary File 1 (see 15. Details of the Gold Standard Dataset). For example, several studies referred to the interaction of Warfarin and Vitamin K 69,70 . In this evaluation, we will test if FDMine can reveal these known interactions. All steps for data preparation and thresholds are the same as the previous evaluations. This gold-standard dataset contain 53 FDIs. FDMine recovered 46.15% and 44.23% from the joint datasets using 'L3' and 'SP_L3' respectively. The recovery rate for the 'SP_L2' , 'RA' , ' AA' , and 'CN' was 5.77%. These results illustrate the advantage of introducing joint links and using longer path length (i.e., path length of 3) in matching ground-truth FDIs. The detailed findings are given in the Supplementary file 1: Tables S8 and S9 for the disjoint and joint datasets respectively. New food drug interaction prediction. After comparing the different approaches for link prediction, we executed the FDMine framework to find top candidates for FDIs. In the framework, we consider taking the top results from the joint and disjoint versions. At the final stage of FDMine, we surveyed the literature to find supporting evidence to the generated predictions. We have performed two batches using different contribution scores (i.e., 0.5 and 0.3, respectively). The default value in the FDMine framework is a 0.5 contribution score. As listed in Supplementary file 1, the results have demonstrated some repeated drugs among the top findings associated with a higher threshold value. A high threshold value will lead to removing more connections in the graph. This will lead to more disjoint subgraphs, and nodes with higher connections within the subgraphs exhibiting higher rank scores. Therefore, we consider a more relaxed threshold and generate Batch-2 results (i.e., contribution score of 0.3). In this batch, we see more diversity in our results. Supplementary file 1 (see 16 Table 5, the interactions we obtained appear to affect key biological pathways including-prostaglandin biosynthesis for inflammatory response 71 , beta-adrenergic signaling for cardiac output modulation 72 and GABA pathway 73 -a GABA based inhibitory neurotransmitter that down-regulates central nervous system stimulation 74 . After examining the results in Table 5, we have found that dietary fatty acids like Oleic acid (FDB012858), Erucic acid (FDB004287), (Z,Z)-9,12-Octadecadienoic (FDB012760) and Elaidic acid (FDB002951), available in foods like Onions-FOOD00006, Garden Cress-FOOD00099, Pomegranate-FOOD00151, etc., can affect prostaglandin biosynthesis via peroxisome proliferator-activated receptor (PPAR) mediated mechanism and Gabaergic pathway. Figure 3a,b highlight the list of these compounds and their interaction with PPAR and GABA-mediated effects, respectively. Similarly, we found evidence of food components like Eugenol (FDB012171), Carvacrol (FDB014512), which can potentially substantiate hypotensive effects when taken with beta adrenergic drugs. For example, Eugenol has been known to cause vasodilation via vanilloid TRPV4 receptors found on endothelial muscles in arteries. Beta-adrenergic drugs are prescribed to patients suffering from hypertension to decrease blood pressure (BP). So, when combined, this can cause an increased reduction in BP.
Prostaglandins are compounds that play a role in the anti-inflammatory pathway during injury 75 . An essential molecular building block in humans is arachidonic acid. It interacts with the peroxisome proliferator-activated receptor (PPAR) to form various prostaglandins 75 or anti-inflammatory compounds. Various dietary fatty acids (see Table 5; Oleic acid, Linoleic acid, Erucic Acid, Eldaic acid) are also absorbed via the exogenous chylomicron pathway and hydrolysed for various tissues to absorb them for further processing 76 . Some of our predicted compound items include Oleic acid-FDB012858, and Erucic acid-FDB004287, that are similar to arachidonic acid and are analogous 77 structures, belonging to the fatty acid group and are found in many dietary sources including Celery-FOOD00015, Peanuts (FOOD00016) and Burdock-FOOD00017 (see Table 5). Our literature review has highlighted reported evidence on the influence of these dietary fatty acids on the arachidonic acid cycle. Arachidonic acid is a precursor for the synthesis of various other biomolecules, associated with antiinflammatory pathways 78 . During injury, inflammation occurs and causes arachidonic acid to bind with PPARgamma receptors as shown in Fig. 3a to form prostaglandins or protective anti-inflammatory agents to curb injury 79 . Fatty acids (see Table 5) also compete with arachidonic acid during injury or inflammation to produce various substituted prostaglandins belonging to a family of derivative compounds known as eicosanoids 80  www.nature.com/scientificreports/ fewer anti-inflammatory profiles than other eicosanoids produced directly from arachidonic acid 82 . It is worth noting that arachidonic acid belongs to the list of essential fatty acids, including alpha-linoleic acid and docosahexaenoic acid 82 . There has been evidence to show that dietary sources such linoleic acid, erucic acid and elaidic acid (see Table 5) did increase PPAR gene expression in healthy subjects 83 . In 2012 Hung-Tsung Wu et al. also showed the interaction of oleic acid with PPAR-g receptors 84 . These results may suggest that taking drugs like Doconexent-DB03756 with foods such as FOOD00099-Garden Cress, FOOD00151-Pomegranate, FOOD00009-Chives, FOOD00062-Hazelnut, FOOD00525-Macadamia, can alter the normal dynamics of anti-inflammatory responses. Arachidonic acid is biosynthesized from dietary linoleic acid and released by phospholipases during inflammation. This pathway is also known as the COX or Cyclooxygenase pathway 85 .
Food compound composition with pharmacological effects (similar function-driven evidence). Here, we relaxed the contribution score to 0.3 (i.e., Batch-2) to obtain a diverse set of results (Supplementary file 1: 17. Batch-2 Description and Result). In this part of our literature validation, we analyze the potential of similar functions of drugs and food compounds on specific diseases. The results in Table 5 highlight some correlations with a group of drugs called beta-adrenergic drugs and essential oils. Our top correlated pairs of food and drug observed that both of them caused reduced blood pressure. Beta-blockers are used to treat hypertension in patients. Beta-blockers consist of b1, b2, and b3 subtype receptors. Beta-blockers can fall into various categories based on the extent of selectivity of binding across these subtypes. For example, Atenolol (DB00335), Bisoprolol (DB00612), Metoprolol (DB00264) and Esmolol (DB00187) are b1 selective blockers 91 .
The effects of b1 blockade include a decrease in cardiac output by inhibiting the SA and AV nodes, thereby decreasing stroke volume 86 . Propranolol (DB00571) and Penbutolol (DB01359), on the other hand, are nonselective beta-adrenergic blockers. Studies have also observed that beta-blockers may also contribute to GABA turnover in the cerebrum 87 .
We were able to confirm that fatty acids (Oleic acid (FDB012858), Erucic acid (FDB004287), (Z,Z)-9,12-Octadecadienoic (FDB012760) and Elaidic acid (FDB002951) ) can cross the blood-brain barrier and be beneficial to relieve anxiety 88 . They are also believed to act via stimulation of GABA-A based receptors. Benzodiazepines, barbiturates 89 and some anticonvulsants act by modulating the GABA receptors 92 . The inhibitory effects of GABA help relieve seizures. However, drugs like Pregabalin and Gabapentin instead act by blocking calcium or sodium channels to help stabilize seizures. Although this is not directly interacting with GABA receptors, it helps reduce excitatory neurotransmitters. Thus, they may help substantiate antiepileptic activity by increasing amounts of GABA.
In summary, the discussed pairs of food ingredients and drugs can influence their own pharmacokinetics. For example, taking beta-adrenergic drugs with food containing terpenes like Eugenol and Methyl chavicol can potentially cause more pronounced antihypertensive effects. Taking antiepileptic medications and foods containing fatty acids can potentially elevate overall GABA levels beyond the levels achieved when taken individually. Moreover, dietary fatty acids can also interact with the PPAR receptor during inflammation to produce variations of prostaglandins. This demonstrates the feasibility of using our FDMine framework to identify potential food and drug interactions.
Limitations and future work. There are some limitations associated with our current framework. For instance, FDMine did not consider the weight of the nodes (e.g., the degree of a node) to generate the joint version of the graph from the disjoint representation. Instead, FDMine used randomly chosen nodes to establish connections across disjoint groups. Additionally, the reported results illustrate that the precision@top has a significant drop for the SP_3 compared to the SP_2. In FDMine, an accumulated score that combines SP_2 and SP_3 was not introduced. This could have alleviated the drop in performance evaluation while taking advantage of the extra information gained from a longer path length when possible. Given the nature of a data-driven approach, an experimental validation would have helped further validate the findings of FDMine.
Future work will address various aspects of how FDMine can be improved upon. Currently, FDMine is evaluating the shortest path using the Dijkstra algorithm. One possible future research avenue would be to modify Dijkstra's algorithm to find the longest path by inverting the similarity scores. Another research option would be to assign different edge weights to generate the joint version of the disjoint graph. Similarly, the connection between two disjoint graphs can be made based on the degree of the node, e.g., maximum or minimum degree nodes can be chosen instead of randomly chosen nodes.
Moreover, the neighbourhood algorithms do not consider path weight. These algorithms function based on their mathematical formulation where the edge weight is missing. Here, we can modify this formulation so that these algorithms can incorporate the edge weight to produce FDMine's results.

Conclusion
In this study, we introduced FDMine as a framework to infer the interaction between food compounds and drugs using a homogenous graph representation. This homogenous representation enables us to take advantage of reported drug-drug interactions for accuracy evaluation, especially when accessible ground truth for FDIs is lacking. We considered several resources to construct food-drug, drug-drug, and food-food similarity profiles. FDMine uses established path category-based and neighborhood-based similarity methods to predict FDIs efficiently. A subset of Drug-drug interactions was used as ground-truth evaluations. This proposed methodology is Scientific Reports | (2022) 12:1061 | https://doi.org/10.1038/s41598-022-05132-y www.nature.com/scientificreports/ based on encoding all entities including drug and food into a homogenous graph of chemical nodes. Therefore, any part of this graph can then be used as a representative evaluation, potentially informative to clinicians and researchers. We prepared a gold standard dataset from the well-referenced FDIs reported in the literature to perform external validation. Additionally, we have performed two types of evaluations to benchmark results using different parts of the graph. The shortest path-based method has achieved a precision 84%, 60% and 40% for the top 1%, 2% and 5%, respectively. FDMine was able to achieve an average 99.4% recovery rate from 27,612 available links in the joint version of the graph. For the gold standard evaluation, FDMine recovered 46.15% of the ground-truth cases from the joint datasets using a path length of size 3 and the neighborhood-based algorithm. We validated the top FDIs predicted using FDMine to demonstrate the applicability of the model. In the literature validation, we discussed the therapeutic effects of a group of food items. We observed that a set of FDIs may reduce blood pressure, have anti-inflammatory effects or reduce seizure. The benchmark results and literature review suggest that FDMine can help identify FDIs precisely and may represent an advanced strategy in drug discovery.

Data availability
The code and datasets supporting the conclusions of this article are included within the article (and its supplementary files) or is made available at https:// github. com/ mosta fiz67/ FDMine_ Frame work.