FlavorGraph: a large-scale food-chemical graph for generating food representations and recommending food pairings

Food pairing has not yet been fully pioneered, despite our everyday experience with food and the large amount of food data available on the web. The complementary food pairings discovered thus far created by the intuition of talented chefs, not by scientific knowledge or statistical learning. We introduce FlavorGraph which is a large-scale food graph by relations extracted from million food recipes and information of 1,561 flavor molecules from food databases. We analyze the chemical and statistical relations of FlavorGraph and apply our graph embedding method to better represent foods in dense vectors. Our graph embedding method is a modification of metapath2vec with an additional chemical property learning layer and quantitatively outperforms other baseline methods in food clustering. Food pairing suggestions made based on the food representations of FlavorGraph help achieve better results than previous works, and the suggestions can also be used to predict relations between compounds and foods. Our research offers a new perspective on not only food pairing techniques but also food science in general.

to determine whether a food ingredient is essential in a recipe. This approach uses two different recipe networks to find which ingredients go well together or can be used as substitutes for better recipes. There have been also studies on the analysis of food preference and food pairing according to regional characteristics each in China 6 and India 7 . Researches on food preferences done by Wagner et al. 8 , Abbar et al. 9 focused on studying personal food preferences using online user data. On top of that, Zhang et al. 10 proposed the restaurant recommendation to guide dining preferences based on the user's food history. Other approaches 11,12 that combine case-based reasoning and deep learning for automatic recipe generation have been introduced. KitcheNette 13 uses deep Siamese neural networks trained on a large recipe dataset to predict food pairing scores. The hidden representations from the shared embedding layer of KitcheNette are used for predicting the co-occurrence of food ingredients in recipes and contribute to discovering novel food pairings by referencing similar food representations. However, as these approaches are solely based on statistical co-occurrence among many recipes, chemical compound information is not taken account in constructing food representations and recommending food pairing.
Many studies have been made to construct different types of representations using data-driven approaches and semantic concepts employed in other research fields. Semantic concepts are coherent ideas that come from words in language, people nodes in social networks, entities in databases, and so on. Word2vec 14,15 is a neural network that learns distributed representations of word vectors trained on textual data (sentences). In food research, Im2recipe 16 utilized word2vec to create food representations based on a large corpus of recipes. These food representations were extensively used for inferring food images 16 and generating novel food recipes 17 . Reciptor 18 most recently proposed a set transformer-based model to obtain recipe embeddings and uses a knowledge graph (KG) 19 derived triplet sampling approach to optimize the learned embeddings.
In graph-based embedding approches, node2vec 20 has been employed for building node representations from network data. Node2vec has also been used in network analysis and graph mining tasks. Node2vec generates random walks according to the relations of its network where the walks are analogous to sentences in word2vec. The nodes of the network are trained on each walk where the neighboring nodes serve as contextual information. Metapath2vec 21 which generates heterogeneous node representations using large-scale networks has been recently introduced. The random walks called metapaths generated by metapath2vec 21 are used for constructing similar representations of heterogeneous nodes based on commonly linked nodes. These data-driven and graphbased approaches that embed conceptual representations with rich domain-specific information may improve food pairing recommendations as the approaches can be used to construct food representations based on the relations between different foods and chemical compounds.
Using the approaches of previous studies, we built FlavorGraph (Fig. 1) which is a large-scale graph network of food and chemical compound nodes. FlavorGraph (Fig. 1) contains 6653 food ingredient nodes and 1561 food-related chemical compound nodes, 84 drug-like chemical compound nodes, and two relations among them. First, the relations between ingredients and ingredients (111,355 edges) are based on their probability of being used together (NPMI, Normalized Point-wise Mutual Information) in 1 million recipes (Fig. 1A). The relations between ingredients and chemical compounds (35,440 edges) were obtained from food-related academic resources that specify such food-compound relationships (Fig. 1B).
Then we adopted the graph node embedding method of metapath2vec 21 with an additional chemical structure learning layer to make conceptual representations of food. A metapath is a predefined sequence of node types  Fig. 1 for better illustration. Note that the whole graph was used for model training.
which are connected as edges according to the graph. Sub-graphs that satisfy the metapath condition were randomly selected from FlavorGraph. The embedded node vectors of FlavorGraph, which are food representations, were trained in a skip-gram fashion based on their sub-graph connectivity. Food ingredient nodes are further divided into two types: 'chemical-hub ingredient nodes' and 'non-hub ingredient nodes' . Among the ingredient nodes, we refer to the nodes that have a relation with the chemical compound nodes as 'chemical-hub ingredient nodes' . We found that only 416 food ingredient nodes (chemical-hub ingredient nodes) of the 6653 food ingredient nodes have chemical information, but the remaining 6237 nodes (non-hub ingredient nodes) do not. A small portion of food ingredients having such chemical information can lead to several model optimization issues. To overcome this issue, we employ two methods in this study. First, we created food-specific metapaths 21 where all three types of nodes are involved. Our intuition is to enforce the embedding model to pass the information from chemical compound nodes to non-hub ingredient nodes via chemical-hub ingredients nodes in a single metapath. Second, upon food pairing relations regarding the chemical information of food ingredients (35,440 edges), we also utilized the statistical co-occurrence of two food ingredients (111,355 edges). Therefore, we designed a model that considers not only the chemical aspects of food pairing but also the statistical aspects as well.
To further elaborate on food-specified metapaths, we created multiple to ensure that chemical information from compound nodes is passed to non-hub ingredient nodes via chemical-hub ingredient nodes. The chemicalhub ingredients nodes act as important intermediary nodes between two disjoint heterogeneous nodes (e.g., flavor compounds to chemical-hub ingredients to non-hub ingredients) and learn as much chemical as possible. Likewise, the created metapaths help food representation vectors simultaneously train on complex relations such as food-food (e.g., red_wine&steak) and food-chemistry (e.g., red_wine&ractone) relations. The final number of metapaths was 1,114,285 and the average length of them was 46.4. We then trained these metapaths in skipgram fashion to obtain meaningful food representations. The food representations were further used in many downstream tasks including food pairing. The results of food representations are shown in Fig. 3.
We conducted quantitative experiments and a qualitative analysis to evaluate whether the food representations constructed using the metapath-based graph embedding method of FlavorGraph provide meaningful information and help improve performance in food pairing tasks. Compared with other food embedding methods, our metapath-based graph embedding method with a chemical structure learning layer achieved the highest score in a node clustering task of correctly categorizing each food. We also performed a similarity search based on the cosine similarity ranking of the food representation vectors. We assumed that our method would yield food pairings that are chemically similar and/or used in the same recipe context. In practice, for a given query, our embedding method provides ranked results of recommended parings of ingredients that are chemically similar and used in the same recipe. The trained food and chemical representations can be further used with simple mathematical operations to suggest food pairings. In addition, the ranked results of recommended pairings can be used to provide a list of foods related to a particular food molecule or drug compound. In summary, our food representations are meaningful as they helped to improve performance in food clustering and food pairing recommendation.
The major contributions of this work are summarized as follows.
• We introduce FlavorGraph which is a large-scale network graph built from recipe and chemical relations (147,179 edges) of food ingredients (6653 nodes) and chemical compounds (1646 nodes). • We propose a food-specialized graph embedding method that constructs meaningful food representations for FlavorGraph, and demonstrate that our method outperforms other methods in a node clustering task which involves categorizing features of food. • We demonstrate the effectiveness of our food representations in the tasks of food pairing recommendation and food-compound relation prediction.

Results and discussion
Preliminary analysis on food pairings with chemical compound information and recipe co-occurrence. We performed preliminary analysis on the large amount of recipe data from Recipe1M 22 and chemical information in foods from FlavorDB 3 as both datasets are involved in this study. We first examined the relevancy between chemical compound information in foods and their category labels. Next, we looked into the correlation between pairwise ingredient co-occurrence probability in 1 million recipes and pairwise chemical similarity based on the number of overlapping chemical compounds in two ingredients. We denote the number of overlapping chemical compounds as chemical overlaps. Relation between chemical information and category labels of food ingredients Referring to Fig. 2, we found the food ingredients in same categories tend to have similar chemical structures. The scores in each box in Fig. 2 refer to the mean value of all Jaccard similarity scores of two ingredient chemical vectors within their categories. Each position in the ingredient chemical vectors represents the presence of a chemical compound (0 or 1). We observed that the similarity scores within same categories are largely high except for few corner cases. The mean value of similarity scores between same categories are 0.324 while different categories yielded the mean value of 0.167.
Relation between recipe co-occurrence and chemical similarity of two food ingredients In Table 1, we examined the following two hypotheses regarding our FlavorGraph data: (1) Whether the co-occurrence probability of two ingredients in a large corpus indicate good pairings. (2) Whether the high number of overlapping chemical compounds indicate good pairings.
For the first hypothesis, we actively selected 10 ingredient pairs among the top-ranked pairs with high recipe co-occurrence probability (NPMI). Some pairs have large chemical overlaps (e.g., bamboo_shoot&water_chestnut, oregano&basil, raspberry&blackberry and so on). These two ingredients are well-known "Congruent Pairings" 24 , where they share many chemical compounds and have high co-occurrence in many recipes. On the other hand, some frequently co-occurring pairs have small chemical overlaps. For example, miso&sake, anchovy&caper www.nature.com/scientificreports/ and saffron&basmati_rice are ingredients pairs that are chemically dissimilar as they have few shared compounds, but are well-known good pairings. For these cases, these types of pairings are called as "Complementary Pairings" 24 . For the second hypothesis, we ranked the 10 samples of ingredient pairings by their chemical overlaps. The top ranked pairings are mostly similar types of foods (e.g., fruit-fruit and bean-bean), but with quite low co-occurrence probabilities (− 0.178 0.117). We found that high chemical overlaps lead to chemically related foods in similar categories but do not necessarily mean good pairings.
To wrap up our preliminary analysis, we confirmed again that it is important to incorporate both recipe cooccurrence information and chemical information for building FlavorGraph. For improved food representations, our approach for building the metapaths focuses on propagating both types of information to the embedded nodes in our constructed FlavorGraph.  Table 1. Relation between recipe co-occurrence based pairing ranking and chemical overlap based ranking. Co-occurrence probability is normalized point-wise mutual information (NPMI) between pairs of ingredients. NPMI ranges from − 1 to 1, with − 1 (never occurred together), 0 (independent of each other), and 1 (co-occurred perfectly). www.nature.com/scientificreports/ Figure 3A shows the results of applying the graph embedding to FlavorGraph. Using metapath2vec 21 , we generated user-specified metapaths to learn the chemical relations between foods and chemical compounds, and how foods have been used together in recipes. A metapath is a path where heterogeneous nodes (food, compound, and drug) are connected to each other based on their relations in FlavorGraph. The <Food-Compound> relations are from two distinguished food databases 3,23 . The <Food-Food> relations are identified by the statistical co-occurrence of food ingredients in one million recipes 4 . Further details on generating and learning metapaths are provided in the Methods section. Figure 3A shows the 2D t-SNE projection of heterogeneous node representations (food ingredient nodes [chemical hub or non hub nodes], flavor compound nodes, and drug compound nodes). Food nodes are further divided into chemical hub ingredient nodes (indicated by orange diamonds, 6% of total ingredients) that have relations to chemical compounds, and the remaining ingredient nodes (indicated by yellow squares, 94% of total ingredients). Certain clusters of nodes are formed around these chemical hub ingredient nodes according to the nodes food category (e.g. fruit, dairy and so on). In each cluster, the chemical hub ingredient nodes are located closely to the chemical compound nodes (indicated by green and pink dots) which are included in the chemical hub ingredient nodes. Also, there are many food ingredient nodes with chemical information (indicated by yellow squares) near the chemical hub ingredient nodes, which indicates that the food ingredients are likely to be used together in recipes. Figure 3B shows that the wine and citrus fruit nodes are located closely to each other and that the wines and fruits share flavor compounds that are actually part of their composition (e.g., red_wine&genaric_acid and red_wine&ethyl_lactate). Figure 3C shows what kinds of food would go well with wines and citrus fruits (e.g., red_wine&steak and red_wine&marzano_tomato). The results show that if two food ingredient nodes are close to each other in the vector space, they may be a good pair as they share similar chemical structures and have a high possibility of being used in the same recipe.

Representing food ingredients and chemical compounds in a vector space.
Node clustering by food category. To illustrate how well our food representations align with their categorical distribution, we conducted a node clustering task. We measured the clustering accuracy with Normalized Mutual Information (NMI) as done in the work by Dong 21 . Originally, the Mutual Information (MI) is a measure of the similarity between two labels of the same data. Here, the two labels are predicted clustering labels and actual clustering labels. The Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation). The results of the node clustering task are shown in Table 2.
For the clustering categories, we then collected the cuisine categories of 416 chemical hub ingredients from FlavorDB 3 . Originally, the total number of categories in FlavorDB is 34. Instead of using all the information, we chose to merge them into the following representative nine food categories: Bakery/Dessert/Snack, Beverage Alcoholic, Cereal/Crop/Bean, Dairy, Fruit, Meat/Animal Product, Plant/Vegetable, Seafood, and Others. The statistics of each cuisine category is shown in Table 3. Note that we only used 416 chemical hub ingredients for fair comparison since they were commonly used in all other models.
We compared the performance of the our modified version of metapath2vec on FlavorGraph with that of five different food vector embedding methods in clustering food ingredient nodes. To create food vectors for FlavorDB, we used the occurrence of chemical compounds in each food to represent a position of a binary vector (1645-D). For food vectors of Im2recipe, we utilized the word embedding 14 results of one million recipes www.nature.com/scientificreports/ on which a simple skip-gram model was trained. For the food vectors of node2vec and metapath2vec, and the FlavorGraph, we used different graph node embedding methods for each. Node2vec uses a simple random walk and metapath2vec uses probabilistic metapath walks. The graph embedding applied to FlavorGraph also uses metapath2vec but has an additional chemical embedding layer for better representing ingredient nodes. The flavor representations of FlavorGraph showed the best result (NMI score of 0.309) in node clustering. The hyperparameter setting of this experiments is further discussed in M1 of Supplementary Information. Figure 4 shows the 2D t-SNE projection of FlavorDB node representations (A) and that of FlavorGraph node representations (B). As shown in Fig. 4A, clustering result seems to be effective, but similar categories slightly Table 2. Node clustering results. The scores indicate for normalized mutual information (NMI). In the case of FlavorDB, since there is no continuous and dense vector, 1645 binary vectors representing the presence or absence of each chemical compound were used. *We found significant differences among all the results where p < .05 and the number of sampled results ( n = 1000).     (0)). The graph embedding method of FlavorGraph learns not only the food-chemical molecule relations, but also the relations between food ingredients in a large number of recipes. The recipe&novelty recommendation of food pairings (right table in Table 4) shows that FlavorGraph representations can be used to recommend complementary and novel pairings of foods in cooking. In this case, we only retrieved non hub ingredient nodes from FlavorGraph. This was to compare it with KitcheNette model, which has no chemical information. The results of KitcheNette, which predicts the recipe co-occurrence probability of food ingredient pairs, are ranked using Siamese neural networks. FlavorGraph can be used to recom- www.nature.com/scientificreports/ mend ingredients that are highly likely to be used together in recipes. For example, whole_wheat_hamburger_ bun(0), bacon_piece(0), ranch_salad_dressing(0), and so on are highly ranked food ingredients that go well with tomato. In summary, FlavorGraph can be used to recommend pairings of food ingredients that are similar in chemical structure and most likely to be used together in recipes. Table 5, we demonstrated that our learned food representation vectors can also be used to provide pairing recommendations based on the combination of multiple learned food representation vectors. First, we made pairing recommendations for a single food ingredient (e.g. ice_cream, white_wine). We showed that FlavorGraph can recommend complementary ingredients (e.g., caramel_sauce, cookie, cake, candy for ice_cream and sole_fillet, tomato, chicken for white_wine) that would generally go well with given food ingredients. Also, we added two or more ingredient representation vectors, and used them as a pairing query vector. For example, strawberry and chocolate were added to ice_cream, and shrimp was added to white_wine. We simply summed the two vectors, and performed the similarity search on the summed result. For the ranking of the summed vector of [ice_cream+strawberry], strawberry-related desserts (e.g. strawberry_ jello_o, strawberry_gelatin), or other desserts (e.g., brownie, whipped_topping, hot_fudge) that pair well with ice cream and strawberries were recommended. For the ranking of the summed vector of [ice_cream+chocolate], the recommendation results include additional desserts that go well together. White_wine is generally known to go well with seafood, vegetables, and chicken. For the ranking of summed vector of [white_wine+shrimp], different types of seafood that go well with the combination of white_wine and shrimp are recommend.

Case study 2: Can we also predict compound-food relations? Predicting compound-food relations
When our modified version of metapath2vec is trained on all the nodes of FlavorGraph, the method trains not only the relations between foods but also the relations between foods and chemical compounds. Therefore, we believe that it is possible to predict pre-existing compound-food relations and undiscovered ones through a similarity search of our learned food representation vectors. We demonstrated a toy example task to predict relationship between compounds and foods. Figure 5) shows the prediction results of <Flavor profile-Flavor compound-Food> network. To build this relation network, we first picked the 5 most frequently appeared flavor profiles (out of 582) in FlavorDB 3 . We then randomly sampled ten of each corresponding random flavor compounds (out of 1561) upon the picked flavor profiles. Lastly, we randomly sampled 20 food ingredients (out of 6653) for each of the flavor compounds.
The relation edges (grey lines) on <Flavor profile-Flavor compound> were defined based on already known connections found in FlavorDB. The relation edges (grey, blue, red lines) on <Flavor compound-Food> were defined based on whether the similarity search score of two nodes exceeds a certain threshold. This threshold is further discussed in M2 of Supplementary Information. In Table 6, we illustrated the evaluation results on similarity search between food nodes and flavor compound nodes. The grey (straight) lines are the relation edges that were correctly predicted to be connected. The red lines (dotted) are the relations edges that were predicted to be not connected but actually are connected according to the dataset. The blue (dashed) lines are the relations that were predicted to be connected but are unknown up to recent discoveries. In summary, not only did the similarity search predict the existing relations with a fairly high accuracy, but also discover six new <Flavor compound-Food> relations that have not been found before. These newly discovered relations can be used to discover new compound-food relations that have not been previously identified due to cost limitations or limited access to natural food sources.
What are the flavor profiles related to food? As shown in Fig. 5, we added flavor profiles related to existing <Flavor compound-Food> relations. Here, flavors such as fruity, bitter, fatty, floral, and so on are flavor compounds. Knowing and finding Food-Flavor compound-Flavor profile> relations can help in better understanding the chemical effects of food ingredients on to other food ingredients. For example, we can find foods with fruity flavor www.nature.com/scientificreports/ compounds based on the predicted relations between Fruity (flavor profile) and Hexyl_acetate (flavor compound). According to reference search 25 , Hexyl_acetate is known to be used as a flavoring because of its fruity odor, and it is naturally present in many fruits (such as apples and plums) as well as alcoholic beverages. We found that small quantities of Hexyl_acetate are found in a wide variety of fruits and foods including rose_wine, red_wine, champagne, apple, and parmesan_cheese. While the example results we provided here are toy examples of <Flavor profile-Food> relations, we expect to provide hints for future food researches such as flavor-specific searches.

Conclusion
We collected food recipes and chemical information of food to build a large-scale food-compound network graph called FlavorGraph. We used the food-specific metapath graph embedding method with an chemical structure learning layer to generate elaborate food representation vectors for FlavorGraph. Then we demonstrated that the food representation vectors can be used for making food pairing recommendations and predicting new foodcompound relations. However, our work has some limitations; first, more food-related information is needed to better understand food. This work combines only recipe co-occurrence information and chemical information. Second, since the metapath-based graph embedding method uses unsupervised learning, it is difficult to evaluate the food repre sentations. Last, the food pairing recommendations and food-compound relation predictions made in this study have not yet been scientifically evaluated because there is no single correct answer set that can be used for verifying such results. Nevertheless, we believe that FlavorGraph can be employed for better understanding the cooking and medicinal uses of food. Also, the deep learning strategies outlined in this paper can serve as the cornerstone for food pairing and food-relation prediction tasks.

Methods
Building FlavorGraph. We combined various datasets used in several food-related studies for building FlavorGraph which is a very large graph containing food-food and food-chemical compound relations (Fig. 1). FlavorGraph is comprised of three different types of nodes (e.g., food ingredients, flavor compounds, and drug compounds), and three different types of edges (e.g., food ingredient-food ingredient relations, food ingredientflavor compound relations, and food ingredient-drug compound relations). Further details of our constructed graph are provided in Table 7.
Food-food relations. We extracted the ingredient nodes (I) and ingredient-ingredient relations (I-I) from Recipe1M 16,22 which is a large-scale dataset of human-written cooking recipes. 16,857 unique candidate ingredients were selected from the recipes and co-occurring probabilities of all ingredient pairing combinations were calculated. The pairing scores were calculated based on the co-occurring probabilities. The higher the co-occur-  www.nature.com/scientificreports/ ring probability of two ingredients, the higher their pairing score. This scoring approach was first introduced in the work by Teng et al. 5 where Pointwise Mutual Information (PMI) (Eq. (1)) was used to create a simple ingredient network. Similarly, a normalized version of PMI (NPMI) 26 was utilized for training the pairing score prediction model KitcheNette 13 .
As done in the work by Park 13 , we used NPMI 26 scores to construct edges between every two ingredient nodes in FlavorGraph. Among all the possible candidate ingredient pairs from the list of extracted ingredients, we included only highly complementary candidates in our network. Candidate ingredient pairs satisfying one of the following conditions were selected as network edges: (1) Each ingredient appears more than 20 times in Recipe1M and both of them appear more than 5 times in the same recipe. (2) The calculated NPMI score is at least 0.25. Ingredient pairs whose NPMI score is less than 0.25 but ranked in the top 20 below the 0.25 threshold were also included as edges in FlavorGraph. A total of 7199 ingredient nodes and 164,531 ingredient-ingredient edges were included in FlavorGraph. We found that the 400 ingredients (e.g., chicken, rice, banana) are popularly used in cooking while the rest of them (e.g., hyacinth_bean, mammee_apple, drumstick_leaf) are rarely used. Based on the 400 selected ingredient nodes that have flavor compound information, there is a total of 1561 flavor compound nodes and 164,531 ingredient-flavor compound edges. For drug compounds (D) and ingredient-drug compound (I-D) relations, we used HyperFoods 23 to create chemical edges in graph. HyperFoods exploits machine learning to map cancer-beating drug compounds to natural food ingredients. HyperFoods is trained on drug-gene relations to infer food-gene relations by means as suggesting remedial foods for cancer treatment/prevention. 206 natural food ingredients are provided in a dataset used in HyperFoods but we used only 104 of them for constructing FlavorGraph. Based on the 104 ingredient nodes that have drug compound information, there are 84 drug compound nodes and 386 ingredientdrug compound edges.
Graph node embedding in FlavorGraph. We employed metapath2vec 21 which can learn representations of ingredient, flavor compound, and drug compound nodes, and the relations between the nodes. In meta-path2vec, the nodes of FlavorGraph are trained to learn relations between ingredients in food recipes and the chemical information of ingredients.
Generating metapaths from FlavorGraph. The scientific approach to food pairing focuses on the number of shared compounds as explained in Ahn's work 1 ; however, this approach suffers from the limited availability of chemical information on food ingredients (H-chemical-hub ingredients in FlavorGraph). To overcome this problem, we aim to design to learn chemical information (F, D) and even information on non-hub ingredient nodes (N). Food-specific metapaths are generated from FlavorGraph so that the chemical compound nodes (C) can pass information to non-hub ingredient nodes (N) through chemical-hub ingredient nodes (H). To do so, we first set up metapaths (e.g. C-H-N-H'-C') starting from compounds (C) and ending at compound nodes (C) so that the ingredient nodes (H, N, H) in the path can share the same context. For the nodes of the same type shared in a path are set to have different elements meaning that C and C' are different compounds. In the same way, we added metapaths (e.g. N-H-C-H'-N') starting at non-hub ingredients (N) and ending at N. These  www.nature.com/scientificreports/ metapaths were added so that every non-hub ingredient node (N) is trained at least once and learns chemical information. Last, to reflect all nodes in a balanced way, node2vec 20 based on completely random walks was added. The number of walks of each starting node was set to 100 and the maximum length of each metapath was set to 50. A total of 1,114,285 metapaths whose average length is 46.36 were used for final learning. where log p(c u |u; θ) is defined as a softmax function e Xc u ·Xu v∈V e Xv ·Xu , where Xi is the ith embedding vector in X. We also employed negative sampling where a set of M words is sampled from FlavorGraph. Therefore, the above equation is updated as where σ (x) is 1 1+e (−x) and Q is a distribution from where u m is drawn M times. The graph embedding method was used to learn various contexts from different types of nodes in generated food-specific metapaths of FlavorGraph. metapath2vec with chemical structure prediction (CSP) layer. As shown in Fig. 6, we implemented upgrade version of metapath2vec with chemical structure prediction (CSP) layer. As the food-specific metapaths in Fla-vorGraph have not only non-hub ingredient nodes (N), chemical-hub ingredient nodes (H) but also flavor compound nodes (F) and drug compound nodes (D), we included chemical knowledge in our flavor representations. However, even though the flavor and drug compound nodes have chemical structure information represented as CACTVS fingerprints. More information is available from PubChem Fingerprints), their valuable chemical information is not taken into account in learning. CACTVS fingerprints are expressed as 881-dimensional binary vectors where each element of the binary vectors indicates whether a particular molecular substructure exists in a certain chemical compound. Each bit of a fingerprint represents the presence or absence of one of the 881 chemical substructures. A chemical structure prediction (CSP) layer which is designed to use the more detailed information on chemical compounds was added to the original skip-gram model. We expect the www.nature.com/scientificreports/ skip-gram model to learn the available chemical structure information on flavor&drug compounds and use compound-ingredient relations to generate more significant node representations. With the encoded vectors from the CSP layer, the graph embedding method was trained to accurately predict the underlying chemical structure of flavor and drug compounds exclusively. The loss function for the CSP layer is the binary cross entropy between the encoded vectors from the CSP layer and the binary fingerprint vectors corresponding to chemical compounds. The CSP loss function is added to the graph embedding method as follows: where f d is the chemical encoder function for the dth dimension of the D-dimensional encoded vector from the CSP layer, y d is the binary label for dth chemical substructure in the actual CACTVS vector and is the weighting factor ranging from 0.0 to 1.0. Note that D is 881 which is the actual number of chemical substructures in our CACTVS/PubChem fingerprints. As the gradient backpropagates through both the CSP layer and the skip-gram model, we expect our updated embedding ingredient vectors to contain chemical information. However, not all ingredient vectors have such information due to the limited coverage of available databases. While the skip-gram model will make predictions for all ingredients, only those with 881-dimensional binary vectors as labels were included in the additional cross entropy loss. Therefore, the CSP task is trained in a semi-supervised learning setting. In sum, the skip-gram model is used for learning contextual information in pairing paths, and the CSP layer is used for learning chemical structure information.