Phenotype-oriented network analysis for discovering pharmacological effects of natural compounds

Although natural compounds have provided a wealth of leads and clues in drug development, the process of identifying their pharmacological effects is still a challenging task. Over the last decade, many in vitro screening methods have been developed to identify the pharmacological effects of natural compounds, but they are still costly processes with low productivity. Therefore, in silico methods, primarily based on molecular information, have been proposed. However, large-scale analysis is rarely considered, since many natural compounds do not have molecular structure and target protein information. Empirical knowledge of medicinal plants can be used as a key resource to solve the problem, but this information is not fully exploited and is used only as a preliminary tool for selecting plants for specific diseases. Here, we introduce a novel method to identify pharmacological effects of natural compounds from herbal medicine based on phenotype-oriented network analysis. In this study, medicinal plants with similar efficacy were clustered by investigating hierarchical relationships between the known efficacy of plants and 5,021 phenotypes in the phenotypic network. We then discovered significantly enriched natural compounds in each plant cluster and mapped the averaged pharmacological effects of the plant cluster to the natural compounds. This approach allows us to predict unexpected effects of natural compounds that have not been found by molecular analysis. When applied to verified medicinal compounds, our method successfully identified their pharmacological effects with high specificity and sensitivity.

ScIEnTIfIc RePoRtS | (2018) 8:11667 | DOI: 10.1038/s41598-018-30138-w focused on molecular analysis, such as molecular structure or target protein similarity, to predict the potential effects of natural compounds. However, many natural compounds do not have molecular structural information available, and their target protein information remains mostly unknown ( Supplementary Fig. 1). Hence, these approaches often encounter obstacles to large-scale analysis [18][19][20] .
The accumulated knowledge of herbal medicine, especially information on the efficacy of medicinal plants, can be used as a key resource to overcome this limitation. Even if no molecular information on many natural compounds is available, large-scale analysis can be performed by investigating the relationship between the known efficacy of plants and natural compounds from information on herbal medicine. For this purpose, we should consider the following characteristics of herbal medicine: (i) The efficacy of medicinal plants is described in various phenotype terms ( Supplementary Fig. 2). The efficacy information contains both high-level concepts, such as inflammation and hormone imbalance, and low-level concepts, such as aortitis and diabetic retinopathy ( Supplementary Fig. 3). Furthermore, similar concepts, such as synonyms and symptoms of diseases, are described in various forms. Therefore, to utilize the information on plant efficacy, these complex associations should be considered. For example, when extracting plants associated with urination, we can achieve more relevant results by examining phenotypes associated with urination, such as dysuria, urethral stones, and urinary tract abnormalities. (ii) Medicinal plants contain numerous natural compounds 21 . Unlike a single-target drug, herbal medicines consist of complex multicomponent mixtures of natural compounds. Moreover, even if we select plants that are associated with a particular disease, they are likely to be associated with many other diseases. Therefore, analyzing which natural compound in the plant is associated with a particular disease is difficult.
Here, we present a phenotype-oriented network analysis to identify pharmacological effects of natural compounds from herbal medicine. To address the characteristics of plant efficacy information in herbal medicine, the relationships between known plant efficacy and 5,021 phenotypes were quantified by applying a random walk with restart (RWR) algorithm, taking into account the hierarchy of the phenotypic network. This approach allows us to extract plant clusters with similar efficacy by considering complex phenotype associations. We then hypothesized that significantly enriched natural compounds in a plant cluster would be closely related to the efficacy associated with the plant cluster. To test this hypothesis, we investigated the predicted pharmacological effects of natural compounds based on the verified and candidate effect sets. We found that our predictions covered a large number of the results reported in previous work. More importantly, this approach solved the bottleneck by predicting pharmacological effects of natural compounds that were difficult to analyze due to a lack of molecular information. In conclusion, the novelty of our method is threefold: (i) It is the first phenotype-based in silico method that identifies pharmacological effects of natural compounds from herbal medicine without molecular analysis. (ii) Large-scale analysis can be performed by addressing the characteristics of herbal medicine systematically. (iii) It can be used as a preliminary tool to screen medicinal agents from numerous natural compounds.

Materials and Methods
Phenotype-oriented network analysis. We designed a novel algorithm to identify pharmacological effects of natural compounds from herbal medicine. The algorithm consists of four steps ( Fig. 1): (i) constructing phenotype vectors of plants by investigating the relationships between known plant efficacy and thousands of phenotypes in a phenotypic network; (ii) extracting plant clusters with similar efficacy by applying hierarchical clustering to phenotype vectors; (iii) finding significantly enriched natural compounds from the plant clusters; and (iv) identifying potential pharmacological effects of the natural compounds.
We generated phenotype vectors that cover a large number of quantified pharmacological effects of plants (Fig. 1a). Each phenotype vector contains 5,021 phenotypes defined by Medical Subject Headings (MeSH) and Online Mendelian Inheritance in Man (OMIM) (Supplementary Data 1). In the phenotypic network, phenotype nodes close to the root node have broad concepts, phenotype nodes distant from the root node have narrow concepts. Therefore, we assigned the edge weight between nodes in the phenotype network based on semantic similarity, which measures how similar two phenotypes are by determining closeness in a hierarchy. Next, the pharmacological effects of the plants were quantified by applying the RWR algorithm to the edge-weighted phenotypic network. The initial values of the phenotypic network were assigned to the known efficacy of plants, and their diffusion states were calculated by the RWR algorithm. In this process, we generated phenotype vectors for 2,286 plants.
Next, hierarchical clustering was performed to extract plant clusters from the phenotype vectors (Fig. 1b). In contrast to previous studies that selected plants with a specific phenotype, we extracted plant clusters with similar efficacy by taking into account a large number of phenotypes with their hierarchical relationships. For example, Viola tricolor, Thymus vulgaris and Chamaecyparis obtusa have been clustered because they are known to be effective in respiratory-related diseases or symptoms, such as scrofula, pertussis, and panting. Each plant cluster consists of an average of 3.6 plants containing an average of 43.3 natural compounds. Since each plant cluster contains a large number of natural compounds, the relationship between the pharmacological effects of the plant cluster and the natural compounds is complex. To solve this problem, we extracted significantly enriched natural compounds from plant clusters by using Fisher's exact test (Fig. 1c) and selecting natural compounds with a p-value lower than a threshold value. Finally, we investigated the potential pharmacological effects of the natural compounds (Fig. 1d). Our underlying hypothesis was that statistically significant natural compounds present in a plant cluster would have the pharmacological effects of the plant cluster. Therefore, the phenotype vectors of the plants belonging to the plant cluster were averaged and mapped to the enriched natural compounds. To validate the proposed method, we collected the following information. Drug information was acquired from DrugBank v.4.0 26 . Potential effects of natural compounds were collected from CTD 27   Quantifying the pharmacological effects of medicinal plants. We constructed a phenotypic network based on the hierarchical relationship of UMLS 25 and then calculated semantic similarity to measure the quantitative distance between phenotypes (Fig. 2a).
A relation between two general phenotype concepts, such as inflammation and hormonal imbalance, implies a reasonably large difference, while one between two specific concepts, such as diabetes mellitus and diabetic retinopathy, represents a small difference. Therefore, we applied the semantic similarity measure proposed by Wu & Palmer (wup) 38 and defined by the following equation (Fig. 2b). where lcs(c 1 , c 2 ) is the lowest common subsumer of concepts c 1 and c 2 . We assigned the edge weights of the phenotypic network based on the semantic similarity scores between phenotype nodes. Next, we performed the RWR algorithm to investigate the quantified pharmacological effects of medicinal plants in the edge-weighted phenotypic network. RWR simulates a random walker from its seed nodes and iteratively transmits the node values to the neighbor nodes with probabilities proportional to the corresponding edge weights [39][40][41] . First, we assigned initial values to seed nodes in the phenotypic network based on the known efficacy information of a plant (Fig. 2c). Second, we calculated the transition probability from a node to its neighbor nodes. We assumed the transition probability to be the value of the quantified relationship between phenotypes on the phenotypic network. The transition probability vector of each node at time step t + 1 was defined as where r represents the restarting probability of the random walker at each time step, which we set to 0.7 in this study. W denotes a transition matrix that is the column normalization of the adjacency matrix based on the edge weight of the phenotypic network 42 (Fig. 2d). p t represents the probability vector of each node at time step t, and p 0 represents the initial probability vector. The RWR algorithm simulates the random walker until all nodes reach the steady state (p t+1 − p t < 10 −8 ). We defined a list of phenotype values of a plant as a phenotype vector.
Clustering plants based on phenotype similarity. We merged all phenotype vectors into a matrix and applied hierarchical clustering to extract plants with similar pharmacological effects. Hierarchical clustering was performed by using R module pvclust 43 , which involves multiscale bootstrap resampling of 1,000 iterations to assess statistical significance. We selected clusters with an approximately unbiased (AU) p-value greater than a specific threshold. The AU p-value indicates the extent to which a cluster is strongly supported by the data, and a higher AU p-value indicates stronger support for the clustering. In this study, we set the threshold of the AU p-value to 0.95. In the clustering process, the correlation was calculated by the cosine distance.

Performance evaluation. Our method predicts pharmacological effects of natural compounds based on
phenotype-oriented network analysis. To examine the quantitative performance of the proposed method, we calculated the average area under the curve (AUC) scores of the receiver operating characteristic (ROC) and precision-recall (PR) curves (Fig. 3). We used 21 drugs from DrugBank and 92 compounds from CTD as our gold and silver standard datasets, respectively, for assessment of the prediction of the therapeutic and potential candidate effects. As a result, we obtained the AUROC and AUPR scores for therapeutic (AUROC T = 0.725 ± 0.085, AUPR T = 0.649 ± 0.080) and potential candidate effect (AUROC P = 0.754 ± 0.077, AUPR P = 0.685 ± 0.071) predictions. For this purpose, we averaged the AUROC and AUPR scores based on the natural compounds (Fig. 3a). To examine the importance of the hierarchical relationships, we compared the prediction performance with and without considering hierarchical relationships. The results showed that the performance decreased when hierarchical relationships were not considered in predicting therapeutic (AUROC T = 0.689 ± 0.079, AUPR T = 0.605 ± 0.082) and potential candidate effects (AUROC P = 0.719 ± 0.060, AUPR P = 0.644 ± 0.073). Furthermore, we compared our method with a network-based approach, the target-closeness method, which predicts drug efficacy by calculating the closeness between drug targets and disease genes 45 . The results indicated that our method, which uses herbal medicine information without any molecular analysis, exhibited better performance than the target-closeness method (AUROC T = 0.706 ± 0.089, AUPR T = 0.622 ± 0.068, AUROC P = 0.727 ± 0.081, AUPR P = 0.653 ± 0.062) in both therapeutic and potential candidate effect prediction.
ScIEnTIfIc RePoRtS | (2018) 8:11667 | DOI:10.1038/s41598-018-30138-w Next, we calculated the ROC and PR performance for each phenotype and averaged them to normalize the occurrence of different phenotypes (Fig. 3b) to determine whether our method focuses only on the prediction of a particular phenotype or not. For this purpose, we calculated the phenotype ranking for each natural compound based on a phenotype vector and then calculated the ROC and PR of the phenotype based on the ranking in all natural compounds. The results confirmed that when phenotype occurrence is normalized, similar performance in predicting therapeutic (AUROC = 0.715 ± 0.061, AUPR = 0.631 ± 0.064) and potential candidate effects (AUROC = 0.728 ± 0.066, AUPR = 0.661 ± 0.068) is obtained. These results indicate that our phenotype-oriented network analysis is relevant for predicting the pharmacological effects of natural compounds.
Next, by examining how the results of the proposed method differ from the results of the target-closeness method, we investigated whether it could be used as an alternative resource for drug discovery in the future. We first sorted the predicted pharmacological effects of natural compounds obtained from our method and the target-closeness method by their scores and then checked the rank correlation. The results confirmed that the rank correlation (r c ) between the two sets was very low (r c = 0.0019) and that there was no significant difference in the rank correlation scores of our results and a random set (r c = 0.0012). Next, we extracted the top 10% of phenotypes from our method and from the target-closeness method and calculated the Tanimoto coefficient (T c ) to investigate how similar the two sets were. The results confirmed that the Tanimoto coefficient between the results of our method and of the target-closeness method was very low (T c = 0.065 ± 0.013) and was not significantly different from the Tanimoto coefficient between our results and a random set (T c = 0.051 ± 0.010). Overall, these findings indicate that the results of our method and of the target-closeness method are statistically significantly different. The results show that the proposed method is complementary to molecular analysis and can be used as a tool to predict the pharmacological effects of natural compounds.
In contrast to the target-closeness method, which analyzes the compound efficacy via protein interaction information starting from known molecular targets, our method identifies statistically significant compounds and their efficacy by using information on the known efficacy of plants and on their constituent compounds. This discrimination provides a new way to analyze natural compounds. Previous molecular-based approaches are difficult to apply to natural compounds since the molecular target information on natural compounds is very limited (Supplementary Fig. 1). As an alternative method, we have been able to make new predictions by using the efficacy information of medicinal plants accumulated in herbal medicine and plant chemical composition information. Consequently, our method has produced novel predictions by analyzing new aspects of natural compounds. External literature validation. To validate the reliability of our method, we confirmed whether the predicted natural compounds and their candidate effects were identified in the external literature. We first ranked the predicted pharmacological effects of the 1,294 natural compounds by their scores and made three independent sets by selecting the top 10%, bottom 10% and random 10% of results containing 495,602 natural compound-phenotype associations. For the pharmacological effects of the selected natural compounds, we counted co-occurrences (n c ) from PubMed abstracts, calculated the Jaccard index and conducted Fisher's exact test (n f ) ( Table 1). We also performed the Mann-Whitney U test and calculated the corresponding p-values to check for significant differences in the literature evidence for the high-scored, low-scored and random sets 46 . A p-value from the Mann-Whitney U test lower than 0.05 was considered statistically significant.
The average co-occurrence for the high-scored set (n c = 1.41) was 12.8 and 3.8 times larger than the average co-occurrence of the low-scored set (n c = 0.11) and the random set (n c = 0.37). We also normalized the co-occurrence value by the Jaccard index to correct for the differences in the frequencies of natural compounds and of phenotypes. The average Jaccard index value of the high-scored set (JI = 2.2 × 10 −4 ) was 20.0 and 2.8 times higher than the values of the low-scored set (JI = 1.1 × 10 −5 ) and the random set (JI = 7.8 × 10 −5 ). Furthermore, we performed Fisher's exact test to find the significant associations (p-value < 0.001). To obtain the Fisher's test value for each association, the number of PubMed abstracts that included both the natural compound and the target phenotype was counted. The number of significant associations of the high-scored set (n f = 3,281) was 4.4 and 2.8 times higher than those of the low-scored set (n f = 746) and the random set (n f = 1,136). In addition, the p-values of the Mann-Whitney U test indicated that the difference in the literature evidence among the high-, low-scored and random sets was significant. These results show that our method can be used as a tool to identify pharmacological effects of natural compounds.
Novel pharmacological effects of natural compounds. Our method uses herbal medicine information without molecular analysis to predict pharmacological effects of natural compounds, enabling us to discover effects previously undetected by the target-closeness method. To find novel pharmacological effects of natural compounds, we first filter out meaningless associations. For this purpose, we selected a cutoff for the prioritized list of phenotypes of each natural compound according to the F1-measure, the harmonic mean of precision and recall. The F1 score was calculated for the threshold of phenotype values from 0 to 0.95 with intervals of 0.05, and  the best performance was obtained at 0.20. Based on this threshold value, the predicted pharmacological effects of natural compounds were filtered (Supplementary Data 4). Next, we found the PubMed evidence of predicted pharmacological effects of 10 natural compounds through manual curation (Table 2) and analyzed the results that differed from those of the target-closeness method. For instance, puerarin was investigated as a treatment for stroke 47 . However, the calculated distance between the target proteins of puerarin and the stroke-associated proteins in the molecular network shows that they are far away from each other (average shortest distance = 3.32), close to random (p-value < 0.001). Therefore, that the target-closeness method does not appear to show that puerarin can be used as a treatment for stroke. However, in our method, puerarin receives a high score for stroke (score = 0.773) and is proposed as a potential medicinal agent because many medicinal plants that contain tocopherol are known to be effective against stroke in herbal medicine. Furthermore, we can predict additional pharmacological effects of puerarin, such as the treatment of fever, epistaxis and perspiration, that have not been reported in DrugBank and CTD. From these results, we believe that our method can be used as an alternative tool to identify potential pharmacological effects of natural compounds.

Discussion
Herbal medicine has methodically collected information on medicinal plants for thousands of years and can be used as an important resource in drug development, in combination with information on natural compounds obtained by modern high-throughput screening techniques. Here, we introduce a phenotype-oriented network analysis to predict pharmacological effects of natural compounds from herbal medicine. The efficacy information in herbal medicine includes both high-and low-level phenotype concepts, and there are various associations between these concepts, such as synonym, symptom, superordination and subordination. Moreover, since natural sources are composed of various natural compounds, determining which natural compound is associated with a particular phenotype is difficult. In this study, the relationships between known plant efficacy and 5,021 phenotypes were quantified by considering the hierarchy of the phenotypic network. This approach enabled the extraction of plant clusters with similar efficacy by considering complex phenotype associations. From the plant clusters, we can identify significantly enriched natural compounds and their potential pharmacological effects. The proposed method is meaningful in that pharmacological effects of natural compounds were identified by utilizing herbal medicine information, in contrast to conventional methods that focus on molecular analysis. This approach enables large-scale analysis since it can be applied even in the absence of molecular information on natural compounds. In evaluating the prediction performance, we confirmed the successful prediction of pharmacological effects of natural compounds by comparing the results with those of the target-closeness method, which relies on molecular analysis. We also found that the predicted results of the proposed method and of the target-closeness method did not overlap. This result indicates that the proposed method enabled us to identify pharmacological effects of natural compounds that went undetected by the target-closeness method.
There are some additional considerations to improve our method. First, molecular information is not taken into account in the current study. We mentioned the lack of molecular information on natural compounds as a problem of conventional methods. However, this limitation can be solved with further experiments and improved techniques. We expect more accurate predictions to be made by using both herbal medicine and molecular information appropriately. Second, advanced methods are needed to predict the pharmacological effects of natural compounds. Currently, we extract significantly enriched natural compounds from the plant cluster and map the averaged pharmacological effects of the plant cluster to the natural compounds. However, the pharmacological effects of the natural compounds and the plant cluster cannot be the same and will require further advanced analysis for precise prediction. Nevertheless, these limitations can be taken into account through further improvements. We believe that this study enables us to perform large-scale analysis and to provide a new direction for future study by systematically addressing the characteristics of herbal medicine information.