Introduction

Internal and external etiological factors can disrupt the body's self-regulating homeostasis, which in turn alters a series of metabolic processes, functions and structures. The resulting abnormal life activities manifest as abnormal symptoms, signs and behavior1,2. Under certain conditions, these abnormal processes, triggered by the disturbance of homeostasis following damage, give rise to disease3,4. Traditional Chinese medicine (TCM) has been used to treat diseases for thousands of years5,6,7. TCM drugs are substances with rehabilitative and health-care functions that are used to prevent, treat and diagnose diseases under the guidance of TCM theory8,9,10,11.

TCM drugs mainly come from natural medicines and their processed products, including plant, animal and mineral medicines as well as some chemical and biological products12,13. The most important feature of TCM treatment is its attention to adjusting the functions of the viscera and organs and to the balance and coordination among them. TCM treatment focuses not on the specific bacteria, viruses or other pathogenic factors that infect the human body, but on the body's specific reaction to these factors14,15. The purpose of treatment is to enhance the body's disease resistance and recovery ability; killing bacteria and relieving symptoms are achieved mainly by enhancing the body's own functions. In recent years, TCM has shown certain advantages in the treatment of pneumonia16, shock17, convulsion18, hemorrhage19, acute respiratory failure20, renal failure21, heart failure22, cerebrovascular accident23, etc. It is not only effective, but also safe and simple, with few adverse reactions.

In the past decade, with the rapid development of sequencing technology, a large amount of omics data (genomics, proteomics, metabolomics and so on) has been generated, which has changed how the TCM treatment of diseases is studied. Network pharmacology, which builds on the rapid development of systems biology and computer technology, has been proposed to generate "disease-gene-target-drug" interaction networks. Through network analysis, we can systematically and comprehensively observe how drugs intervene in and influence the disease network, reveal the synergistic effects of multi-component drugs on the human body, and discover multi-target new drugs with high efficiency and low toxicity. Network pharmacology of TCM has become a new approach to drug mechanism research and new drug development24,25,26,27,28. Lu et al. used network pharmacology and molecular docking to study the mechanism of Shaoyao Decoction in the treatment of ulcerative colitis, and found that Shaoyao Decoction can improve the pathological damage of the colon29. Liu et al. collected the main active components of Portulacae Herba, constructed an interaction network of liver cancer target proteins, and found that ketones may be the main material basis of its anti-liver-cancer effect, which is related to the regulation of the MAPK signaling pathway30. Liu et al. used network pharmacology to screen 102 active components of Danzhi Xiaoyao Powder, 147 corresponding targets and 52 targets intersecting with insomnia, and obtained the key components, key targets and key pathways of Danzhi Xiaoyao Powder in the treatment of insomnia31. Yang et al. used network pharmacology to systematically analyze, at the molecular level, the potential anti-tumor mechanisms of the main active components of Prunella vulgaris32. Shen et al. discussed the possible mechanism of Wuling Powder in the treatment of diabetic nephropathy by network pharmacology, and found that Wuling Powder may reduce renal cell damage by regulating apoptosis-related proteins, such as the Caspase family and the BCL2 protein family33.

In recent years, data mining methods have been applied to extract useful information from large amounts of TCM data33. Ren et al. used data mining methods to screen out 47 prescriptions, and identified 14 core drugs and 7 new prescriptions in order to explore the medication rules and mechanism of TCM in the treatment of carotid atherosclerosis (CAS)34. Ga et al. used data mining to select the top five high-frequency active components of each Tibetan medicine, and then used network pharmacology to analyze the mechanism of Tibetan medicine in the treatment of high altitude polycythemia35. In order to study the medication rules of TCM intervention in ferroptosis (iron death), Ou et al. constructed target-compound, compound-TCM and target-compound-TCM networks; frequency statistics showed that the herbs that could interfere with ferroptosis were mainly bitter and pungent in flavor and cold in nature, and mainly belonged to the liver and lung meridians36. Pan et al. reprocessed a large number of Chinese medicine prescriptions for the treatment of primary liver cancer, and obtained the medication regularity of effective prescriptions by data mining and network pharmacology analysis37. Zheng et al. presented four classifiers to infer the compound-target interaction network in the process of network pharmacology analysis38.

In order to better mine omics data and construct the "disease-gene-target-drug" interaction network, a deep learning model is used in this paper. Taking acute lung injury (ALI) as an example, we selected two ALI-related target genes (RELA and STAT3) that have been verified by biological experiments. The active compounds of the two key target genes are collected from the BindingDB database as positive samples. The inactive compounds are generated from DUD-E as negative samples. Molecular descriptors and molecular fingerprints are used to characterize each compound; together they form the full feature set, which contains 374 features. With the full feature set, a forest graph-embedded deep feedforward network is trained and then used to identify the compounds in Erhuang decoction (EhD) and Dexamethasone (DXMS) for the treatment of acute lung injury.

Methods

forgeNet

Forest graph-embedded deep feedforward network (forgeNet) is a machine learning algorithm that has been successfully applied to classification problems with TCGA RNA-seq data. The flowchart of forgeNet is depicted in Fig. 1. The method contains two parts: feature graph construction and a deep neural network. Compared with conventional deep learning models, forgeNet alleviates the high-dimensionality problem of biological data and is more robust. The algorithm is described as follows39.

  • Step 1: feature graph construction

Figure 1 The flowchart of the forgeNet algorithm.

The flowchart of feature graph construction is depicted in Fig. 2. Before the labeled training data are input into the classifier, the features of the data need to be extracted. In forgeNet, the forest \(\xi\) contains \(p\) decision trees (DTs). The forest is fitted with the labeled training data, which generates \(p\) DTs (\(\xi (\theta ) = \{ T_{1} (\theta_{1} ),\;T_{2} (\theta_{2} ),\; \ldots ,\;T_{p} (\theta_{p} )\}\), where \(\theta_{i}\) denotes the parameters of tree \(T_{i}\)). If each binary tree is regarded as a special case of a directed graph, we obtain the following graph set.

$$ \Phi = \{ G_{1} (V_{1} ,\;E_{1} ),\; \ldots ,\;G_{i} (V_{i} ,\;E_{i} ),\; \ldots ,\;G_{p} (V_{p} ,\;E_{p} )\} . $$
(1)

where \(V_{i}\) and \(E_{i}\) represent the vertex set and edge set of \(G_{i}\), respectively.

Figure 2 Feature graph construction.

To integrate the directed graph set \(\Phi\), the final aggregated graph is obtained as the union of all graphs in the set.

$$ G(V,E) = \bigcup\limits_{i = 1}^{p} {G_{i} } . $$
(2)
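The construction in Eqs. (1) and (2) can be illustrated with a minimal sketch. It assumes a hypothetical tree encoding (a nested dict holding the index of the splitting feature and two children, with leaves as `None`) and draws a directed edge from the splitting feature of a parent node to that of each child. The helper names `tree_edges`, `aggregate_graph` and `node` are illustrative and are not taken from the forgeNet implementation, which may differ in details such as self-loops or edge direction.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical tree encoding: an internal node is a dict holding the index of
# its splitting feature and its two children; a leaf is None.
Tree = Optional[dict]


def tree_edges(root: Tree) -> set[tuple[int, int]]:
    """Collect directed edges (parent feature -> child feature) of one tree."""
    edges: set[tuple[int, int]] = set()
    if root is None:
        return edges
    for child in (root["left"], root["right"]):
        if child is not None:
            edges.add((root["feature"], child["feature"]))
            edges |= tree_edges(child)
    return edges


def aggregate_graph(forest: list[Tree], n_features: int) -> list[list[int]]:
    """Union the edge sets of all trees (Eq. 2) into a 0/1 adjacency matrix."""
    union: set[tuple[int, int]] = set()
    for tree in forest:
        union |= tree_edges(tree)
    adjacency = [[0] * n_features for _ in range(n_features)]
    for src, dst in union:
        adjacency[src][dst] = 1
    return adjacency


def node(feature: int, left: Tree = None, right: Tree = None) -> dict:
    """Build one internal node of the toy tree encoding."""
    return {"feature": feature, "left": left, "right": right}


# Two toy trees over four features (illustrative only).
forest = [
    node(0, node(1), node(2)),
    node(0, node(2, node(3)), None),
]
G = aggregate_graph(forest, n_features=4)
```

Features that never appear as a split in any tree end up as isolated vertices, which is how the forest acts as a feature selector in this sketch.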
  • Step 2: deep neural network

The feature graph obtained in the previous step is embedded into this part. With the processed features, a graph-embedded deep feedforward network (GEDFN) is trained and then used to classify unknown data12. The layers of GEDFN are defined as follows.

$$ \begin{gathered} Z_{1} = \sigma (X(W_{in} \odot G) + b_{in} ), \hfill \\ \ldots \hfill \\ Z_{k + 1} = \sigma (Z_{k} W_{k} + b_{k} ), \hfill \\ \ldots \hfill \\ Z_{out} = \sigma (Z_{l} W_{l} + b_{l} ), \hfill \\ y = \mathrm{softmax}(Z_{out} W_{out} + b_{out} ). \hfill \\ \end{gathered} $$
(3)

where \(X\) is the input data, \(G\) is the adjacency matrix of the feature graph, \(Z_{k}\) is the \(k\)-th hidden layer, \(\odot\) denotes the Hadamard product, and \(W_{k}\) and \(b_{k}\) are the weights and bias of the \(k\)-th hidden layer, respectively. \(\sigma ( \cdot )\) is an activation function, which could be sigmoid, hyperbolic tangent or a rectifier.
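As an illustration of Eq. (3), the following sketch implements the forward pass with plain Python lists. It assumes a ReLU activation, takes the graph matrix as a 0/1 adjacency matrix, and uses toy weights chosen by hand; the function names are illustrative and no training is shown. Because the first weight matrix is masked element-wise by the graph, it must have the same shape as the graph, so the first hidden layer has one unit per feature.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally shaped matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add_bias(z, b):
    """Add the bias vector b to every row of z."""
    return [[v + bj for v, bj in zip(row, b)] for row in z]


def relu(z):
    """Rectifier activation (one of the choices named in the text)."""
    return [[max(0.0, v) for v in row] for row in z]


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    out = []
    for row in z:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def gedfn_forward(X, G, W_in, b_in, hidden, W_out, b_out):
    """Graph-masked first layer, dense hidden layers, softmax output."""
    Z = relu(add_bias(matmul(X, hadamard(W_in, G)), b_in))
    for W_k, b_k in hidden:
        Z = relu(add_bias(matmul(Z, W_k), b_k))
    return softmax(add_bias(matmul(Z, W_out), b_out))


# Toy example: one sample, three features, two classes (values are arbitrary).
X = [[1.0, 2.0, 3.0]]
G = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
W_in = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
b_in = [0.0, 0.0, 0.0]
W_out = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b_out = [0.0, 0.0]
probs = gedfn_forward(X, G, W_in, b_in, [], W_out, b_out)
```

Weights at positions where the graph has no edge are multiplied by zero, so they never influence the output; this is the sparsity that the feature graph imposes on the first layer.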

Inference algorithm

In order to construct "disease-gene-target-drug" interaction network more accurately, an ALI-related compound identification based on deep learning model and target genes is proposed. The flowchart is depicted in Fig. 3 and the detailed process is given as follows.

  1. Data preparation. Two key target genes, signal transducer and activator of transcription 3 (STAT3) and nuclear transcription factor-κB/p65 (nuclear factor kappa B/p65, RELA), have been shown in the literature to be mainly involved in the key pathways related to acute lung injury (ALI) and to be closely related to ALI40. The BindingDB database (http://www.bindingdb.org/bind/index.jsp) is then searched for the known active compounds of these two key target genes38. The active ligands are screened with the condition IC50 < 5000 nmol L−1, and the collected active compounds are labeled as positive samples. To collect the negative samples, 20% of the active ligands are randomly selected and uploaded to the DUD-E database (http://dude.docking.org/) to generate inactive ligands41. The active and inactive compound sets form the dataset. The structure of each compound is given in Simplified Molecular Input Line Entry System (SMILES) format, so the molecular descriptors and molecular fingerprints of each compound must be computed to serve as its feature vector. In this paper, the RDKit package is used to create the molecular descriptors and molecular fingerprints of each ligand. The molecular descriptors contain 208 features, such as the topological polar surface area (TPSA), number of valence electrons, number of radical electrons, charge information and number of aliphatic carbocycles. The MACCS fingerprints contain 166 molecular characteristic sites, such as atom pairs and topological torsions.

  2. Model training. The feature vector of each ligand in the collected data is used as input for forgeNet. After the training phase, unknown compounds are screened for the target disease.
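The data preparation step can be sketched as follows. The ligand records, their IC50 values and the all-zero feature values are hypothetical placeholders, not BindingDB or RDKit output; only the 5000 nmol L−1 cutoff, the 20% sampling fraction and the 208 + 166 feature layout come from the text. Querying BindingDB, generating decoys with DUD-E and computing real descriptors are outside the scope of this sketch.

```python
import random

IC50_CUTOFF_NM = 5000.0  # activity threshold stated in the text
SEED_FRACTION = 0.20     # share of actives submitted for decoy generation
N_DESCRIPTORS = 208      # molecular descriptors per ligand
N_FINGERPRINT_BITS = 166  # MACCS fingerprint sites per ligand


def select_actives(records):
    """Keep ligands whose measured IC50 (nmol/L) is below the cutoff."""
    return [r for r in records if r["ic50_nm"] < IC50_CUTOFF_NM]


def sample_decoy_seeds(actives, rng):
    """Randomly pick about 20% of the actives (at least one) as decoy seeds."""
    n = max(1, round(SEED_FRACTION * len(actives)))
    return rng.sample(actives, n)


def build_feature_vector(descriptors, fingerprint_bits):
    """Concatenate descriptors and fingerprint bits into one feature vector."""
    return list(descriptors) + list(fingerprint_bits)


# Hypothetical ligand records with made-up IC50 values (1000 ... 10000 nmol/L).
records = [{"id": f"ligand_{i}", "ic50_nm": 1000.0 * i} for i in range(1, 11)]

actives = select_actives(records)
seeds = sample_decoy_seeds(actives, random.Random(0))

# Placeholder feature values standing in for computed descriptors and bits.
features = build_feature_vector([0.0] * N_DESCRIPTORS, [0] * N_FINGERPRINT_BITS)
```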

Figure 3 The flowchart of ALI-related compound identification.

Experiments

In this section, active and inactive ligands of the two key ALI-related target genes, RELA and STAT3, are collected. For RELA, 966 ligands are collected, comprising 146 positive samples and 820 negative samples (Data1). For STAT3, 193 active ligands and 1210 inactive ligands are collected (Data2). The molecular descriptors and molecular fingerprints of each ligand are computed, giving 374 features in total. In order to better reflect the effectiveness of forgeNet, seven classical classifiers (SVM42, RF43, logistic regression (LR), Naive Bayes (NB), XGBoost, LightGBM and gcForest44) are used for comparison in identifying the compounds associated with the disease. Five evaluation criteria of classifier performance are used: sensitivity (SN), specificity (SP), Kappa, Matthews correlation coefficient (MCC) and F1.
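For reference, the five criteria can be computed from the confusion matrix of a binary classifier using their standard definitions, as in the sketch below. The counts in the example are arbitrary and are not results from this paper; the function assumes that both classes occur in the data and that chance agreement is below 1, so no denominator other than the MCC one is guarded.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return SN, SP, Kappa, MCC and F1 from confusion-matrix counts."""
    n = tp + fp + tn + fn
    sn = tp / (tp + fn)  # sensitivity (recall on the positive class)
    sp = tn / (tn + fp)  # specificity (recall on the negative class)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    p_observed = (tp + tn) / n
    # Agreement expected by chance from the marginal totals.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return {"SN": sn, "SP": sp, "Kappa": kappa, "MCC": mcc, "F1": f1}


# Arbitrary example counts (not taken from the tables).
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
```

Kappa and MCC both account for class imbalance, which is why they are informative on datasets such as Data1 and Data2 where negatives far outnumber positives.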

Model test

In order to test the generalization and stability of forgeNet, threefold, fivefold and tenfold cross validation are used. For each cross validation scheme, the experiment is repeated 10 times. The averaged identification performances (Mean ± SD) of the eight methods on Data1 and Data2 under threefold, fivefold and tenfold cross validation are listed in Tables 1, 2 and 3, respectively. Table 1 shows that on Data1 the NB algorithm has the best SN, 0.9111 ± 0.021. In terms of SP, Kappa, MCC and F1, LightGBM performs better than the other seven methods and forgeNet ranks second. On Data2, NB also obtains the highest SN, which shows that it identifies more true ALI-related compounds than the other methods; however, NB also obtains the worst SP, which reveals that it labels most compounds as related. Tables 2 and 3 show that, on both Data1 and Data2, NB again obtains the best SN under fivefold and tenfold cross validation. forgeNet obtains the highest SP, which shows that it correctly identifies more disease-unrelated compounds. Although forgeNet identifies fewer true related compounds than NB, it achieves higher overall accuracy according to MCC. The Kappa values show that the predictions of forgeNet agree better with the actual classes on the unbalanced data. The F1 values show that, on the whole, forgeNet infers the compound-disease network more accurately than the other seven classifiers. The standard deviations of forgeNet also show that its performance is more stable.
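The repeated cross validation protocol can be sketched as below. Two points are assumptions rather than details given in the text: the folds are stratified by class, and the reported SD is taken over the 10 repeats of the fold-averaged score. The `evaluate` callback is a hypothetical stand-in for training a classifier on the training indices and scoring it on the held-out fold; the placeholder used here only returns the positive rate of the fold.

```python
import random
import statistics


def stratified_folds(labels, k, rng):
    """Split sample indices into k folds, preserving the class ratio."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def repeated_cv(labels, k, repeats, evaluate, seed=0):
    """Run k-fold CV `repeats` times; return mean and SD over the repeats."""
    scores = []
    for r in range(repeats):
        rng = random.Random(seed + r)  # a different shuffle for each repeat
        fold_scores = []
        for test_idx in stratified_folds(labels, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            fold_scores.append(evaluate(train_idx, test_idx))
        scores.append(statistics.mean(fold_scores))
    return statistics.mean(scores), statistics.stdev(scores)


# Class sizes of Data1 as given in the text: 146 positives, 820 negatives.
labels = [1] * 146 + [0] * 820


def positive_rate(train_idx, test_idx):
    """Placeholder score: share of positives in the held-out fold."""
    return sum(labels[i] for i in test_idx) / len(test_idx)


mean_score, sd_score = repeated_cv(labels, k=3, repeats=10, evaluate=positive_rate)
```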

Table 1 Identification performances of eight methods with Data1 and Data2 by threefold cross validation method.
Table 2 Identification performances of eight methods with Data1 and Data2 by fivefold cross validation method.
Table 3 Identification performances of eight methods with Data1 and Data2 by tenfold cross validation method.

Receiver operating characteristic (ROC) and Precision-Recall (PR) curves are two important tools for evaluating the performance of a machine learning algorithm. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR). The PR curve plots Precision against Recall. The area under the curve (AUC) is the area enclosed by the ROC or PR curve and the coordinate axes. The PR and ROC curves of the eight methods on Data1 and Data2 under threefold, fivefold and tenfold cross validation are depicted in Figs. 4, 5 and 6, respectively. Figure 4 shows that LightGBM performs best on Data1 in terms of PR and ROC curves, with forgeNet second. On Data2, forgeNet obtains the best ROC and PR curves. In Fig. 5, on Data1, gcForest, LightGBM and forgeNet have similar PR and ROC curves, but the AUC values show that forgeNet performs better than gcForest and LightGBM. On Data2, LightGBM and forgeNet have similar PR and ROC curves. Figure 6 also shows that forgeNet performs better than the other classifiers for compound identification.
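As a worked illustration of the ROC definition, the sketch below sweeps the decision threshold over the predicted scores, records the (FPR, TPR) points and integrates them with the trapezoidal rule. The labels and scores are toy values, not model outputs from this study, and both classes are assumed to be present. A PR curve can be traced in the same way by recording (Recall, Precision) at each threshold.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points obtained by lowering the threshold."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy labels (1 = active) and predicted scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc_auc = auc(roc_points(labels, scores))
```

In this toy case 8 of the 9 positive-negative pairs are ordered correctly, so the ROC AUC equals 8/9.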

Figure 4 PR curves and ROC curves of eight methods with Data1 and Data2 by threefold cross validation method.

Figure 5 PR curves and ROC curves of eight methods with Data1 and Data2 by fivefold cross validation method.

Figure 6 PR curves and ROC curves of eight methods with Data1 and Data2 by tenfold cross validation method.

Compound screening for traditional Chinese medicine prescription

Erhuang decoction (EhD) is a traditional heat-clearing and detoxifying prescription composed of Radix Scutellariae, Rhizoma Coptidis and licorice. Nineteen active chemical compounds (Neoglycyrol, Uralenol, Syringic acid 4-β-d-Glucopyranoside, Gancaonin N, Chrysin-6-C-glucoside-8-C-arabinoside, Chrysin-6-C-arabinoside-8-C-glucoside, Liquiritin, Baicalin, Isomer of Baicalin, Oroxylin A-7-O-β-d-glucuronide, Chrysin-7-O-glucuronide, Isoliquiritin, Wogonoside, Liquiritigenin, Baicalein, Isoliquiritigenin, Wogonin, Oroxylin A, and Glycyrrhetinic acid) in EhD can dock with ALI-related target genes and have high potential biological activity, as reported previously39. Dexamethasone (DXMS) is used as the control drug. Molecular descriptors and molecular fingerprints are again used to obtain the features of these 20 chemical compounds. Data1 and Data2 are each used as a training set to predict the 20 chemical compounds. SVM, RF and gcForest are selected as comparison methods. The prediction ranks are listed in Table 4. The ranking shows that DXMS ranks last by forgeNet on average, which is consistent with the molecular docking results of previous research39. The results thus suggest that forgeNet screens the chemical compounds more accurately than SVM, RF and gcForest. We also analyze the mechanism of action of the highly ranked compounds in the treatment of ALI. Among the highly ranked compounds, Glycyrrhizin has a protective effect on acute lung injury through the activation and increase of Nrf2 nuclear translocation45. Baicalin regulates the inflammatory response of ALI by stimulating regulatory T cells and inhibiting the release of IL6 and interleukin-23, which leads to a decrease in Th17 (T helper cell 17) cells and thereby affects the immune balance between the Th17 and Treg responses46. Baicalein can down-regulate the mRNA expression of STAT3 and STAT4 in the T cell JAK-STAT signaling pathway in order to promote T cell proliferation, and plays an immune and anti-inflammatory role.
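One way to turn per-model predictions into an averaged ranking is sketched below. It assumes that each trained model outputs a probability of activity for every compound and that ranks are averaged over the model trained on Data1 and the one trained on Data2; the exact ranking rule behind Table 4 is not specified in the text. The compound names and probabilities are placeholders, not values from Table 4.

```python
def ranks(scores):
    """Map each compound to its rank (1 = highest predicted probability)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def average_ranks(*score_tables):
    """Average the per-model ranks of every compound."""
    rank_tables = [ranks(table) for table in score_tables]
    return {
        name: sum(t[name] for t in rank_tables) / len(rank_tables)
        for name in rank_tables[0]
    }


# Placeholder probabilities, NOT results from the paper.
scores_data1_model = {"compound_A": 0.91, "compound_B": 0.40, "control": 0.12}
scores_data2_model = {"compound_A": 0.75, "compound_B": 0.80, "control": 0.05}

mean_rank = average_ranks(scores_data1_model, scores_data2_model)
last = max(mean_rank, key=mean_rank.get)  # compound with the worst mean rank
```

Compounds with identical scores keep their input order in this sketch; a real analysis would need an explicit tie-breaking rule.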

Table 4 Prediction ranks of 20 chemical compounds by SVM, RF, gcForest and forgeNet.

Performance test of different feature sets

In order to test the influence of different feature sets on the identification results, we use the molecular descriptors alone as the control feature set, while the molecular descriptors and molecular fingerprints together make up the full feature set. With these two feature sets, SVM, RF, gcForest and forgeNet are evaluated by threefold, fivefold, tenfold and leave-one-out cross validation. The AUC and F1 results are depicted in Figs. 7 and 8, respectively. The results show that the full feature set improves the compound identification accuracy of the methods.

Figure 7 AUC performances of four methods by leave-one-out (a), threefold (b), fivefold (c) and tenfold (d) methods with full feature set (blue) and control feature set (red).

Figure 8 F1 performances of four methods by leave-one-out (a), threefold (b), fivefold (c), and tenfold (d) methods with full feature set (blue) and control feature set (red).

Conclusions

Network pharmacology has become a frontier and hot spot in the field of traditional Chinese medicine research. This research approach can effectively predict the effective components, targets and side effects of drugs, and is conducive to the modernization of traditional Chinese medicine. In order to construct the "disease-gene-target-drug" interaction network more accurately, a forest graph-embedded deep feedforward network is used in this paper to infer the "disease-compound" network. For acute lung injury, two ALI-related target genes (RELA and STAT3) are selected, and the active and inactive compounds of each target gene are collected. Molecular descriptors and molecular fingerprints are used to characterize each compound. Under threefold, fivefold and tenfold cross validation, the experimental results show that forgeNet performs better than SVM, RF, LR, NB, XGBoost, LightGBM and gcForest in terms of SN, SP, Kappa, MCC, F1, AUC, ROC curve and PR curve. forgeNet is also used to identify the 19 compounds in Erhuang decoction (EhD) and Dexamethasone (DXMS), and the results reveal that forgeNet can infer disease-related compounds more accurately. We also test the influence of different feature sets on the identification results and find that the feature set based on both molecular descriptors and molecular fingerprints improves the compound identification accuracy of the methods.

In the future, we will apply the method to prioritize the compounds in TCM prescriptions related to ALI and to other diseases.