A structural deep network embedding model for predicting associations between miRNA and disease based on molecular association network

Previous studies have indicated that miRNAs play an important role in human biological processes, especially in disease. However, constrained by biotechnology, only a small fraction of miRNA-disease associations has been verified by biological experiments. This has prompted more and more researchers to develop efficient, high-precision computational methods for predicting potential miRNA-disease associations. Based on the assumption that molecules are related to each other in human physiological processes, we developed a novel structural deep network embedding model (SDNE-MDA) for predicting miRNA-disease associations using a molecular association network. Specifically, the SDNE-MDA model first integrates miRNA attribute information obtained by the Chaos Game Representation (CGR) algorithm and disease attribute information obtained from disease semantic similarity. Secondly, we extract features by structural deep network embedding from the heterogeneous molecular association network. Then, a comprehensive feature descriptor is constructed by combining attribute information and behavior information. Finally, a Convolutional Neural Network (CNN) is adopted to train on and classify these feature descriptors. In five-fold cross-validation, SDNE-MDA achieved an AUC of 0.9447 with a prediction accuracy of 87.38% on the HMDD v3.0 dataset. To further verify the performance of SDNE-MDA, we compared it with different feature extraction models and classifier models. Moreover, case studies on three important human diseases (Breast Neoplasms, Kidney Neoplasms and Lymphoma) were carried out with the proposed model. As a result, 47, 46 and 46 of the top-50 predicted disease-related miRNAs, respectively, were confirmed by independent databases. These results suggest that SDNE-MDA is a reliable computational tool for predicting potential miRNA-disease associations.


Materials and methods
Benchmark database. The human miRNA-disease association database HMDD v3.0 37 was adopted as the benchmark in this paper; it collects 32,281 confirmed miRNA-disease associations involving 1102 miRNAs and 850 diseases. After data processing, we chose 16,427 known miRNA-disease associations, involving 1023 miRNAs and 850 diseases, as positive samples. We defined the adjacency matrix AM to represent the miRNA-disease associations: when the miRNA mi(a) has a verified association with the disease di(b), we set AM(mi(a), di(b)) = 1; otherwise AM(mi(a), di(b)) = 0. In addition, we introduced two other independent databases (dbDEMC 38 and miR2Disease 39) to verify the results of the case studies.
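As an illustration of the adjacency matrix AM defined above, the following minimal sketch builds AM from a list of verified pairs. The miRNA and disease identifiers and the three pairs are made-up placeholders, not records from HMDD v3.0, and the function name `build_adjacency` is hypothetical.

```python
# Hypothetical toy data: identifiers and pairs are placeholders, not HMDD records.
mirnas = ["mi_1", "mi_2", "mi_3"]
diseases = ["di_1", "di_2"]
verified_pairs = [("mi_1", "di_1"), ("mi_3", "di_1"), ("mi_3", "di_2")]


def build_adjacency(mirnas, diseases, pairs):
    """Return AM as a nested list: AM[a][b] = 1 if (mirnas[a], diseases[b]) is verified."""
    row = {m: a for a, m in enumerate(mirnas)}
    col = {d: b for b, d in enumerate(diseases)}
    am = [[0] * len(diseases) for _ in mirnas]
    for m, d in pairs:
        am[row[m]][col[d]] = 1
    return am


AM = build_adjacency(mirnas, diseases, verified_pairs)
```

Rows index miRNAs and columns index diseases, so the number of ones in AM equals the number of positive samples.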

Molecular associations network.
In this study, we combined multiple kinds of biological molecular information according to the molecular association network (MAN). The MAN is a heterogeneous information network proposed by Guo et al. 40. Currently, this complex network consists of five types of molecules (miRNA, lncRNA, protein, disease, drug) and the associations between them. The MAN provides a new, comprehensive view for exploring complex physiological processes and human disease. The structure of the molecular association network is shown in Fig. 2. In this study, we downloaded the molecules and the associations between them from multiple databases. The number of each type of molecule is shown in Table 1, and the associations between them are shown in Table 2.
Chaos game representation (CGR) algorithm. MiRNA sequences contain a lot of complex information. However, most existing algorithms for extracting sequence feature information quantify only one of position information and nonlinear information. To measure the similarity of the information contained in miRNA sequences comprehensively, we chose chaos game representation (CGR) 50 to quantize both position and nonlinear information, and calculated miRNA sequence similarity by the Pearson correlation coefficient. Firstly, the nucleotides of a miRNA are mapped to positions in Euclidean space iteratively: T_i, the position of the i-th nucleotide, is determined by the position of the previous nucleotide T_{i−1} and the nucleotide coefficient G_i. In this paper, the contribution parameter c is equal to 0.5 and T_0 is (0.5, 0.5). Secondly, we divided the CGR space into 64 subspaces as shown in Fig. 3. The attribute information of each subspace SS_i is represented by integrating the position information X_i, Y_i and the nonlinear information Z_i, where num_i is the number of points in subspace SS_i.
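The CGR steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact procedure of the paper: the corner assigned to each nucleotide, the update rule T_i = T_{i−1} + c (G_i − T_{i−1}), and the use of the raw counts num_i as the per-subspace descriptor (instead of the combined X_i, Y_i, Z_i information) are all assumptions, and the two sequences are arbitrary strings.

```python
import math

# Assumed corner assignment for the four nucleotides (a common CGR convention,
# not taken from the paper).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "U": (1.0, 0.0)}


def cgr_points(seq, c=0.5, start=(0.5, 0.5)):
    """Map a sequence to CGR points with T_i = T_{i-1} + c * (G_i - T_{i-1})."""
    x, y = start
    points = []
    for nt in seq.upper().replace("T", "U"):
        gx, gy = CORNERS[nt]
        x, y = x + c * (gx - x), y + c * (gy - y)
        points.append((x, y))
    return points


def subspace_counts(points, grid=8):
    """Count the points (num_i) in each of grid * grid subspaces, row-major."""
    counts = [0] * (grid * grid)
    for x, y in points:
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        counts[row * grid + col] += 1
    return counts


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length, non-constant vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


# Two arbitrary sequences used only for illustration.
d1 = subspace_counts(cgr_points("ACGUACGUUGCA"))
d2 = subspace_counts(cgr_points("ACGUACGAUGCA"))
similarity = pearson(d1, d2)
```

With an 8 × 8 grid, each sequence yields a 64-dimensional descriptor, and the similarity of two miRNAs is the correlation of their descriptors.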
Finally, the sequence information of each miRNA can be represented by the descriptor m(i), and we calculate the sequence similarity M_sim(m(i), m(j)) by the Pearson correlation coefficient.
Disease semantic similarity. The semantic contribution of each term in the directed acyclic graph (DAG) of a disease D is defined following a previous study 52, where ϑ is the semantic contribution parameter, set to 0.5 as in that study. The semantic value DV(D) of D is then calculated from these contributions. Based on the assumption that two diseases should have higher similarity if their DAGs share more parts, the similarity of the diseases d(a) and d(b) is calculated accordingly.
Structural deep network embedding. Since many existing network embedding algorithms cannot preserve the high-order proximity of large-scale networks, this paper adopted structural deep network embedding (SDNE) to extract the behavior information of miRNAs and diseases. Many existing network embedding models are shallow models (e.g. Laplacian Eigenmaps 53, Graph Factorization 54), which cannot effectively capture the highly non-linear structural information of a network. SDNE is a semi-supervised model for network embedding. In the supervised part, first-order similarity based on the Laplacian matrix is used to preserve local network information. In the unsupervised part, SDNE uses a deep autoencoder to model second-order similarity and preserve global network information. Therefore, the loss function of SDNE is divided into two parts, i.e. the Laplacian matrix model and the deep autoencoder model.
First-order similarity. To make adjacent nodes of the graph closer in the latent space, a loss function for first-order similarity is defined, in which s_{i,j} is the (i, j) entry of the adjacency matrix of the heterogeneous information network and y_i^(k) is the representation of node i at the k-th layer.
Figure 3. The CGR of hsa-mir-3976 plotted in 8 × 8 subspaces and the matrix of its nucleotides with probabilities for chaos game representation.
Second-order similarity. To capture global structure information, SDNE constructs a deep autoencoder. Any given x_i is converted into the latent representation of the k-th layer using W^(k), the weight matrix of the k-th layer, and b^(k), the corresponding bias parameter. Since the optimization goal of the autoencoder is to reduce the reconstruction error between input and output, the loss function is defined on this reconstruction error. The adjacency matrices are often very sparse, which means that zero elements far outnumber non-zero elements. Therefore, the loss function is modified with an element-wise weighting, where ⊙ is the Hadamard product (multiplying the corresponding elements).
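The two loss terms described above can be sketched as follows. The sketch follows the original SDNE formulation as commonly stated and is an assumption about the exact form used here: the first-order loss sums s_{i,j} ||y_i − y_j||² over node pairs, and the second-order loss weights the reconstruction error with a penalty matrix B whose entries are `beta` > 1 for non-zero inputs and 1 otherwise. The name `beta`, the matrix B and all numeric values are illustrative and do not come from this text.

```python
def first_order_loss(S, Y):
    """Sum over node pairs of s_ij * ||y_i - y_j||^2 (pulls linked nodes together)."""
    n = len(S)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if S[i][j]:
                total += S[i][j] * sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
    return total


def second_order_loss(X, X_hat, beta=5.0):
    """Weighted reconstruction error: sum of ((x_hat - x) * b)^2, b = beta if x != 0 else 1."""
    total = 0.0
    for x_row, xh_row in zip(X, X_hat):
        for x, xh in zip(x_row, xh_row):
            b = beta if x != 0 else 1.0   # Hadamard weighting, element by element
            total += ((xh - x) * b) ** 2
    return total


# Toy 3-node graph (placeholder values): node 0 is linked to nodes 1 and 2.
S = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
Y = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]                      # assumed latent vectors
X_hat = [[0.0, 0.5, 1.0], [1.0, 0.0, 0.5], [1.0, 0.0, 0.0]]   # assumed reconstruction

l_first = first_order_loss(S, Y)
l_second = second_order_loss(S, X_hat, beta=2.0)
```

Because errors on existing links are multiplied by `beta`, the autoencoder is discouraged from reconstructing a sparse adjacency row as all zeros.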
Integrating the first-order and second-order similarity, the final loss function of SDNE combines the two losses with a regularization term L_reg, where α is a parameter that controls the loss of the first-order similarity.
Integration of feature information. In this study, we first obtained the miRNA sequence similarity and the disease semantic similarity and converted them into attribute feature information M_sim(i) and D_sim(j) of the same dimension by a stacked autoencoder. The dimension of M_sim(i) and D_sim(j) is 64. Then, the behavior feature information of miRNAs M_b(i) and diseases D_b(j) was extracted by structural deep network embedding based on the molecular association network. The dimension of M_b(i) and D_b(j) is 128. Finally, a complete sample feature descriptor is constructed by fusing the above information for the miRNA-disease pairs in the HMDD v3.0 database. The feature descriptor FD_{i,j} is a 384-dimensional vector (64 + 128 + 64 + 128).
Convolutional neural network algorithm. A convolutional neural network (CNN) is a deep feedforward neural network built on convolution operations. Through its layered structure and representation learning capability, a CNN can classify input information in a shift-invariant manner. CNNs have been successfully used in bioinformatics 55. Therefore, in this paper, we adopted a CNN to train on and predict potential miRNA-disease associations. Specifically, the CNN has a multi-layer structure comprising an input layer, convolutional layers, a pooling layer, a fully-connected layer and an output layer, as shown in Fig. 4. The input is a matrix of all feature descriptors FD_{i,j} with size 26284 × 384. Two convolutional layers, C1 and C2, use 32 filters and 64 filters, respectively, each with a 3 × 1 convolution kernel. We adopted max-pooling with a 2 × 1 kernel to subsample C2. After repeated convolution and pooling, the CNN classifies the features from the fully-connected layer and outputs the probability distribution.
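The dimensions quoted above can be checked with a short sketch. The concatenation order of the four vectors, unpadded convolutions with stride 1, and a pooling stride of 2 are assumptions made for illustration; the text gives only the vector sizes, filter counts and kernel sizes. The helper names are hypothetical.

```python
def feature_descriptor(m_sim, m_b, d_sim, d_b):
    """Concatenate attribute and behavior vectors of one miRNA-disease pair."""
    return list(m_sim) + list(m_b) + list(d_sim) + list(d_b)


def conv_out_len(n, kernel, stride=1):
    """Output length of an unpadded 1-D convolution or pooling window."""
    return (n - kernel) // stride + 1


# Placeholder vectors with the dimensions given in the text (64 and 128).
fd = feature_descriptor([0.0] * 64, [0.0] * 128, [0.0] * 64, [0.0] * 128)

n = len(fd)                                # 64 + 128 + 64 + 128
n = conv_out_len(n, kernel=3)              # after C1 (32 filters, 3 x 1 kernel)
n = conv_out_len(n, kernel=3)              # after C2 (64 filters, 3 x 1 kernel)
n = conv_out_len(n, kernel=2, stride=2)    # after 2 x 1 max-pooling
flattened = n * 64                         # values passed to the fully-connected layer
```

Under a different padding or stride choice the intermediate lengths change, but the 384-dimensional input does not.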

Results and discussion
Performance evaluation. In this experiment, we used five-fold cross-validation to evaluate the performance of the proposed model on HMDD v3.0 37. The known miRNA-disease pairs were randomly split into five disjoint subsets. In each round of cross-validation, one of the five subsets was used as the test set and the remaining subsets as the training set. To avoid leakage of test data, we constructed the heterogeneous information network from the training data only and extracted the behavior information from it. A range of evaluation criteria was used to assess SDNE-MDA, including accuracy (Acc.), sensitivity (Sen.), specificity (Spec.), precision (Prec.), Matthews correlation coefficient (MCC) and area under the curve (AUC). The average Acc., Sen., Spec., Prec., MCC and AUC were 87.38%, 87.28%, 87.47%, 87.45%, 74.76% and 0.9447, with standard deviations of 0.44%, 0.93%, 1.01%, 0.82%, 0.88% and 0.0027, respectively, as shown in Table 3. In addition, the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve of SDNE-MDA on HMDD are shown in Fig. 5.
As shown in Table 4, the accuracy of SDNE-MDA is 7.78% and 3.43% higher than that of SDNE-MDA_AI and SDNE-MDA_BI, respectively. In addition, the AUC of the proposed model is 0.0811 and 0.0260 higher than that of SDNE-MDA_AI and SDNE-MDA_BI, respectively. The ROC curves and PR curves of the three experiments are shown in Fig. 6. These results indicate that integrating the two kinds of information to represent each node achieves better performance.
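The threshold-based evaluation criteria listed under Performance evaluation can be computed from the confusion counts of a test fold, as in the minimal sketch below. These are the standard definitions; the counts are placeholders and are not taken from Table 3, and the function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity, precision and MCC from confusion counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sen = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"acc": acc, "sen": sen, "spec": spec, "prec": prec, "mcc": mcc}


# Placeholder confusion counts for one test fold (not taken from Table 3).
m = binary_metrics(tp=40, fn=10, tn=45, fp=5)
```

In five-fold cross-validation, each metric would be computed per fold and then averaged, which is how a mean and a standard deviation per criterion arise.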
Comparison with different classifier models. In this study, a CNN was adopted to train on and identify potential relationships between miRNAs and diseases. To further evaluate SDNE-MDA, we compared the proposed model with Bagging, Logistic Regression, Naive Bayes, AdaBoost and MLP classifiers. In this experiment, five-fold cross-validation was carried out for each classifier on HMDD v3.0. The proposed model yielded an average AUC of 0.9447 and outperformed Bagging (0.8998), Logistic Regression (0.9270), Naive Bayes (0.8881), AdaBoost (0.9226) and MLP (0.9320). The AUC of the CNN is 0.0259 higher than the mean AUC of all five models, and its accuracy is 1.60% higher than that of the second-best method. The detailed results of the comparison between SDNE-MDA and the other classifier models are shown in Table 5, and the ROC curves are shown in Fig. 7. Therefore, among the classifiers compared, the CNN is the best choice for the proposed model for predicting potential miRNA-disease associations.
As shown in Table 6, the AUC of the proposed method is 0.0399 higher than the average AUC of all algorithms and 0.0275 higher than that of the second-best method. This is mainly because SDNE-MDA integrates two types of information on miRNAs and diseases and extracts features more comprehensively. Therefore, the proposed model is an effective and reliable computational tool for predicting potential miRNA-disease associations.
Case studies. To further evaluate the predictive ability of SDNE-MDA, we carried out case studies on three significant human diseases (Breast Neoplasms, Kidney Neoplasms and Lymphoma). The known miRNA-disease associations in the HMDD v3.0 database were used as the training set. To avoid overlap between the training data and the prediction list, the test set consisted of the unknown pairs between the three diseases and all possible miRNAs. As a result, 47, 46 and 46 of the top-50 candidate miRNAs, respectively, were confirmed by independent databases. Therefore, SDNE-MDA is a feasible and reliable model for predicting potential relationships between miRNAs and diseases.
Breast Neoplasms is the most common neoplasm in women, and the risk of breast cancer is up to 13% in the United States. Although men may also develop breast cancer, 99% of patients are women. There were approximately 276,480 new cases in women and 42,170 deaths from breast cancer in 2020 60. In the past few years, studies have indicated that the expression level of miRNAs has a strong impact on the growth and division of breast tumor cells 61. Therefore, we carried out a case study of Breast Neoplasms-miRNA associations with SDNE-MDA. In the prediction list shown in Table 7, 47 of the top 50 predicted Breast Neoplasms-related miRNAs were verified by independent databases.
Kidney Neoplasms is a cancer with a higher incidence in adults 60. In the past few years, the morbidity and mortality of kidney neoplasms have been increasing. In 2020, there were about 73,750 new cases of kidney neoplasms in the United States (about 45,520 in men and about 28,230 in women) and about 14,830 deaths from this cancer (9860 men and 4970 women). Recently, a growing number of studies have indicated that miRNAs are related to kidney neoplasms 62. Thus, we took Kidney Neoplasms as a case study for SDNE-MDA and prioritized the candidate miRNAs. In the prediction list shown in Table 8, 46 of the top-50 potential kidney neoplasms-related miRNAs were confirmed by independent databases.
Lymphoma is one of the most common malignant cancers (~ 4% of all new cancers), especially in teenagers, in the United States 60. Lymphoma mainly comprises two types, Hodgkin Lymphoma (HL) and non-Hodgkin Lymphoma (NHL). In 2020, there were an estimated 85,720 new cases of Lymphoma (47,070 in men and 38,650 in women) and 20,910 deaths from HL and NHL (12,030 in men and 8,880 in women). Therefore, we applied SDNE-MDA to prioritize possible miRNAs for Lymphoma based on HMDD v3.0. As shown in Table 9, 46 of the top 50 predicted Lymphoma candidate miRNAs were verified by independent databases.
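The case-study procedure used for the three diseases (score the unknown pairs, keep the top-ranked candidates, and check them against independent databases) can be sketched as below. The identifiers, scores and function name are hypothetical placeholders; they do not reproduce any entry of Tables 7 to 9.

```python
def top_k_confirmed(scores, known, confirmed, k=50):
    """Rank unknown miRNAs by predicted score and count those confirmed externally."""
    candidates = [(m, s) for m, s in scores.items() if m not in known]
    ranked = sorted(candidates, key=lambda ms: ms[1], reverse=True)[:k]
    hits = [m for m, _ in ranked if m in confirmed]
    return ranked, len(hits)


# Placeholder identifiers and scores, for illustration only.
scores = {"mi_1": 0.91, "mi_2": 0.15, "mi_3": 0.78, "mi_4": 0.66, "mi_5": 0.40}
known = {"mi_1"}                # already associated with the disease in the training data
confirmed = {"mi_3", "mi_5"}    # found in an independent database

ranked, n_hits = top_k_confirmed(scores, known, confirmed, k=3)
```

Excluding the known associations before ranking is what keeps the prediction list disjoint from the training data.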

Conclusion
In the past few years, a growing body of research has demonstrated that miRNAs are closely linked with diseases. Various biological experiments and computational methods have been devoted to identifying these associations. In this paper, we proposed a structural deep network embedding-based model, SDNE-MDA, to predict miRNA-disease associations. This model constructs a complex network, MAN, by fusing miRNAs, diseases and three related types of molecules (lncRNA, drug and protein) together with their relationships. Through this comprehensive heterogeneous information network, potential miRNA-disease associations can be predicted more accurately and efficiently. A CNN is used to train on and classify the potential miRNA-disease associations. Compared with other classifiers and feature extraction models, SDNE-MDA showed outstanding performance. In addition, case studies on three significant human diseases were carried out to further validate the performance of SDNE-MDA. As a result, 47, 46 and 46 of the top-50 predicted miRNAs were confirmed by independent databases. These results demonstrate that SDNE-MDA is a reliable computational tool for predicting miRNA-disease associations.