SAEROF: an ensemble approach for large-scale drug-disease association prediction by incorporating rotation forest and sparse autoencoder deep neural network

Drug-disease associations are an important source of information at every stage of drug repositioning. Although the number of drug-disease associations identified by high-throughput technologies is increasing, experimental methods are time-consuming and expensive. As a supplement to them, many computational methods have been developed for accurate in silico prediction of new drug-disease associations. In this work, we present a novel computational model combining a sparse auto-encoder and rotation forest (SAEROF) to predict drug-disease associations. Gaussian interaction profile kernel similarity, drug structure similarity and disease semantic similarity were extracted to explore the associations between drugs and diseases. On this basis, a rotation forest classifier based on a sparse auto-encoder is proposed to predict the associations between drugs and diseases. To evaluate the performance of the proposed model, we carried out 10-fold cross-validation on two gold-standard datasets, Fdataset and Cdataset. The proposed model achieved AUCs (Area Under the ROC Curve) of 0.9092 on Fdataset and 0.9323 on Cdataset. For performance evaluation, we compared SAEROF with the state-of-the-art support vector machine (SVM) classifier and some existing computational models. Three human diseases (Obesity, Stomach Neoplasms and Lung Neoplasms) were explored in case studies. For each disease, more than half of the top 20 predicted drugs were confirmed by the Comparative Toxicogenomics Database (CTD). These results indicate that the model is a feasible and effective method for predicting drug-disease associations, and that it outperforms the existing methods it was compared with.

The average cost of a successful new drug is estimated at more than $1 billion, and the process takes nearly a decade. Drug repositioning can uncover new therapeutic uses for both marketed and unlisted compounds, thereby shortening the cycle and reducing the cost of drug development. Drug repositioning, also known as finding new uses for old drugs, refers to the process of expanding indications and discovering new targets through further research on drugs that are already on the market. Drug-disease associations are an important theoretical basis for drug repositioning, so the prediction of new drug-disease associations has attracted increasing attention from researchers. In addition to experimental methods, computational methods for discovering new drug-disease associations can lead to further cost savings.
Some researchers have published computational models of drug repositioning based on deep learning techniques. For example, Lu et al. used regularized kernel classifiers to construct drug and disease predictions 1 . Liang et al. used a Laplacian regularization algorithm for sparse subspaces to construct a drug repositioning prediction model, LRSSL 2 . The method incorporates information such as medicinal chemistry information and drug targets. Wu et al. proposed a semi-supervised graph cut algorithm, called SSGC, which finds the optimal graph cut to identify potential drug-disease associations 3 .
In most computational methods for predicting drug-disease associations, the feature extraction module and the classification module are constructed separately. Effective feature extraction methods can help to improve prediction accuracy 4 . Similarities between drugs and between diseases are commonly constructed, because they are considered important for describing the patterns of drug-disease associations. The first consideration is how to express the features of a particular drug or disease. Based on the consideration of multiple features, different feature extraction methods have been proposed. For example, when DR2DI describes the similarity of diseases, it uses the information content of the disease Medical Subject Headings (MeSH) descriptors and their corresponding Directed Acyclic Graphs (DAGs) 5 . In addition to the commonly used machine learning methods for extracting features, sparse auto-encoders have recently received attention. For example, Deng et al. applied a sparse auto-encoder to speech emotion recognition 6 . Su et al. trained neural networks to capture the internal structure of the human body 7 . In recent years, with the development of auto-encoders and other types of deep learning technology, feature extraction methods based on deep learning have gained increasing research attention. Feature dimension reduction can effectively extract useful features: an auto-encoder maps the raw features into a low-dimensional space in which the relations between drugs and diseases can be measured more effectively. In our model, we propose a feature extraction method combining a sparse auto-encoder and PCA to learn the feature representation of drugs and diseases. The sparse auto-encoder is a variant of the basic auto-encoder that adds a sparse penalty term to the conventional auto-encoder.
In this study, we propose a computational model that combines a sparse auto-encoder with the rotation forest. Taking multiple features into account, we use a fusion method to obtain the combined features. A feature extraction module based on a sparse auto-encoder and Principal Component Analysis (PCA) is established, and the sparse auto-encoder learns the final feature representation from the combined features. Considering that an ensemble classifier normally yields more stable prediction results than a single classifier, we adopt the rotation forest to process the features extracted by the sparse auto-encoder for the final prediction. The outputs of the rotation forest are probability scores that each drug-disease pair is associated. The drug-disease pairs with the highest prediction scores are considered the most likely to be associated among all testing samples.
The results of the SAEROF model under 10-fold cross-validation on Fdataset and Cdataset were compared with those of two state-of-the-art drug repositioning prediction models. The results show that the SAEROF model has better performance. In addition, case studies were conducted on three human diseases: Obesity, Stomach Neoplasms and Lung Neoplasms. Of the top 20 candidates predicted by SAEROF for each disease, more than 10 were validated in the CTD database 8 (Obesity 17/20, Stomach Neoplasms 16/20, Lung Neoplasms 16/20).

Materials and Methods
In this section, the proposed model is introduced. First, we describe the datasets used. Second, we explain how the datasets are used to calculate similarities between drugs and between diseases. Last, the results of the cross-validation rotation forest experiment are given. Figure 1 is a flow chart of the SAEROF model for predicting potential drug-disease associations. First, two kinds of drug similarity and two kinds of disease similarity are calculated. Then, the feature matrix is obtained from the drug and disease similarities.
We introduce the two kinds of drug similarity and the two kinds of disease similarity in this section. Drug structure similarity is calculated from the chemical structure of the drug. The simplified molecular-input line-entry system (SMILES) is a notation that describes the structure of a molecule as a short text string; the SMILES string of each drug is downloaded from DrugBank 14 . Chemical similarity toolkits were used to calculate the similarity between two drugs 15 . Similarities that provide no predictive information are converted to values close to 0. Next, drugs are grouped based on existing drug-disease relationships. We adjust the similarity by applying the logistic function. Using the above method, the drug similarity $DE_r$ can be obtained. We then established a new weighted drug-sharing network (as shown in Fig. 2), in which the nodes are drugs and the number of diseases shared by a drug pair gives the edge weight.
In the SAEROF model, we use ClusterONE 16 to identify clusters. The cohesion of a cluster V is defined as:
$$f(V) = \frac{W_{in}(V)}{W_{in}(V) + W_{bound}(V) + P(V)}$$
where $W_{in}(V)$ represents the total weight of the edges inside V, $W_{bound}(V)$ represents the total weight of the edges connecting the vertex set V to the rest of the network, and $P(V)$ represents a penalty term. Assuming that drugs $r_i$ and $r_j$ belong to the same cluster V, the drug structure similarity DE between $r_i$ and $r_j$ is defined by Eq. 3. It is worth noting that if the structural similarity between two drugs is not less than 1, it is replaced with 0.99 10 .
Directed acyclic graphs (DAGs) can be used to describe the semantic similarity of diseases. They can be downloaded from the Medical Subject Headings (MeSH) database, the comprehensive controlled vocabulary of the National Library of Medicine 17 . Here ψ is the semantic effect parameter, which scales the contribution of the terms in the DAG of a disease b. The higher the proportion of the DAGs shared by two diseases, the higher their similarity, and a semantic similarity score is computed on this basis for each pair of diseases f(i) and f(j). Next, the disease semantic similarity is refined with the same procedure used for drug structure similarity: the similarity is adjusted by analyzing the drug-disease associations, and ClusterONE is then used to cluster the diseases to obtain the comprehensive disease similarity DS.
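The cohesion score used for clustering can be illustrated with a small numerical sketch. The code below is not the ClusterONE implementation; it only assumes that edge weights are the numbers of diseases shared by two drugs and that the penalty $P(V)$ is passed in as a plain number. The function names `shared_disease_weights` and `cohesion` and the toy matrix are hypothetical.

```python
import numpy as np

def shared_disease_weights(A):
    """Drug-drug edge weights = number of shared diseases.

    A is a binary association matrix with rows = diseases and
    columns = drugs (the orientation used for matrix A in the text).
    """
    A = np.asarray(A, dtype=float)
    W = A.T @ A                 # entry (i, j) counts diseases shared by drugs i and j
    np.fill_diagonal(W, 0.0)    # no self-loops
    return W

def cohesion(W, cluster, penalty=0.0):
    """Cohesion f(V) = W_in / (W_in + W_bound + P(V)) of a vertex set V."""
    inside = np.zeros(W.shape[0], dtype=bool)
    inside[list(cluster)] = True
    # Each undirected internal edge appears twice in the symmetric block.
    w_in = W[np.ix_(inside, inside)].sum() / 2.0
    # Edges that leave the cluster.
    w_bound = W[np.ix_(inside, ~inside)].sum()
    denom = w_in + w_bound + penalty
    return w_in / denom if denom > 0 else 0.0

# Toy example: 3 diseases (rows) x 4 drugs (columns).
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]])
W = shared_disease_weights(A)
f_V = cohesion(W, cluster=[0, 1])
```

In this toy network, drugs 0 and 1 share two diseases and each shares one disease with drug 2, so the internal weight and the boundary weight of the cluster {0, 1} are both 2.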
Define the adjacency matrix A, where the columns represent drugs and the rows represent diseases. The i-th column vector of the adjacency matrix A is the binary vector $V(g(i))$, the association profile of drug $g(i)$. The Gaussian interaction profile kernel of drug $g(i)$ and drug $g(j)$ is calculated as 18 :
$$GE(g(i), g(j)) = \exp\left(-\theta_g \lVert V(g(i)) - V(g(j)) \rVert^2\right) \quad (7)$$
$$\theta_g = \theta'_g \Big/ \left(\frac{1}{n_g}\sum_{i=1}^{n_g} \lVert V(g(i)) \rVert^2\right)$$
where $n_g$ is the number of drugs, and the parameter $\theta_g$ adjusts the kernel bandwidth and is obtained by normalizing the original parameter $\theta'_g$. Similar to the calculation of drug similarity, the disease Gaussian interaction profile kernel similarity is:
$$GD(d(i), d(j)) = \exp\left(-\partial_d \lVert V(d(i)) - V(d(j)) \rVert^2\right) \quad (9)$$
$$\partial_d = \partial'_d \Big/ \left(\frac{1}{n_d}\sum_{i=1}^{n_d} \lVert V(d(i)) \rVert^2\right)$$
where $n_d$ is the number of diseases, and $V(d(i))$ (or $V(d(j))$) represents the association profile of disease $d(i)$ (or $d(j)$), which records whether $d(i)$ (or $d(j)$) is associated with each of the drugs and is equivalent to the i-th (or j-th) row vector of the adjacency matrix A. The parameter $\partial_d$ adjusts the kernel bandwidth and is obtained by normalizing the original parameter $\partial'_d$. The values of $\theta'_g$ and $\partial'_d$ are set to 0.5 for simplicity.
Feature fusion. In this section, descriptors from multiple data sources are integrated to predict drug-disease associations. The data set contains some unknown drug-disease associations, for which the corresponding Gaussian interaction profile kernel is 0. To solve this problem, we fuse the structural similarity of drugs and the semantic similarity of diseases. This solution reflects the related characteristics of diseases and drugs from different perspectives. Drug structure similarity DE (Eq. 3) is filled into the drug Gaussian interaction profile kernel similarity GE (Eq. 7) to form the drug similarity matrix $SIM_{drug}$. The drug similarity $SIM_{drug}(g(i), g(j))$ for drug $g(i)$ and drug $g(j)$ is as follows:
$$SIM_{drug}(g(i), g(j)) = \begin{cases} GE(g(i), g(j)) & \text{if } g(i) \text{ and } g(j) \text{ have Gaussian interaction profile kernel similarity} \\ DE(g(i), g(j)) & \text{otherwise} \end{cases}$$
For the similarity of diseases, disease semantic similarity DS is filled into the disease Gaussian interaction profile kernel similarity GD (Eq. 9). The formula is:
$$SIM_{disease}(d(i), d(j)) = \begin{cases} GD(d(i), d(j)) & \text{if } d(i) \text{ and } d(j) \text{ have Gaussian interaction profile kernel similarity} \\ DS(d(i), d(j)) & \text{otherwise} \end{cases}$$
Feature extraction based on SAEROF. In recent years, bioinformatics has paid great attention to the application of deep learning, which is widely used as an effective learning strategy. As an unsupervised neural network model, the autoencoder can learn the hidden features of the input samples. Its basic structure is shown in Fig. 3. However, a plain autoencoder does not always extract useful features effectively. To address this problem, the sparse autoencoder (SAE) was proposed, which introduces a sparse penalty term to learn relatively sparse features.
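Looking back at the similarity construction in the feature fusion subsection, the Gaussian interaction profile kernel and the fill-in rule can be sketched in a few lines. This is a minimal illustration rather than the authors' code. It assumes that a drug "has" a kernel similarity when its association profile is non-empty, which is one possible reading of the fusion rule, and it assumes at least one known association so that the normalizer is non-zero. The names `gip_kernel` and `fuse` and the constant fallback matrix `DE` are hypothetical.

```python
import numpy as np

def gip_kernel(profiles, bandwidth=0.5):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    bandwidth plays the role of the original parameter (0.5 in the text);
    it is normalised by the mean squared norm of the profiles.
    """
    P = np.asarray(profiles, dtype=float)
    sq_norms = (P ** 2).sum(axis=1)
    theta = bandwidth / sq_norms.mean()
    # Squared Euclidean distance between every pair of profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (P @ P.T)
    return np.exp(-theta * np.maximum(sq_dist, 0.0))

def fuse(gip, fallback, has_profile):
    """Keep the kernel value where both entities have a profile, else the fallback."""
    has_profile = np.asarray(has_profile, dtype=bool)
    both = has_profile[:, None] & has_profile[None, :]
    return np.where(both, gip, fallback)

# Toy adjacency matrix: rows = diseases, columns = drugs.
A = np.array([[1, 0, 0],
              [1, 1, 0]])
drug_profiles = A.T                      # one row per drug
GE = gip_kernel(drug_profiles)
DE = np.full((3, 3), 0.3)                # placeholder structure similarity
SIM_drug = fuse(GE, DE, drug_profiles.sum(axis=1) > 0)
```

The same two functions apply to diseases by passing the rows of A as profiles and DS as the fallback.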
SAE is a three-layered symmetric neural network. The sigmoid function $\sigma(x) = 1/(1 + e^{-x})$ is selected as the activation function of the network. The encoder function maps the input layer x to the hidden layer h:
$$h = \sigma(W_{encoder}\, x + b_{encoder})$$
The decoder function is:
$$\hat{x} = \sigma(W_{decoder}\, h + b_{decoder})$$
where W represents the connection parameters between two layers and b is an offset. A sparsity penalty is added to the objective function of the auto-encoder to obtain valid features. Suppose $a_t(x_j)$ denotes the activation of hidden unit t for the j-th training example. The average activation of hidden unit t over the n training examples is:
$$\hat{\rho}_t = \frac{1}{n}\sum_{j=1}^{n} a_t(x_j)$$
The sparse term added to the objective function penalizes $\hat{\rho}_t$ if it deviates significantly from ρ. The penalty term is expressed as:
$$P_{penalty} = \sum_{t=1}^{S_2} KL(\rho \,\|\, \hat{\rho}_t)$$
where $S_2$ is the number of neurons in the hidden layer and ρ is a sparsity parameter, usually a small value close to zero. There is a weight attached to the penalty, which is 10e…. $KL(\rho \,\|\, \hat{\rho}_t)$ is the relative entropy between two Bernoulli random variables with mean ρ and mean $\hat{\rho}_t$ 19 :
$$KL(\rho \,\|\, \hat{\rho}_t) = \rho \log\frac{\rho}{\hat{\rho}_t} + (1 - \rho)\log\frac{1 - \rho}{1 - \hat{\rho}_t}$$
Relative entropy is a standard measure of the difference between two distributions.
This penalty function has the property that $KL(\rho \,\|\, \hat{\rho}_t) = 0$ if $\hat{\rho}_t = \rho$. Otherwise, it increases monotonically as $\hat{\rho}_t$ diverges from ρ, which acts as the sparsity constraint. The cost function with the sparse penalty term added is defined as:
$$J_{sparse}(W, b) = J(W, b) + \gamma \sum_{t=1}^{S_2} KL(\rho \,\|\, \hat{\rho}_t) \quad (15)$$
where $J(W, b)$ is the cost function of the neural network and γ is the weight of the sparse penalty. As shown in Eq. 15, the cost function is minimized with respect to W and b. This can be done with the backpropagation algorithm, where stochastic gradient descent is used for training. The parameters W and b are updated in each iteration as follows:
$$W \leftarrow W - \sigma \frac{\partial J_{sparse}(W, b)}{\partial W}, \qquad b \leftarrow b - \sigma \frac{\partial J_{sparse}(W, b)}{\partial b}$$
where σ in these update rules represents the learning rate rather than the activation function. The average activation is calculated through a forward pass over all training examples to obtain the sparse error. To optimize the hyperparameters of our model 20 , we varied the hidden dimension from 10 to 200. We found that the performance is robust to this setting when the dimension is higher than 50 21 . Specifically, the performance is highest within the interval [95, 105]. Therefore, the dimension of the hidden layer was set to 100. For Fdataset, the input layer has 906 dimensions and the output has 100 dimensions. For Cdataset, the input layer has 1072 dimensions and the output has 100 dimensions. We used a single-layer sparse auto-encoder. To reduce the computational cost of the classifier, we used the 100-dimensional bottleneck hidden layer as the output. The learning rate is adapted during optimization by the Adadelta algorithm.
Dimensionality reduction is a data preprocessing technique that is usually applied before the data are passed to other algorithms. It can remove redundant information and noise, making the data simpler and more efficient to process, which improves processing speed and saves time and cost.
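To make the sparse cost function concrete, the following sketch evaluates it for a single-hidden-layer auto-encoder with sigmoid activations. It is an illustration under stated assumptions, not the training code used for SAEROF: the reconstruction cost is taken to be a mean squared error, the values `rho=0.05` and `gamma=1e-3` are placeholders (the weight used in the paper is not recoverable from the text), and the toy layer sizes are far smaller than the 906 or 1072 inputs and 100 hidden units described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_sparsity(rho, rho_hat):
    """Sum over hidden units of KL(rho || rho_hat_t) for Bernoulli means."""
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)   # avoid log(0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sparse_ae_cost(X, W_enc, b_enc, W_dec, b_dec, rho=0.05, gamma=1e-3):
    """Reconstruction cost plus gamma times the sparsity penalty.

    X has one training example per row. Returns (cost, hidden activations).
    """
    H = sigmoid(X @ W_enc + b_enc)               # encoder
    X_rec = sigmoid(H @ W_dec + b_dec)           # decoder
    recon = 0.5 * np.mean(np.sum((X_rec - X) ** 2, axis=1))
    rho_hat = H.mean(axis=0)                     # average activation per hidden unit
    return recon + gamma * kl_sparsity(rho, rho_hat), H

# Toy sizes for illustration only.
rng = np.random.default_rng(0)
X = rng.random((6, 12))
W_enc = rng.normal(scale=0.1, size=(12, 4))
W_dec = rng.normal(scale=0.1, size=(4, 12))
cost, H = sparse_ae_cost(X, W_enc, np.zeros(4), W_dec, np.zeros(12))
```

The matrix `H` is the bottleneck representation that would be passed on to the next stage. Gradients with respect to W and b would normally be obtained by backpropagation or automatic differentiation, which is omitted here.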
Principal Component Analysis (PCA) is one of the most widely used dimension reduction algorithms. The main idea of PCA is to map n-dimensional features to k-dimensional features, which are new orthogonal features known as principal components. They are k-dimensional features reconstructed on the basis of the original n-dimensional features. In essence, PCA finds projection directions that are orthogonal to each other and along which the variance of the data is largest. Here, we reduced the 100-dimensional features obtained by the SAE to 84 dimensions through PCA to obtain the final feature vector.
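The reduction from 100 to 84 dimensions can be sketched with a plain singular value decomposition. This is a generic PCA written with NumPy, not the authors' implementation; the function name `pca_reduce` and the random stand-in for the SAE output are hypothetical.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the k directions of largest variance."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    # Rows of Vt are orthonormal principal directions, sorted by variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
sae_features = rng.normal(size=(200, 100))       # stand-in for the 100-dim SAE output
Z = pca_reduce(sae_features, k=84)
```

The columns of `Z` are mutually uncorrelated and ordered by decreasing variance, which is the property the paragraph above describes.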
Ensemble learning completes learning tasks by building and combining multiple machine learning models. Because ensemble learning algorithms are often more accurate than single classifiers, they have received increasing attention in recent years. Rotation forest (RF) is a popular ensemble classifier proposed by Rodriguez et al. 22 , which has been widely used in various fields. First, RF randomly divides the features into different subsets. Principal component analysis (PCA) is then applied locally to each subset to rotate it, which increases diversity. The rotated data are fed into different decision trees. The final classification result is produced by voting over all the decision trees. Because of the randomness it introduces, RF helps to reduce overfitting and is robust to noise and outliers. Therefore, in this work, we chose the rotation forest as the classifier to process the learned features. We optimized the parameters through a grid search, and the rotation forest parameters K and n_classifiers were set to 200 and 139, respectively. The ensemble classifier is composed of several weak classifiers, and each subtree works on a feature subset of lower dimension. Each subtree is trained in the same way as an ordinary decision tree. Its time complexity is O(n*|D|*log(|D|)), where |D| is the feature dimension.
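The rotation step that gives the method its name can be sketched as follows for one ensemble member. This is only an illustration of the idea in Rodriguez et al.: the features are split into disjoint subsets, PCA is run on a bootstrap sample of each subset, and the resulting axes are assembled into one rotation matrix. The 75% bootstrap fraction, the function name `rotation_matrix` and the toy sizes are assumptions, and the decision tree that would be trained on the rotated data is not shown.

```python
import numpy as np

def rotation_matrix(X, n_subsets, rng, sample_frac=0.75):
    """Build one rotation matrix from per-subset PCA axes."""
    n, d = X.shape
    R = np.zeros((d, d))
    perm = rng.permutation(d)                    # random feature order
    for subset in np.array_split(perm, n_subsets):
        if subset.size == 0:
            continue
        # Bootstrap sample of the rows, restricted to this feature subset.
        rows = rng.choice(n, size=max(2, int(sample_frac * n)), replace=True)
        Xs = X[np.ix_(rows, subset)]
        Xs = Xs - Xs.mean(axis=0)
        # Full set of principal axes for the subset (an orthogonal matrix).
        _, _, Vt = np.linalg.svd(Xs, full_matrices=True)
        R[np.ix_(subset, subset)] = Vt.T
    return R

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 12))
R = rotation_matrix(X, n_subsets=4, rng=rng)
X_rot = X @ R            # input for one decision tree of the ensemble
```

Repeating this with a fresh random split for every tree produces differently rotated views of the same data, which is the source of diversity mentioned above. Since every block is orthogonal, the rotation keeps all the information in the features.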

Results and discussion
Evaluation criteria. We evaluated the performance of SAEROF by 10-fold cross-validation. The evaluation criteria include accuracy (Acc.), precision (Prec.), recall and F1-score, which are defined as:
$$Acc. = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Prec. = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$F1\text{-}score = \frac{2 \times Prec. \times Recall}{Prec. + Recall}$$
where TP is the number of positive samples correctly predicted as positive, TN is the number of negative samples correctly predicted as negative, FP is the number of negative samples wrongly predicted as positive, and FN is the number of positive samples wrongly predicted as negative. In addition, the Receiver Operating Characteristic (ROC) curve and the area under the curve (AUC), which comprehensively reflect the performance of the model, are also used in the experiment.
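As a small worked example of these criteria, the function below computes them from the four confusion-matrix counts using the standard definitions. The function name and the counts in the example are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

# Hypothetical counts for one cross-validation fold.
m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```

With these counts the accuracy is 170/200, the precision is 80/90, the recall is 80/100 and the F1-score is 160/190.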
Evaluation of prediction performance. We used 10-fold cross-validation to evaluate the ability of the SAEROF model to predict drug-disease associations. On Fdataset and Cdataset, the data were randomly divided into 10 equal parts. Each part was used in turn as the test set, with the other nine as the training set. Finally, the mean and standard deviation of the results of the ten experiments were calculated. Tables 2, 3 and Fig. 4 summarize the results. On Fdataset, the F1-score is 80.51% ± 1.75% and the mean AUC is 0.9092 ± 0.0103. On Cdataset, the results were as follows: accuracy is 83.47% ± 1.59%, precision is 85.83% ± 1.17%, recall is 80.21% ± 3.67%, F1-score is 82.87% ± 1.98% and mean AUC is 0.9323 ± 0.0081.
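The cross-validation protocol can be sketched with the standard library alone. This is a generic 10-fold splitter and a mean and standard deviation summary, not the authors' evaluation script. The names `k_fold_indices` and `summarize` are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import random
import statistics

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k roughly equal random folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]        # fold sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def summarize(scores):
    """Mean and sample standard deviation of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)

splits = list(k_fold_indices(103, k=10))
mean_score, std_score = summarize([0.8, 0.9])    # made-up per-fold scores
```

Each sample lands in exactly one test fold, so every drug-disease pair is scored once by a model that did not see it during training.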
The high accuracy of the SAEROF model stems from the feature extraction method and the choice of classifier. The sparse auto-encoder extracts relatively sparse features, and the ensemble strategy together with the rotation of the feature space for each tree gives the rotation forest classifier better classification ability.
To evaluate the SAEROF model from multiple perspectives, we compared its results with those of two state-of-the-art models, DrugNet and HGBI 23,24 . For all methods we used 10-fold cross-validation. The experimental results (as shown in Table 4) … and 0.074 higher than the DrugNet model, respectively. The comparison shows that the SAEROF model performs better than the other two models. Unlike these two models, SAEROF uses a sparse auto-encoder to learn sparse features and combines them with rotation forest classification to obtain more meaningful prediction results.
Table 3. 10-fold cross-validation results performed by SAEROF on Cdataset.
Comparison among different classifiers. To evaluate the effectiveness of the proposed feature extraction method combined with the rotation forest classifier, we replaced the rotation forest classifier with an SVM classifier 25 . Tables 5, 6 and Fig. 5 summarize the 10-fold cross-validation results of the SVM classifier on the two datasets. On Fdataset, the indicators of the SVM classifier are: accuracy 74.06% ± 1.83%, precision 71.12% ± 1.83%, recall 81.12% ± 3.62%, F1-score 75.74% ± 1.94% and mean AUC 0.8068 ± 0.0224. On Cdataset, the indicators of the SVM classifier are: accuracy 76.92% ± 1.99%, precision 74.25% ± 2.05%, recall 82.46% ± 2.26%, F1-score 78.13% ± 1.86% and mean AUC 0.8390 ± 0.0175. The rotation forest classifier clearly outperforms the SVM classifier in AUC (0.9092 vs. 0.8068 on Fdataset and 0.9323 vs. 0.8390 on Cdataset). We attribute this to ensemble learning and the rotation strategy, which give the rotation forest classifier better performance than the SVM classifier when the same feature descriptors are used.
Case studies. We carried out case studies on Fdataset and Cdataset. The case studies on Obesity and Stomach Neoplasms were carried out on Fdataset, and the case study on Lung Neoplasms was carried out on Cdataset. Specifically, we used Fdataset and Cdataset to train the model. It is important to note that when predicting the drugs associated with a disease, all known associations between that disease and the drugs are removed from the data set. We used the CTD database to validate the top 20 drugs predicted by SAEROF.
The World Health Organization has defined obesity as a disease that poses a threat to human health, manifested by excessive accumulation of fat. Obesity is a major risk factor for many chronic diseases, including diabetes, cardiovascular disease and even cancer.
We selected obesity as the first case study and used SAEROF to predict related drugs. As shown in Table 7, after comparing the prediction results with the CTD database, 17 of the top 20 predicted drugs were confirmed.
Stomach Neoplasms are common digestive disorders that can be benign or malignant. We selected this disease as a case study to validate the predictive power of SAEROF. Table 8 lists the 20 drugs that SAEROF predicts to be most highly associated with Stomach Neoplasms. Comparison with the CTD database shows that 16 of the top 20 drugs predicted for Stomach Neoplasms were confirmed.
The incidence and mortality of Lung Neoplasms have increased significantly in recent decades. We chose Lung Neoplasms on Cdataset as a case study to verify the predictive power of SAEROF. As shown in Table 9, comparing the predicted results with the CTD database, 16 of the top 20 predicted drugs proved to be associated with Lung Neoplasms.
Case studies of obesity, Stomach Neoplasms and Lung Neoplasms have shown that SAEROF performs well in predicting the most promising drugs.
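The case-study protocol described above (remove every known association of the disease under study, train, then rank the drugs) can be sketched as below. The sketch covers only the masking and ranking steps. The names `mask_disease` and `top_k_drugs` and the toy scores are hypothetical, and the scores would in practice come from the trained rotation forest.

```python
import numpy as np

def mask_disease(A, disease_idx):
    """Copy of A with all known associations of one disease removed.

    A has rows = diseases and columns = drugs.
    """
    A_train = np.array(A, dtype=float, copy=True)
    A_train[disease_idx, :] = 0.0
    return A_train

def top_k_drugs(scores, k=20):
    """Indices of the k drugs with the highest predicted scores."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    return order[:k].tolist()

A = np.array([[1, 0, 1],
              [0, 1, 1]])
A_train = mask_disease(A, disease_idx=0)
ranked = top_k_drugs([0.1, 0.9, 0.5], k=2)       # made-up prediction scores
```

The ranked indices would then be mapped back to drug names and checked against an external resource such as CTD.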

Conclusion
In order to further accelerate the process of drug repositioning, effective methods for predicting drug-disease association are urgently needed. Our model opens up new perspectives for predicting drug-disease associations.
In the feature extraction process, three kinds of descriptors (Gaussian interaction profile kernel similarity, drug structure similarity and disease semantic similarity) are extracted for each drug-disease association pair. Representative features are then extracted using the sparse auto-encoder. Finally, the rotation forest classifier is used for sample classification.
Experiments have shown that the SAEROF model is suitable for large-scale prediction of drug-disease associations, and the results of case studies on obesity, Stomach Neoplasms, and Lung Neoplasms confirm this view. In order to further improve the accuracy of the prediction model, protein information and disease gene information can be integrated in the future.

Table 9. The top 20 drugs predicted to be associated with Lung Neoplasms (columns: Index, Drug Name, Evidence).