iSS-PC: Identifying Splicing Sites via Physical-Chemical Properties Using Deep Sparse Auto-Encoder

Gene splicing is one of the most significant biological processes in eukaryotic gene expression: through RNA splicing, a pre-mRNA can produce one or more mature messenger RNAs containing coded information with multiple biological functions. Thus, identifying splicing sites in DNA/RNA sequences is significant both for biomedical research and for the discovery of new drugs. However, identification based only on experimental techniques is expensive and time-consuming, so new computational methods are needed. To identify splice donor sites and splice acceptor sites accurately and quickly, a deep sparse auto-encoder model with two hidden layers, called iSS-PC, was constructed based on the minimum-error law, in which we incorporated twelve physical-chemical properties of the dinucleotides within DNA into PseDNC to formulate the given sequence samples via a battery of cross-covariance and auto-covariance transformations. Five-fold cross-validation results on the same benchmark data-sets indicated that the new predictor remarkably outperformed the existing prediction methods in this field. Furthermore, it is expected that many other related problems can also be studied by this approach. To make the classification accessible, an easy-to-use web-server for identifying splicing sites has been established for free access at http://www.jci-bioinfo.cn/iSS-PC.

feature extraction, or to machine learning classification algorithms. To address these two issues, we present in this paper a solution to improve the performance of the predictive model.
On the one hand, improving the feature extraction method is of critical importance for classification performance. Since S. Wold 9 proposed the auto-covariance function (ACF) and cross-covariance function (CCF) in 1993 to analyze the relations between biopolymer sequences and chemical processes, this method has been applied to identify nuclear receptors and their subfamilies 10 and N6-methyladenosine sites 11 by incorporating physical-chemical properties into pseudo amino acid composition (PseAAC) or pseudo dinucleotide composition (PseDNC), respectively. Encouraged by these successes in computational proteomics, we apply a battery of cross-covariance and auto-covariance transformations to twelve physical-chemical properties of the dinucleotides within DNA to obtain a mode of PseDNC for formulating the given sequence samples.
On the other hand, the classification algorithm is one of the important factors affecting the performance of a classifier, and different classification algorithms generally yield different performances. Conventional classification algorithms, such as the Support Vector Machine (SVM) [12][13][14][15], random forest 16, hidden Markov model 17, Bayes 18, covariance discriminant (CD) 19 and Minimax Probability Machine (MPM) 20, have limitations in processing the original data. Recently, a novel family of classification algorithms, deep learning, has been proposed for big data and has overcome these limitations. Deep learning mainly includes the convolutional neural network (CNN) 21, deep belief network (DBN) 22 and stacked auto-encoder (SAE) 23,24, and remarkable progress has been made with it in diverse fields such as speech recognition and image recognition. In 2014, L. James et al. 25 first used an SAE to predict the θ and τ angles that represent the local backbone structure of proteins. In the same year, S. P. Nguyen et al. 26 built a model, "DL-Pro", that learned an SAE network as a classifier for protein structures. In 2016, J. Xu et al. 27 used the SAE algorithm for detection on breast cancer histopathology images, and W. Xu et al. 28 constructed a model for human promoter recognition with an SAE. Inspired by these achievements, the predictor called iSS-PC is constructed in this paper using a deep sparse auto-encoder, and its prediction performance has been greatly improved.
Based on a series of recent studies [29][30][31], we can conclude that the five steps 32 shown in Fig. 2 should be followed to establish a real and effective sequence-based biological predictor. Below, we discuss how to deal with these steps one by one. Of course, the order of these steps may be appropriately adjusted to a format suitable for the journal Scientific Reports.

Results and Discussion
Selection of the characteristic parameter. As described in the Methods section later in this article, we can obtain a feature vector containing 144 × τ components to represent the given sample sequence D. Here τ, named the characteristic parameter, is an integer. Obviously, the dimension I of the feature vector increases with the characteristic parameter τ, as shown below.

I = 144 × τ (1)
However, we should notice that an oversized τ value will lead to the curse of dimensionality. Thus, the value of τ was set at 2, 3, 4 and 5 in turn to carry out experiments, and the experimental results are listed in Tables 1 and 2. As can be seen from Table 1, τ = 5 gives the best results, but there is little difference between the results given by τ = 4 and τ = 5. Therefore, to reduce computation time, we fix the value of τ at 4.
Comparison with the iSS-PseDNC predictor. Tables 3 and 4 compare the 5-fold cross-validation test results of the new predictor with those of the iSS-PseDNC predictor constructed by Wei Chen 6 (2014) on the corresponding benchmark data-sets, for the splice donor and splice acceptor site sequences respectively. As can be seen from Table 3, although the Sn rate of the new predictor "iSS-PC" is only a little higher than that of the iSS-PseDNC predictor, the scores of the other three metrics have been greatly improved: the ACC rate of our predictor has increased by nearly three percent, and the MCC and Sp rates by nearly six percent each. This indicates that our predictor is superior to the iSS-PseDNC predictor at identifying the splice donor site sequences.
On the other hand, as can be seen from Table 4, although the Sn rate of the new iSS-PC predictor is 4% lower than that of the iSS-PseDNC predictor, the Sp rate of our predictor has increased by over 9 percent. Most importantly, the two key indicators for ranking different algorithms have both increased: ACC by nearly 2.5 percent and MCC by nearly 4.5 percent. This indicates that our predictor is also superior to the iSS-PseDNC predictor at identifying the splice acceptor site sequences.
From the above analyses, we can conclude that the feature extraction and classification methods designed in this paper are very effective on the splice site sequences, and that the iSS-PC predictor has higher prediction precision and consumes less time than the existing predictors.
Receiver operating characteristic (ROC) curves. The receiver operating characteristic (ROC) curve 33 is another important gauge of the performance of a predictor, presenting it to readers visually in graphical form. The area under the ROC curve (AUC) is a popular evaluation index of the performance of a binary classifier: studies 34,35 have indicated that a larger AUC means a better-performing predictor.
In Figs 3 and 4, the blue curve is generated by the new predictor "iSS-PC", and the green curve by the predictor "iSS-PseDNC" constructed by Wei Chen. As shown in Fig. 4, the AUC value produced by the predictor "iSS-PC" is 0.9628, whereas that produced by the predictor "iSS-PseDNC" is 0.9518. The AUC value of the predictor "iSS-PC" is higher than that of the predictor "iSS-PseDNC" for both the splice donor and acceptor site sequences. Therefore, we can conclude that our predictor "iSS-PC" is superior to the predictor "iSS-PseDNC", and the experimental results show that the predictor "iSS-PC" is accurate and stable.
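As background on how AUC values such as those above are computed, the following is a minimal pure-Python sketch using the rank-sum (Mann-Whitney) identity; the scores and labels are illustrative, not the paper's actual outputs.

```python
def auc_score(labels, scores):
    """AUC = probability that a random positive scores above a random negative."""
    # Sort sample indices by score, then assign average ranks to tied scores.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    # Mann-Whitney U statistic, normalized to [0, 1].
    return (sum(pos) - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
print(auc_score(labels, scores))  # 8 of the 9 positive/negative pairs are ranked correctly
```

In practice a library routine (e.g. scikit-learn's `roc_auc_score`) would be used; the hand-rolled version above is only to make the definition concrete.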
Comparison with traditional high-effectiveness machine learning algorithms. SVM and random forest (RF) are traditional but efficient classification algorithms. In addition, Dynamic selection and Circulating Combination-based ensemble Clustering, i.e. libD3C 36,37, is also a popular tool for binary classification tasks. To perform classification prediction quickly and easily, users can download the libD3C package from http://datamining.xmu.edu.cn/~gjs/LibD3C_1.1/index.html; meanwhile, WEKA, a free and open-source software program, should be downloaded and installed, after which the ensemble classification model built by libD3C can be created in WEKA. In this paper, we compare the SAE model with these traditional machine learning algorithms to examine the performance of the new predictor; the results are listed in Tables 5 and 6.
As the results in Tables 5 and 6 show, the rates of the two most important indicators, ACC and MCC, obtained by our predictor "iSS-PC" are significantly higher than those of the others. This indicates that the SAE classification algorithm is more effective at identifying the splice sites and that the new predictor "iSS-PC" will be a very useful tool in this regard.
Web server and its user guide. To help researchers identify splicing sites easily and in real time, a simple and practical network predictor called iSS-PC, shown in Fig. 5, has been developed and made available at http://www.jci-bioinfo.cn/iSS-PC. Below, we provide details on how to use it.
(a) To get information about the network predictor, click the Read Me button; you will obtain a brief introduction to our predictor and the caveats for using it.
(b) To obtain the benchmark data-sets used for training and testing the iSS-PC predictor in this paper, click the Supporting Information button. Several data-sets are available for download, such as S1, containing only splice donor site sequences, and S2, containing only splice acceptor site sequences.
(c) To get the important references and resources used in establishing the iSS-PC predictor, click the Citation button.
(d) Before entering query sequences or uploading a file for batch prediction, choose the type of splice site: splice donor site or splice acceptor site.
(e) The network predictor "iSS-PC" accepts single or multiple sequence queries, but the input sequences must be in FASTA format; otherwise the network predictor may report errors and request that you re-input your query sequence. Click the Example button on top of the first input box to see the input format.
(f) To obtain the prediction results, click the Submit button. After you enter query sequences in the first input box, the progress of the job is shown on your screen. When the job is finished, the results are displayed on the page as "The number of DNA sequences investigated: X", followed by "The DNA #xx is splice donor/acceptor site sequences" or "The DNA #xx is non-splice donor/acceptor site sequences".

Conclusions
Feature extraction is a key problem in bioinformatics research. In this article, we incorporated twelve physical-chemical properties of the dinucleotides within DNA into PseDNC to formulate the given sequence samples via a battery of cross-covariance and auto-covariance transformations, and achieved good results. With further research on feature extraction methods and the development of computer technology, more and more web servers have emerged, such as Pse-in-One 38, repRNA 39 and repDNA 40, from which many features, such as pseudo amino acid composition (PseAAC), pseudo dinucleotide composition (PseDNC), pseudo trinucleotide composition (PseTNC), dinucleotide-based auto covariance (DAC) and dinucleotide-based cross covariance (DCC), can be generated. Therefore, in the future we can try to study other similar genomic problems using the feature extraction methods based on these web servers. Classification algorithm design is another important step affecting the performance of a predictor. In this paper, we used a deep sparse auto-encoder to construct the iSS-PC predictor; using the same feature extraction method on the benchmark data-sets, we compared the SAE model with the traditional machine learning algorithms and found that the SAE classification algorithm was stable and reliable. Therefore, the new approach could be used to address many important tasks in bioinformatics, such as those tackled by iRSpot-EL 41, iDHS-EL 42 and iEnhancer-2L 43; this is work to be completed in the next phase. In fact, we have already constructed a predictor called "iDHSs-PseTNC" 44 that identifies DNase I hypersensitive sites with pseudo trinucleotide components via a deep sparse auto-encoder, and its results were superior to those of iDHS-EL.
In conclusion, the timely identification of splicing sites in DNA sequences is significant for the intensive study of DNA function and the development of new drugs. The experimental results of five-fold cross-validation on the same benchmark datasets indicated that the iSS-PC predictor was superior to the other predictors in this area. The results are promising enough for our predictor to be used as an analytic solution to more genomic problems, such as DNA-binding protein prediction 45, detection of tubule boundaries 46, methylation site prediction 47, phosphorylation site prediction 48 and protein-protein interaction prediction 49.

Methods
Benchmark dataset. In this paper, the benchmark dataset is composed of two parts: splice donor site sequences, denoted by S1, and splice acceptor site sequences, denoted by S2, as shown below.
S1 = S1+ ∪ S1−,  S2 = S2+ ∪ S2− (2)

where the subsets S1+ and S2+ contain the true splice donor and acceptor site sequences, respectively, and S1− and S2− contain the corresponding non-splice-site sequences. Each sample sequence in the benchmark dataset can be expressed as

D = N1 N2 N3 ⋯ NL (3)

where Ni (i = 1, 2, …, L) represents the ith nucleotide of the sequence sample and can be any one of the four nucleotides adenine (A), cytosine (C), guanine (G) and thymine (T), while L represents the length of the given sequence sample.
The literature has shown that among the discrete vector models for a DNA sample, the nucleic acid composition (NAC) is the simplest one. According to the NAC discrete vector model, the given sequence sample D of Eq. (3) can be defined as

D = [f(A) f(C) f(G) f(T)]T (4)

where f(N) (N ∈ {A, C, G, T}) is the normalized occurrence frequency of the corresponding nucleotide in the DNA sequence, and T is the transpose operator. In this way, however, all the sequence-order information of the sequence D is entirely lost.
As mentioned in the literature 51, in order to incorporate more short-range sequence-order or local information, the k-tuple nucleotide composition or k-mers approach can be used to formulate the given sequence D as a feature vector containing 4^k components, i.e.

D = [f1 f2 ⋯ f(4^k)]T (5)
where f1 is the normalized occurrence frequency of the first k-mer, f2 that of the second k-mer, and so on. It should be noted, however, that k is usually not more than 4; otherwise it may cause over-fitting, the "high-dimension disaster" 52 and increased computational run-time as the feature vector dimension grows.
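The k-mer composition described above can be sketched as follows; the example sequence is illustrative, and the helper assumes an ACGT-only input.

```python
from itertools import product

def kmer_features(seq, k):
    """Normalized occurrence frequencies of all 4**k k-mers of an ACGT sequence,
    in fixed lexicographic order (AA..., AC..., ...)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1          # number of overlapping k-mer windows
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total for m in kmers]

vec = kmer_features("ACGTACGT", 2)
print(len(vec))  # 16 components for k = 2, i.e. 4**k
```

For k = 1 this reduces to the NAC vector of four nucleotide frequencies; the rapid 4^k growth of the vector is why k is usually kept at 4 or below.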
To incorporate long-range or global sequence-order information, pseudo components were proposed to deal not only with peptide/protein sequences but also with RNA/DNA sequences. As mentioned in a recent paper 53, the sequence D of Eq. (3) can be formulated as below using the pseudo K-tuple nucleotide composition (PseKNC):

D = [d1 d2 d3 ⋯ dI]T (6)
where the subscript I, the vector dimension, is an integer. Its value, as well as the components in Eq. (6), depends on how the desired information is extracted from the sequence D. Below, the physical-chemical property matrix and the auto-covariance and cross-covariance transformations will be used to define the value of the subscript I in Eq. (6).

Physical-chemical property matrix. The values of the twelve physical-chemical (PC) properties for the sixteen dinucleotides are listed in Table 7. From them we can obtain a 12 × (L − 1) PC property matrix, as shown below:

P = | PC1(N1N2)   PC1(N2N3)   ⋯  PC1(N(L−1)NL)  |
    | PC2(N1N2)   PC2(N2N3)   ⋯  PC2(N(L−1)NL)  |
    | ⋮           ⋮           ⋱  ⋮              |
    | PC12(N1N2)  PC12(N2N3)  ⋯  PC12(N(L−1)NL) |   (7)

where PCi(NjNj+1) represents the ith (i = 1, 2, …, 12) PC property value of the dinucleotide NjNj+1 in Eq. (3). However, the data of Table 7 should be normalized by the following equation before being substituted into Eq. (7):

yk = (xk − mean(x)) / std(x) (8)

where xk represents the original PC property value in Table 7 of the kth (k = 1, 2, …, 16) dinucleotide, mean(x) the average value over the sixteen dinucleotides, and std(x) the corresponding standard deviation. The converted values yk will remain unchanged if they go through the same conversion procedure again.
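The z-score conversion and its idempotence can be sketched as follows; the sixteen input values are placeholders, not the actual Table 7 entries, and the population standard deviation (division by n) is an assumption of this sketch.

```python
def normalize_pc(values):
    """Z-score normalization y_k = (x_k - mean(x)) / std(x) over one property's
    sixteen dinucleotide values (population standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    return [(x - mean) / std for x in values]

x = [float(v) for v in range(16)]   # placeholder property values for 16 dinucleotides
y = normalize_pc(x)
z = normalize_pc(y)
# Already-normalized values have mean 0 and std 1, so a second pass leaves them unchanged.
print(all(abs(a - b) < 1e-9 for a, b in zip(y, z)))
```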
Auto-covariance and cross-covariance. The concepts of the auto-covariance function and cross-covariance function were proposed in 1993 for analyzing the relations between biopolymer sequences and chemical processes. Following the descriptions of the auto-covariance and cross-covariance transformations in the literature 10,11, these transformations can be expressed by the following mathematical expressions:
AC(μ, λ) = (1 / (L − 1 − λ)) Σ_{j=1}^{L−1−λ} [PCμ(NjNj+1) − P̄μ][PCμ(Nj+λNj+λ+1) − P̄μ] (9)

where AC represents the correlation of the same PC property between two sub-sequences separated by λ dinucleotides (λ = 1, 2, …, τ), and P̄μ is the mean of the data along the μth row of the matrix of Eq. (7).

CC(μ1, μ2, λ) = (1 / (L − 1 − λ)) Σ_{j=1}^{L−1−λ} [PCμ1(NjNj+1) − P̄μ1][PCμ2(Nj+λNj+λ+1) − P̄μ2],  μ1 ≠ μ2 (10)

where CC represents the correlation between two sub-sequences each belonging to a different PC property. Eq. (9) generates 12 × τ components associated with the PC properties of a sample sequence D in Eq. (3), and Eq. (10) generates 12 × 11 × τ components; together, the ACF and CCF over the 12 different PC properties generate (12 × τ + 12 × 11 × τ) = 144 × τ components. Therefore, the sample sequence D can eventually be formulated as

D = [ξ1 ξ2 ⋯ ξ(144×τ)]T (11)

where ξμ represents the μth of the 144 × τ components generated by Eqs (9) and (10) as described above.
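The AC/CC transformations can be sketched as follows. The mean-centering and the 1/(L − 1 − λ) scaling follow common conventions and are assumptions of this sketch; the toy matrix stands in for a real normalized PC property matrix.

```python
def ac_cc_features(P, max_lag):
    """Auto-covariance and cross-covariance features from a PC property matrix P
    (n_props rows x (L-1) columns), with lags 1..max_lag."""
    n_props, n_cols = len(P), len(P[0])
    means = [sum(row) / n_cols for row in P]        # per-property row means
    feats = []
    # Auto-covariance: same property, two windows separated by lag dinucleotides.
    for mu in range(n_props):
        for lag in range(1, max_lag + 1):
            m = n_cols - lag
            feats.append(sum((P[mu][j] - means[mu]) * (P[mu][j + lag] - means[mu])
                             for j in range(m)) / m)
    # Cross-covariance: two different properties.
    for mu1 in range(n_props):
        for mu2 in range(n_props):
            if mu1 == mu2:
                continue
            for lag in range(1, max_lag + 1):
                m = n_cols - lag
                feats.append(sum((P[mu1][j] - means[mu1]) * (P[mu2][j + lag] - means[mu2])
                                 for j in range(m)) / m)
    return feats

# With 12 properties and lags up to 4: 12*4 + 12*11*4 = 576 = 144*4 components.
P = [[float(i + j) for j in range(10)] for i in range(12)]  # toy 12 x 10 matrix
print(len(ac_cc_features(P, 4)))  # 576
```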
Deep sparse auto-encoder. In 1986, D. E. Rumelhart et al. 64 first proposed the concept of the auto-encoder for processing large, complex, high-dimensional data. In 2006, G. E. Hinton et al. 22 improved the prototype structure of the auto-encoder, giving rise to the deep auto-encoder (DAE). In 2008, Y. Bengio et al. 65 proposed the concept of the sparse auto-encoder, deepening the study of DAEs, and in 2010 P. Vincent 24 developed the stacked de-noising auto-encoder to yield significantly lower classification error. Based on this research 22, we constructed a deep sparse auto-encoder model with two hidden layers in this paper, as shown in Fig. 6. To implement classification accurately and quickly based on the minimum-error law, publicly available deep learning software packages, including SAE and NN implementations, can be used.

A set of metrics for measuring prediction quality. As mentioned in the literature, the accuracy (Acc), sensitivity (Sn), specificity (Sp) and Matthews correlation coefficient (MCC) introduced by Chou 66 are the most frequently used metrics for evaluating the performance of a predictor in bioinformatics. To make them easier for researchers to understand, the four metrics can be formulated as below 30,67:

Sn = 1 − N_−^+ / N^+
Sp = 1 − N_+^− / N^−
Acc = 1 − (N_−^+ + N_+^−) / (N^+ + N^−)
MCC = [1 − (N_−^+/N^+ + N_+^−/N^−)] / sqrt{[1 + (N_+^− − N_−^+)/N^+][1 + (N_−^+ − N_+^−)/N^−]}   (12)

where N^+ is the total number of splice-site sequences investigated, N_−^+ the number of splice-site sequences incorrectly predicted to be non-splice-site sequences, N^− the total number of non-splice-site sequences, and N_+^− the number of non-splice-site sequences incorrectly predicted to be splice-site sequences. However, it should be noted that the four metrics in Eq. (12) are valid only for single-label systems; they are unsuitable for the multi-label systems that appear frequently in systems biology and systems medicine, for which an utterly different set of metrics is needed, as elaborated in the literature 68.
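Under Chou's intuitive formulation, the four metrics can be computed directly from the error counts. A minimal sketch, with illustrative counts rather than the paper's actual results:

```python
from math import sqrt

def chou_metrics(n_pos, n_neg, fn, fp):
    """Sn, Sp, Acc and MCC in Chou's count-based formulation: n_pos/n_neg are the
    numbers of positive/negative samples, fn the positives predicted as negative,
    fp the negatives predicted as positive."""
    sn = 1 - fn / n_pos
    sp = 1 - fp / n_neg
    acc = 1 - (fn + fp) / (n_pos + n_neg)
    mcc = (1 - (fn / n_pos + fp / n_neg)) / sqrt(
        (1 + (fp - fn) / n_pos) * (1 + (fn - fp) / n_neg))
    return sn, sp, acc, mcc

# 100 positives with 10 misses, 100 negatives with 5 false alarms (illustrative only).
sn, sp, acc, mcc = chou_metrics(100, 100, 10, 5)
print(round(sn, 4), round(sp, 4), round(acc, 4))
```

This count-based MCC agrees with the usual confusion-matrix form (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), which is a useful cross-check when implementing it.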

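Complementing the model description above, the following is a minimal numpy sketch of a sparse auto-encoder trained greedily to two hidden layers. The layer sizes, learning rate, sparsity target and epoch count are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sparse_ae(X, n_hidden, rho=0.05, beta=0.1, lr=0.5, epochs=200, seed=0):
    """Train one sparse auto-encoder layer on X (n_samples x n_in) by batch
    gradient descent on reconstruction error plus a KL-divergence sparsity
    penalty on the mean hidden activation. Returns (W1, b1) so that the
    layer's code is sigmoid(X @ W1 + b1)."""
    rng = np.random.default_rng(seed)
    n, n_in = X.shape
    W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, n_in)); b2 = np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)           # hidden code
        Xhat = sigmoid(H @ W2 + b2)        # reconstruction
        rho_hat = H.mean(axis=0)           # mean activation per hidden unit
        # Backprop of 0.5*||Xhat - X||^2 / n + beta * KL(rho || rho_hat).
        d_out = (Xhat - X) * Xhat * (1 - Xhat) / n
        sparsity = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat)) / n
        d_hid = (d_out @ W2.T + sparsity) * H * (1 - H)
        W2 -= lr * H.T @ d_out; b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_hid; b1 -= lr * d_hid.sum(axis=0)
    return W1, b1

# Greedy stacking: the first layer's codes become the second layer's training data.
X = np.random.default_rng(1).random((50, 20))   # toy data in [0, 1]
W1, b1 = train_sparse_ae(X, 10)
H1 = sigmoid(X @ W1 + b1)
W2, b2 = train_sparse_ae(H1, 5)
H2 = sigmoid(H1 @ W2 + b2)
print(H2.shape)  # (50, 5): the two-hidden-layer encoding fed to the classifier
```

In a full predictor, the stacked encoding H2 would be followed by a supervised output layer and fine-tuning; this sketch covers only the unsupervised pretraining stage.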
Cross-validation.
After adopting the four well-known metrics above to evaluate the performance of predictors, the next consideration is which validation method should be used to calculate their values. Generally speaking, there are three popular cross-validation approaches in statistical prediction and analysis: the independent dataset test, K-fold cross-validation and the jackknife test. Although the jackknife test, which always yields a unique output for a given benchmark dataset, seems the least arbitrary, K-fold cross-validation has an advantage in computational time. Therefore, in this paper we adopt five-fold cross-validation to score the four metrics, as described below.
First, for the benchmark dataset S1 of Eq. (2) consisting of splice donor site sequences, we randomly divided the data-sets S1+ and S1− into five subsets of approximately equal size each, as shown below:

S1+ = S11+ ∪ S12+ ∪ S13+ ∪ S14+ ∪ S15+,  S1− = S11− ∪ S12− ∪ S13− ∪ S14− ∪ S15− (13)

with

|S11+| ≅ |S12+| ≅ |S13+| ≅ |S14+| ≅ |S15+|,  |S11−| ≅ |S12−| ≅ |S13−| ≅ |S14−| ≅ |S15−| (14)

where |S11+| denotes the number of elements (samples) in S11+, and so forth. Finally, we obtain five subsets of the benchmark dataset S1 by pairing the divided subsets according to their labels:

S1k = S1k+ ∪ S1k−,  k = 1, 2, …, 5 (15)

Therefore, we can single out each of the five subsets of Eq. (15) in turn to test a model trained with the remaining four subsets for identifying the splice donor site sequences. The cross-validation is carried out five times, and the average scores of the outputs are regarded as the final outcome. Notably, the same cross-validation process can be applied to the benchmark data-set S2 of Eq. (2) consisting of splice acceptor site sequences.
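The label-stratified five-fold partitioning described above can be sketched as follows; the sample names and set sizes are illustrative.

```python
import random

def five_fold_split(positives, negatives, seed=42):
    """Randomly partition the positive and negative sets into five roughly
    equal subsets each, then pair them by index into five stratified folds."""
    rng = random.Random(seed)
    def split5(items):
        items = items[:]
        rng.shuffle(items)
        return [items[i::5] for i in range(5)]   # five nearly equal slices
    return [pos + neg for pos, neg in zip(split5(positives), split5(negatives))]

pos = [f"donor_{i}" for i in range(13)]       # toy positive samples
neg = [f"non_donor_{i}" for i in range(11)]   # toy negative samples
folds = five_fold_split(pos, neg)
print([len(f) for f in folds])  # [6, 5, 5, 4, 4]

# Each round, one fold is the test set and the other four form the training set.
for k in range(5):
    test = folds[k]
    train = [s for j in range(5) if j != k for s in folds[j]]
    assert len(test) + len(train) == len(pos) + len(neg)
```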