Introduction

Generally, a pre-mRNA containing exons and one or more introns is transcribed from the DNA template of a eukaryotic gene. The pre-mRNA contains two types of splice sites: exon-intron boundaries, i.e. the 5′ ends of the introns, are called splice donor sites or 5′ splice sites, and intron-exon boundaries, i.e. the 3′ ends of the introns, are called splice acceptor sites or 3′ splice sites, as shown in Fig. 1. Before the pre-mRNA becomes a mature messenger RNA (mRNA), it must go through several biological processes (Fig. 1). The final mRNA, which contains only the remaining exons, can be directly involved in protein synthesis. Thus, the process of removing each intron from its 5′ splice site to its 3′ splice site and joining the exons to form mRNA plays an important role in gene regulation and expression, and accurate identification of splice sites is correspondingly important.

Figure 1

Sketch map showing the steps by which the pre-mRNA becomes a mature messenger RNA.

Although experimental techniques such as PCR have become some of the most important methods for accurately identifying splice sites, relying on experiments alone is expensive and time consuming. Hence, there is an urgent need for effective computational methods that help researchers identify splice sites in a timely fashion. In response, web-based computational splice-site analysis tools have emerged, such as NetGene1, 2, SplicePredictor3, GeneSplicer4 and SplicePort5. Recently, Wei Chen et al.6 built a prediction model, “iSS-PseDNC”, which incorporated six DNA local structural properties into pseudo dinucleotide composition to identify splice donor and acceptor sites. In 2016, M Iqbal et al.7 used the PseTNC and PseTetraNC methods to build a hybrid prediction model, called iSS-Hyb-mRMR, for identifying splice sites, and Prabina Kumar Meher8 used a hybrid feature extraction approach combining positional, dependency and compositional features to develop a predictor called HSplice for predicting donor splice sites in eukaryotic genes. These were, on balance, successful.

Although remarkable progress has been made in the identification of splice sites, splice-site predictors can still be improved with regard to both feature extraction and machine learning classification algorithms. In this paper, we present a solution that addresses both aspects to improve the performance of the predictive model.

On the one hand, improving the feature extraction method is of critical importance for classification performance. Since S Wold9 proposed the concept of the auto-covariance function (ACF) and cross-covariance function (CCF) in 1993 to analyze the relations between biopolymer sequences and chemical processes, this method has been applied to identify nuclear receptors and their subfamilies10 and N6-methyladenosine sites11 by incorporating physical-chemical properties into pseudo amino acid composition (PseAAC) or pseudo dinucleotide composition (PseDNC), respectively. Encouraged by the success of this feature extraction approach in these earlier studies, we apply a set of cross-covariance and auto-covariance transformations to twelve physical-chemical properties of DNA dinucleotides to obtain a form of PseDNC that represents the given sequence samples.

On the other hand, the classification algorithm is another important factor affecting the performance of a predictor, and in general different algorithms perform differently. Conventional classification algorithms, such as Support Vector Machine (SVM)12,13,14,15, random forest16, hidden Markov model17, Bayes18, covariance discriminant (CD)19, Minimax Probability Machine (MPM)20 and so on, have limitations in processing the original data. Recently, deep learning, a novel class of classification algorithms built on big data, has been proposed and has overcome these limitations. Deep learning algorithms mainly include the convolutional neural network (CNN)21, deep belief network (DBN)22 and stacked auto-encoder (SAE)23, 24, and they have brought remarkable progress in diverse fields such as speech recognition and image recognition. In 2014, L James et al.25 first used an SAE to predict the θ and τ angles that represent the local backbone structure of proteins. In the same year, SP Nguyen et al.26 built the model “DL-Pro”, which trained an SAE network as a classifier for protein structures. In 2016, J Xu et al.27 used the SAE algorithm for detection on breast cancer histopathology images, and W Xu et al.28 constructed an SAE-based model for human promoter recognition. Inspired by these achievements, we construct a predictor called iSS-PC using a deep sparse auto-encoder in this paper, and its prediction performance is greatly improved.

Based on a series of recent studies29,30,31, establishing a useful sequence-based biological predictor involves the five steps32 shown in Fig. 2. Below, we discuss how to deal with these steps one by one. The order in which they are presented has been adjusted to suit the format of the journal “Scientific Reports”.

Figure 2

Sketch map showing the steps for establishing a predictor for a biological system.

Results and Discussion

Selection of the characteristic parameter

As described in the Methods section later in the article, we obtain a feature vector containing 144 × τ components to represent the given sample sequence D. Here τ is called the characteristic parameter, and its value is an integer. The dimension I of the feature vector therefore increases with the characteristic parameter τ, as shown below.

$$I=\left\{\begin{array}{ll}288, & \tau =2\\ 432, & \tau =3\\ 576, & \tau =4\\ 720, & \tau =5\\ \vdots & \vdots \end{array}\right.$$
(1)

However, an oversized τ value leads to the curse of dimensionality. Thus, we carried out experiments with τ set to 2, 3, 4 and 5; the results are listed in Table 1 and Table 2. As can be seen from Table 1, τ = 5 gives the best results for splice donor sites, but the results for τ = 4 and τ = 5 differ little, so we fix τ at 4 to reduce computation time. As can be seen from Table 2, τ = 4 gives the best results for splice acceptor sites. We therefore generate a feature vector containing 144 × 4 = 576 components as the input of the deep sparse auto-encoder for identifying splice donor and splice acceptor sites.

Table 1 The test results of splice donor site sequences based on different characteristic parameter τ values.
Table 2 The test results of splice acceptor site sequences based on different characteristic parameter τ values.

Comparison with the existing methods

The four metrics, i.e. accuracy (Acc), sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (Mcc), reflect the performance of a predictor clearly. The scores obtained by the new predictor “iSS-PC” in the five-fold cross-validation test on the benchmark dataset composed solely of splice donor site sequences are listed in Table 3, and those for splice acceptor site sequences are listed in Table 4. For ease of comparison, the results obtained by the iSS-PseDNC predictor constructed by Wei Chen6 on the corresponding benchmark datasets are also listed in these tables.

Table 3 The comparison of the 5-fold cross-validation test results on benchmark data-set only containing splice donor site sequences.
Table 4 The comparison of the 5-fold cross-validation test results on benchmark data-set only containing splice acceptor site sequences.

As can be seen from Table 3, the Sn rate of the new predictor “iSS-PC” is only a little higher than that of the iSS-PseDNC predictor, whereas the scores of the other three metrics are greatly improved: the ACC rate increases by nearly three percent, the MCC rate by nearly six percent and the Sp rate also by nearly six percent. This indicates that our predictor is superior to the iSS-PseDNC predictor at identifying splice donor site sequences.

On the other hand, as can be seen from Table 4, although the Sn rate of the new iSS-PC predictor is 4% lower than that of the iSS-PseDNC predictor, the Sp rate of our predictor increases by over 9 percent. Most importantly, the two key indicators for ranking different algorithms both improve: ACC by nearly 2.5 percent and MCC by nearly 4.5 percent. This indicates that our predictor is also superior to the iSS-PseDNC predictor at identifying splice acceptor site sequences.

From the above analyses, we conclude that the feature extraction and classification methods designed in this paper are very effective for splice site sequences. The iSS-PC predictor has higher prediction precision and consumes less time than the existing predictors.

Receiver operating characteristic (ROC) curves

The receiver operating characteristic (ROC) curve33 is another important gauge of a predictor's performance, presenting it visually in graphical form. The area under the ROC curve (AUC) is a popular index for evaluating the performance of a binary classifier. Studies34, 35 indicate that the larger the AUC, the better the predictor’s performance.
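
As a minimal sketch of how an AUC value can be computed from classifier scores, the snippet below uses the rank-based (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `auc_from_scores` and the toy scores are illustrative assumptions and are not taken from iSS-PC.

```python
def auc_from_scores(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores: higher means "more likely a true splice site".
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2]
auc = auc_from_scores(positives, negatives)
print(round(auc, 4))
```

The pairwise loop is quadratic in the number of samples; it is written this way for clarity rather than speed.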

In Figs 3 and 4, the blue curve is generated by the new predictor “iSS-PC”, and the green curve by the predictor “iSS-PseDNC” constructed by Wei Chen et al. The corresponding AUC values computed over five-fold cross-validation are shown in the figures. For splice donor site sequences (Fig. 3), the AUC values are 0.9566 for iSS-PC and 0.9239 for iSS-PseDNC. For splice acceptor site sequences (Fig. 4), the AUC value is 0.9628 for iSS-PC and 0.9518 for iSS-PseDNC. The AUC value of iSS-PC is thus higher than that of iSS-PseDNC for both splice donor and acceptor site sequences. We therefore conclude that our predictor “iSS-PC” is superior to the predictor “iSS-PseDNC”, and the experimental results indicate that iSS-PC is accurate and stable.

Figure 3

ROC curves of the two different predictors for the splice donor site sequences.

Figure 4

ROC curves of the two different predictors for the splice acceptor site sequences.

Comparison with traditional high-effectiveness machine learning algorithms

SVM and random forest (RF) are traditional but efficient classification algorithms. In addition, libD3C36, 37, an ensemble classifier based on Dynamic selection and Circulating Combination-based ensemble Clustering, is a popular tool for binary classification tasks. The libD3C package can be downloaded from the website: http://datamining.xmu.edu.cn/~gjs/LibD3C_1.1/index.html. To use it, WEKA, a free and open-source software program, should also be downloaded and installed; the ensemble classification model constructed by libD3C can then be created in WEKA. In this paper, we compare the SAE model with these traditional machine learning algorithms to examine the performance of the new predictor. The results are listed in Tables 5 and 6.

Table 5 The 5-fold cross-validation test results obtained from different classification algorithms with the same feature extraction method on benchmark data-set only containing splice donor site sequences.
Table 6 The 5-fold cross-validation test results obtained from different classification algorithms with the same feature extraction method on benchmark data-set only containing splice acceptor site sequences.

Tables 5 and 6 show that the two most important indicators, ACC and MCC, obtained by our predictor “iSS-PC” are significantly higher than those of the other algorithms. This indicates that the SAE classification algorithm is more effective at identifying splice sites, and that the new predictor “iSS-PC” would be a very useful tool in this regard.

Web server and its user guide

To help researchers identify splice sites easily and in real time, we have developed a simple and practical web predictor, called iSS-PC, shown in Fig. 5. It is available at http://www.jci-bioinfo.cn/iSS-PC. The steps below describe how to use the web predictor “iSS-PC”.

  (a) To get information about the web predictor, click the Read Me button. You will obtain a brief introduction to the predictor and the caveats for using it.

  (b) To obtain the benchmark data-sets used for training and testing the iSS-PC predictor in this paper, click the Supporting Information button. Several data-sets are available for download, such as S 1, which contains only splice donor site sequences, and S 2, which contains only splice acceptor site sequences.

  (c) To get some important references and resources used in establishing the iSS-PC predictor, click the Citation button.

  (d) Before entering query sequences or uploading a file for batch prediction, choose the type of splice site: splice donor site or splice acceptor site.

  (e) The web predictor “iSS-PC” accepts single or multiple sequence queries. The input sequences must be in FASTA format; otherwise the predictor may report an error and ask you to re-enter your query sequence. Click the Example button above the first input box to see the input format.

  (f) To obtain the prediction results, enter the query sequences in the first input box and click the Submit button. The progress of the job is then shown on your screen. When the job is finished, the results are displayed on the page as “The number of DNA sequences investigated: X”, together with “The DNA #xx is splice donor/acceptor site sequences” or “The DNA #xx is non-splice donor/acceptor site sequences”.

  (g) The lower panel of Fig. 5 offers the option of batch prediction. To submit a batch of multiple sequences in FASTA format without having to wait online, click the Browse button. The prediction results of each batch job will be sent to your e-mail address. Click the Batch-example button to see an example of a batch file in FASTA format.

  (h) The number of times the web predictor “iSS-PC” has been run is displayed numerically beneath the panel above. To a certain extent, this number reflects the popularity of the predictor.
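
Because the web server requires FASTA input, it can help to check a query file locally before submitting it. The following is a minimal sketch of such a check based on the general FASTA convention (a header line starting with ">" followed by one or more sequence lines). The helper names `parse_fasta` and `is_valid_dna` are hypothetical, and the validation actually performed by the iSS-PC server is not described in this paper.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                raise ValueError("sequence data found before any '>' header line")
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_valid_dna(seq):
    """True if the sequence is non-empty and contains only A, C, G and T."""
    return bool(seq) and set(seq) <= set("ACGT")


# Hypothetical query with two records; the second is given in lower case.
example = """>query_1
ACGTTGCA
GGTAAGTC
>query_2
ttgcaggt
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, seq, is_valid_dna(seq))
```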

Figure 5

A semi-screenshot of the homepage for the web-server “iSS-PC”.

Conclusions

Feature extraction is a key problem in bioinformatics research. In this article, we incorporated twelve physical-chemical properties of DNA dinucleotides into PseDNC through a set of cross-covariance and auto-covariance transformations to formulate the given sequence samples, and achieved good results. Meanwhile, with further research on feature extraction methods and the development of computer technology, more and more web servers have emerged, such as Pse-in-One38, repRNA39, and repDNA40. Many features, such as pseudo amino acid composition (PseAAC), pseudo dinucleotide composition (PseDNC), pseudo trinucleotide composition (PseTNC), dinucleotide-based auto covariance (DAC) and dinucleotide-based cross covariance (DCC), can be generated with these web servers. In the future, we will therefore try to study other similar genomic problems using the feature extraction methods provided by these web servers.

Classification algorithm design is another important step that affects the performance of a predictor. In this paper, we used a deep sparse auto-encoder to construct the iSS-PC predictor. Using the same feature extraction method on the benchmark data-sets, we compared the SAE model with traditional machine learning algorithms and found that the SAE classification algorithm was stable and reliable. The new approach could therefore be applied to many important tasks in bioinformatics, such as those addressed by iRSpot-EL41, iDHS-EL42 and iEnhancer-2L43; this work is planned for the next phase. In fact, we have already constructed a predictor called “iDHSs-PseTNC”44, which identifies DNase I hypersensitive sites from pseudo trinucleotide composition with a deep sparse auto-encoder, and its results were superior to those of iDHS-EL.

In conclusion, the timely identification of splice sites in DNA sequences is significant for the in-depth study of DNA function and the development of new drugs. The experimental results of five-fold cross-validation on the same benchmark datasets indicate that the iSS-PC predictor is superior to the other predictors compared in this paper. The results are also promising enough for our approach to be applied to other genomic problems, such as DNA-binding protein prediction45, detection of tubule boundary46, methylation site prediction47, phosphorylation site prediction48, and protein-protein interaction prediction49.

Methods

Benchmark dataset

In this paper, the benchmark dataset is composed of two parts: splice donor site sequences, denoted by \({S}_{1}\), and splice acceptor site sequences, denoted by \({S}_{2}\), as shown below.

$${S}_{1}={S}_{1}^{+}\cup {S}_{1}^{-};{S}_{2}={S}_{2}^{+}\cup {S}_{2}^{-}$$
(2)

where \({S}_{1}^{+}\) represents the positive dataset containing 2796 true splice donor site sequences, while \({S}_{1}^{-}\) represents the negative dataset consisting of 2800 false splice donor site sequences. Likewise, \({S}_{2}^{+}\) is the positive dataset composed of 2880 true splice acceptor site sequences, while \({S}_{2}^{-}\) is the negative dataset composed of 2800 false splice acceptor site sequences. The symbol \(\cup \) denotes “union” in set theory. Datasets \({S}_{1}\) and \({S}_{2}\), provided by Wei Chen6, can be downloaded from the website http://dx.doi.org/10.1155/2014/623149 or obtained from the Supplementary Information.

Feature extraction

Generally, the input of nearly all machine learning based classifiers must be numerical features rather than sequences50; therefore, splice site sequences should be transformed into numerical feature vectors. Below, we describe how to formulate a sample sequence as a discrete vector model.

A sequence sample in the current benchmark dataset can be generally expressed as

$${\rm{D}}={N}_{1}{N}_{2}{N}_{3}{N}_{4}{N}_{5}{N}_{6}{N}_{7}\cdots {N}_{L}$$
(3)

where \({N}_{i}\) (i = 1, 2, …, L) represents the ith nucleotide of the sequence sample, which can be any one of the four nucleotides adenine (A), cytosine (C), guanine (G) and thymine (T), and L represents the length of the given sequence sample.

The literature shows that, among the discrete vector models for a DNA sample, nucleic acid composition (NAC) is the simplest one. According to the NAC discrete vector model, the given sequence sample D of Eq. (3) can be defined as

$${\rm{D}}={[\begin{array}{cccc}f(A) & f(C) & f(G) & f(T)\end{array}]}^{T}$$
(4)

where \({f}_{i}=f(\cdot )\) (i = 1, 2, 3, 4) is the normalized occurrence frequency of the corresponding nucleotide in the DNA sequence, and T is the transpose operator. In this representation, however, all sequence-order information of D is lost.
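
As a small illustration of Eq. (4), the sketch below computes the four normalized nucleotide frequencies and shows that two sequences with different orders can share the same NAC vector. The function name `nucleic_acid_composition` is an illustrative choice and not part of the iSS-PC software.

```python
def nucleic_acid_composition(seq):
    """Return [f(A), f(C), f(G), f(T)], the normalized frequencies of Eq. (4)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(base) / len(seq) for base in "ACGT"]


nac = nucleic_acid_composition("AACGTTTT")
print(nac)  # [0.25, 0.125, 0.125, 0.5]

# Different orderings of the same nucleotides give identical NAC vectors,
# which is why the sequence-order information is lost.
same = nucleic_acid_composition("ACGT") == nucleic_acid_composition("TGCA")
print(same)  # True
```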

As mentioned in the literature51, in order to incorporate more short-range (local) sequence-order information, the k-tuple nucleotide composition, or k-mer approach, can be used to formulate the given sequence D as a feature vector containing \({4}^{k}\) components, i.e.

$${\rm{D}}={[\begin{array}{cccccc}{f}_{1} & {f}_{2} & {f}_{3} & \cdots & {f}_{{4}^{k}-1} & {f}_{{4}^{k}}\end{array}]}^{T}$$
(5)

where \({f}_{1}\) is the normalized occurrence frequency of the first k-mer, \({f}_{2}\) that of the second k-mer, and so on. It should be noted, however, that k is usually not more than 4; otherwise, as the dimension of the feature vector increases, it may cause over-fitting, the “high-dimension disaster”52 and increased computational run-time.
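
The k-mer representation of Eq. (5) can be sketched as follows. Two details are assumptions made for illustration, because the text does not specify them: the k-mers are ordered lexicographically (AA…A first), and they are counted in overlapping windows. The function name `kmer_composition` is hypothetical.

```python
from itertools import product


def kmer_composition(seq, k):
    """Return the 4**k normalized k-mer frequencies, in lexicographic order."""
    seq = seq.upper()
    total = len(seq) - k + 1  # number of overlapping windows of length k
    if total <= 0:
        raise ValueError("sequence shorter than k")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # windows with other symbols (e.g. N) are skipped
            counts[window] += 1
    return [counts[m] / total for m in kmers]


vec = kmer_composition("ACGTAC", 2)
print(len(vec))   # 16 components for k = 2
print(vec[1])     # frequency of "AC": 2 of the 5 windows
```

The vector length grows as 4, 16, 64, 256 for k = 1 to 4, which is the dimensional growth that the text warns about.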

To incorporate long-range or global sequence-order information, pseudo components were proposed to deal with not only peptide/protein sequences but also RNA/DNA sequences. As mentioned in the recent paper53, the sequence D of Eq. (3) can be formulated as below by using the pseudo k-tuple nucleotide composition (PseKNC).

$${\rm{D}}={[\begin{array}{ccccccc}{\xi }_{1} & {\xi }_{2} & {\xi }_{3} & \cdots & {\xi }_{\mu } & \cdots & {\xi }_{I}\end{array}]}^{T}$$
(6)

where the subscript I, the vector dimension, is an integer. Its value, as well as the components in Eq. (6), depends on how the desired information is extracted from the sequence D.

Below, the “physical-chemical property matrix” and the “auto-covariance and cross-covariance transformations” will be used to define the value of the subscript I in Eq. (6).

Physical-chemical property matrix

DNA physical-chemical (PC) properties are the most intuitive features of biochemical reactions. Each of the sixteen different dinucleotides (dimers) AA, AC, AG, AT, CA, …, TT in a DNA sequence has its own PC property values. In this paper, the following twelve PC properties were adopted: (1) HC1: A-philicity54; (2) HC2: base stacking55; (3) HC3: B-DNA twist56; (4) HC4: bendability57; (5) HC5: DNA bending stiffness58; (6) HC6: DNA denaturation59; (7) HC7: duplex disrupt energy60; (8) HC8: duplex free energy61; (9) HC9: propeller twist56; (10) HC10: protein deformation62; (11) HC11: protein-DNA twist62; (12) HC12: Z-DNA63. The original values of the twelve descriptors for each dinucleotide are listed in Table 7. Since a sequence of length L contains L − 1 overlapping dinucleotides, we obtain a 12 × (L − 1) PC property matrix as shown below.

$${\rm{D}}=\left[\begin{array}{cccc}P{C}^{1}({N}_{1}{N}_{2}) & P{C}^{1}({N}_{2}{N}_{3}) & \cdots & P{C}^{1}({N}_{L-1}{N}_{L})\\ P{C}^{2}({N}_{1}{N}_{2}) & P{C}^{2}({N}_{2}{N}_{3}) & \cdots & P{C}^{2}({N}_{L-1}{N}_{L})\\ \vdots & \vdots & \ddots & \vdots \\ P{C}^{12}({N}_{1}{N}_{2}) & P{C}^{12}({N}_{2}{N}_{3}) & \cdots & P{C}^{12}({N}_{L-1}{N}_{L})\end{array}\right]$$
(7)

where \(P{C}^{i}({N}_{j}{N}_{j+1})\) represents the ith (i = 1, 2, …, 12) PC property value for the dinucleotide \({N}_{j}{N}_{j+1}\) in Eq. (3). However, the data of Table 7 should be normalized by the following equation before they are substituted into Eq. (7).

$${y}_{k}=({x}_{k}-mean(x))/std(x)$$
(8)

where \({x}_{k}\) represents the original value in Table 7 of a given PC property for the kth (k = 1, 2, …, 16) dinucleotide, mean(x) represents the average value of that property over the sixteen dinucleotides, and std(x) the corresponding standard deviation. The converted values \({y}_{k}\) have zero mean and unit standard deviation, so they remain unchanged if they go through the same conversion procedure again.

Table 7 The original values of the twelve PC properties for each dinucleotide.
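
The normalization of Eq. (8) and its idempotence can be checked with a short sketch. The input numbers are placeholders rather than values from Table 7, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used (the unchanged-on-repeat property holds for either choice).

```python
from statistics import mean, pstdev


def standardize(values):
    """Apply Eq. (8): subtract the mean and divide by the standard deviation."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation (assumed here)
    if s == 0:
        raise ValueError("all values are identical; standard deviation is zero")
    return [(v - m) / s for v in values]


raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical property values
once = standardize(raw)
twice = standardize(once)
print(once)
print(all(abs(a - b) < 1e-12 for a, b in zip(once, twice)))
```

For the real data, the same function would be applied separately to each of the twelve rows of Table 7, each holding sixteen dinucleotide values.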

Auto-covariance and cross covariance

The concept of the auto-covariance function and cross-covariance function was proposed in 1993 for analyzing the relations between biopolymer sequences and chemical processes. Following the description of the auto-covariance and cross-covariance transformations in the literature10, 11, these transformations can be expressed as follows.

$${\rm{AC}}({\rm{\mu }},{\rm{\tau }})=\frac{{\sum }_{j=1}^{L-1-\tau }[P{C}^{\mu }({N}_{j}{N}_{j+1})-\overline{P{C}^{\mu }}][P{C}^{\mu }({N}_{j+\tau }{N}_{j+1+\tau })-\overline{P{C}^{\mu }}]}{L-1-\tau }\,({\rm{\mu }}=1,2,\cdots ,12)$$
(9)

where AC represents the correlation of the same PC property between two subsequences separated by τ dinucleotides (τ = 1, 2, …, L − 2), and \(\overline{P{C}^{\mu }}=\frac{{\sum }_{j=1}^{L-1}P{C}^{\mu }({N}_{j}{N}_{j+1})}{L-1}\) is the mean of the data along the μth row of the matrix of Eq. (7).

$${\rm{CC}}({n}_{1},{n}_{2},{\rm{\tau }})=\frac{{\sum }_{j=1}^{L-1-\tau }[P{C}^{{n}_{1}}({N}_{j}{N}_{j+1})-\overline{P{C}^{{n}_{1}}}][P{C}^{{n}_{2}}({N}_{j+\tau }{N}_{j+1+\tau })-\overline{P{C}^{{n}_{2}}}]}{L-1-\tau }\,({n}_{1}\ne {n}_{2})$$
(10)

where CC represents the correlation between two subsequences each belonging to a different PC property.

With the separation running from 1 up to the characteristic parameter τ, Eq. (9) generates 12 × τ components associated with the PC properties of a sample sequence D in Eq. (3), one for each property and each separation. Likewise, Eq. (10) generates 12 × 11 × τ components, one for each ordered pair of different properties and each separation. In total, ACF and CCF applied to the 12 PC properties generate (12 × τ + 12 × 11 × τ) = 144 × τ components. Therefore, the sample sequence D can eventually be formulated as

$${\rm{D}}={[\begin{array}{ccccccc}{\xi }_{1} & {\xi }_{2} & {\xi }_{3} & \cdots & {\xi }_{\mu } & \cdots & {\xi }_{144\times \tau }\end{array}]}^{T}$$
(11)

where \({\xi }_{\mu }\) represents the μth of the 144 × τ components generated by Eqs (9) and (10) as described above.
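
The construction of the 144 × τ feature vector from Eqs (7), (9) and (10) can be sketched as follows. This is not the authors' implementation: the property values in `toy_props` are placeholders (the real, normalized values come from Table 7), the order of the components in the output is an assumption, and all function names are hypothetical.

```python
from itertools import product


def property_matrix(seq, props):
    """One row per property, one column per overlapping dinucleotide (L - 1 columns)."""
    dinucs = [seq[j:j + 2] for j in range(len(seq) - 1)]
    return [[table[d] for d in dinucs] for table in props]


def covariance(row_a, row_b, lag):
    """Eq. (9) when both rows are the same property, Eq. (10) otherwise."""
    n = len(row_a)  # n = L - 1
    if not 1 <= lag < n:
        raise ValueError("lag must satisfy 1 <= lag <= L - 2")
    mean_a = sum(row_a) / n
    mean_b = sum(row_b) / n
    total = sum((row_a[j] - mean_a) * (row_b[j + lag] - mean_b)
                for j in range(n - lag))
    return total / (n - lag)


def acc_features(seq, props, tau):
    """AC terms for every property, then CC terms for every ordered pair, lags 1..tau."""
    rows = property_matrix(seq, props)
    p = len(rows)
    feats = [covariance(rows[u], rows[u], lag)
             for u in range(p) for lag in range(1, tau + 1)]
    feats += [covariance(rows[a], rows[b], lag)
              for a in range(p) for b in range(p) if a != b
              for lag in range(1, tau + 1)]
    return feats


DINUCS = ["".join(pair) for pair in product("ACGT", repeat=2)]
# Placeholder numbers only; they stand in for the normalized Table 7 values.
toy_props = [{d: float((3 * i + 5 * k) % 7) for k, d in enumerate(DINUCS)}
             for i in range(12)]
features = acc_features("ACGTTGCAGGTAAGTCCA", toy_props, tau=4)
print(len(features))  # 576 = 144 * 4
```

With twelve properties, the first list comprehension yields 12 × τ values and the second 12 × 11 × τ values, which matches the component count derived in the text.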

Deep sparse auto-encoder

In 1986, DE Rumelhart et al.64 first proposed the concept of an auto-encoder for processing large, complex, high-dimensional data. In 2006, GE Hinton et al.22 improved the prototype structure of the auto-encoder, giving rise to the deep auto-encoder (DAE). Thereafter, in 2008, Y Bengio et al.65 proposed the concept of the sparse auto-encoder, which took the study of DAE further. In 2010, P Vincent24 developed the stacked de-noising auto-encoder, which yields significantly lower classification error.

Based on the research22, we constructed a deep sparse auto-encoder model with two hidden layers in this paper, as shown in Fig. 6. To implement classification accurately and quickly based on the minimum-error criterion, we used deep learning software packages, including the SAE and NN packages, which can be obtained from the website https://github.com/rasmusbergpalm/DeepLearnToolbox. Note that, to optimize the effectiveness of the SAE algorithm, the model parameters should be fine-tuned through iterative optimization in order to obtain the best results.
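
The text does not spell out the sparsity term of the auto-encoder, so the sketch below only illustrates one common formulation, in which a Kullback-Leibler penalty pushes the average activation of each hidden unit towards a small target value ρ. The layer sizes, weights, target `rho` and function names are all assumptions for illustration; they are not the settings of iSS-PC or of the toolbox mentioned above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_activations(x, weights, biases):
    """One encoder layer: a_j = sigmoid(sum_i W[j][i] * x[i] + b[j])."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def kl_sparsity_penalty(batch_activations, rho=0.05):
    """Sum over hidden units of KL(rho || mean activation of that unit)."""
    n_units = len(batch_activations[0])
    penalty = 0.0
    for j in range(n_units):
        rho_hat = sum(a[j] for a in batch_activations) / len(batch_activations)
        penalty += (rho * math.log(rho / rho_hat)
                    + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))
    return penalty


# Tiny hypothetical layer: two inputs, two hidden units.
x_batch = [[0.0, 1.0], [1.0, 0.0]]
W = [[0.5, -0.5], [1.0, 1.0]]
b = [0.0, -1.0]
acts = [hidden_activations(x, W, b) for x in x_batch]
penalty = kl_sparsity_penalty(acts, rho=0.05)
print(penalty)
```

In a full model this penalty would be added, with a weighting factor, to the reconstruction error of each auto-encoder layer before the stacked network is fine-tuned for classification.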

Figure 6

A sketch map of a deep sparse auto-encoder model with two hidden layers.

The predictor established according to the above-mentioned procedures is called ‘iSS-PC’, where ‘i’ stands for ‘identifying’, ‘SS’ for ‘splicing sites’ and ‘PC’ for ‘physical-chemical property’.

There are two issues to be dealt with: one is ‘what metrics should be used to examine the accuracy of the predictor?’ The other is ‘what validation method should be taken to calculate the metric values?’

A set of metrics for measuring prediction quality

As mentioned in the literature, accuracy (Acc), sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (Mcc), as introduced by Chou66, are the most frequently used metrics for evaluating the performance of a predictor in bioinformatics. To make them easier for researchers to understand, the four metrics can be formulated as below30, 67.

$$\left\{\begin{array}{l}ACC=1-\frac{{N}_{-}^{+}+{N}_{+}^{-}}{{N}^{+}+{N}^{-}}\\ Mcc=\frac{1-\left(\frac{{N}_{-}^{+}}{{N}^{+}}+\frac{{N}_{+}^{-}}{{N}^{-}}\right)}{\sqrt{\left(1+\frac{{N}_{+}^{-}-{N}_{-}^{+}}{{N}^{+}}\right)\left(1+\frac{{N}_{-}^{+}-{N}_{+}^{-}}{{N}^{-}}\right)}}\\ Sn=1-\frac{{N}_{-}^{+}}{{N}^{+}}\\ Sp=1-\frac{{N}_{+}^{-}}{{N}^{-}}\end{array}\right.$$
(12)

where \({N}^{+}\) is the total number of true splice donor (or acceptor) site sequences investigated, and \({N}_{-}^{+}\) is the number of true splice donor (or acceptor) site sequences misidentified as false ones; \({N}^{-}\) is the total number of false splice donor (or acceptor) site sequences investigated, and \({N}_{+}^{-}\) is the number of false splice donor (or acceptor) site sequences mis-predicted as true ones.

However, it should be noted that the four metrics formulated in Eq. (12) are valid only for single-label systems and are unsuitable for the multi-label systems that appear frequently in systems biology and systems medicine. For the latter, an utterly different set of metrics is needed, as elaborated in the literature68.
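
As a worked illustration of Eq. (12), the sketch below evaluates the four metrics from the two class sizes and the two error counts, and the toy case can be checked by hand. The function name `chou_metrics`, its argument names and the counts are illustrative assumptions, not results from this paper.

```python
import math


def chou_metrics(n_pos, n_neg, fn, fp):
    """Eq. (12) with n_pos = N+, n_neg = N-, fn = N+_- and fp = N-_+."""
    sn = 1 - fn / n_pos
    sp = 1 - fp / n_neg
    acc = 1 - (fn + fp) / (n_pos + n_neg)
    # The denominator is zero when every sample is assigned to a single class.
    denom = math.sqrt((1 + (fp - fn) / n_pos) * (1 + (fn - fp) / n_neg))
    mcc = (1 - (fn / n_pos + fp / n_neg)) / denom
    return {"Sn": sn, "Sp": sp, "Acc": acc, "Mcc": mcc}


# Hypothetical counts: 100 positives with 10 missed, 100 negatives with 20 false alarms.
scores = chou_metrics(n_pos=100, n_neg=100, fn=10, fp=20)
for name, value in scores.items():
    print(name, round(value, 4))
```

For these counts, Sn = 0.9, Sp = 0.8 and Acc = 0.85, and the Mcc value agrees with the conventional confusion-matrix form (TP·TN − FP·FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)).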

Cross-validation

With the four well-known metrics adopted to evaluate the performance of predictors, the next question is which validation method should be used to calculate their values. Generally speaking, three validation methods are popular in statistical prediction: the independent dataset test, K-fold cross-validation and the jackknife test. Although the jackknife test seems the least arbitrary because it always yields a unique output for a given benchmark dataset, K-fold cross-validation requires much less computational time. Therefore, in this paper, we adopt five-fold cross-validation to score the four metrics. The specific procedure is described below.

Firstly, for the benchmark dataset \({S}_{1}\) of Eq. (2) consisting of splice donor site sequences, we randomly divided each of the data-sets \({S}_{1}^{+}\) and \({S}_{1}^{-}\) into five subsets of approximately equal size, as shown below

$$\left\{\begin{array}{c}{S}_{1}^{+}={S}_{11}^{+}\cup {S}_{12}^{+}\cup {S}_{13}^{+}\cup {S}_{14}^{+}\cup {S}_{15}^{+}\\ {S}_{1}^{-}={S}_{11}^{-}\cup {S}_{12}^{-}\cup {S}_{13}^{-}\cup {S}_{14}^{-}\cup {S}_{15}^{-}\end{array}\right.$$
(13)

where \({S}_{1i}^{+}\) is the ith subset of \({S}_{1}^{+}\), labelled with the partition index i (i = 1, 2, …, 5), and \({S}_{1i}^{-}\) is likewise the ith subset of \({S}_{1}^{-}\). Both \({S}_{1i}^{+}\) and \({S}_{1i}^{-}\) satisfy the following conditions.

$$\left\{\begin{array}{c}|{S}_{11}^{+}|\approx |{S}_{12}^{+}|\approx |{S}_{13}^{+}|\approx |{S}_{14}^{+}|\approx |{S}_{15}^{+}|\\ |{S}_{11}^{-}|\approx |{S}_{12}^{-}|\approx |{S}_{13}^{-}|\approx |{S}_{14}^{-}|\approx |{S}_{15}^{-}|\end{array}\right.$$
(14)

where \(|{S}_{11}^{+}|\) denotes the number of elements (samples) in \({S}_{11}^{+}\), and so forth.

Finally, by merging the positive and negative subsets that carry the same partition index, we obtain five subsets of the benchmark dataset \({S}_{1}\), as shown below

$${S}_{1}={S}_{1}^{\prime }\cup {S}_{2}^{\prime }\cup {S}_{3}^{\prime }\cup {S}_{4}^{\prime }\cup {S}_{5}^{\prime }$$
(15)

where \({S}_{1}^{\prime }={S}_{11}^{+}\cup {S}_{11}^{-},{S}_{2}^{\prime }={S}_{12}^{+}\cup {S}_{12}^{-}\), and so forth,

with

$$|{S}_{1}^{\prime }|\approx |{S}_{2}^{\prime }|\approx |{S}_{3}^{\prime }|\approx |{S}_{4}^{\prime }|\approx |{S}_{5}^{\prime }|$$
(16)

Each of the five subsets of Eq. (15) is then singled out in turn to test the model trained on the remaining four subsets for identifying splice donor site sequences. The cross-validation is thus carried out five times, and the average of the five scores is taken as the final outcome. The same cross-validation process is used for the benchmark data-set \({S}_{2}\) of Eq. (2) consisting of splice acceptor site sequences.
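
The partitioning of Eqs (13) to (16) and the five rounds of training and testing can be sketched as follows. The sample identifiers are placeholders that only reproduce the sizes of \({S}_{1}^{+}\) and \({S}_{1}^{-}\); the random seed, the function names and the round-robin dealing of shuffled samples are assumptions for illustration.

```python
import random


def split_into_folds(samples, k, rng):
    """Shuffle a copy of the samples and deal it into k folds of nearly equal size."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def stratified_folds(positives, negatives, k=5, seed=0):
    """Fold i is the union of the ith positive subset and the ith negative subset."""
    rng = random.Random(seed)
    pos_folds = split_into_folds(positives, k, rng)
    neg_folds = split_into_folds(negatives, k, rng)
    return [[(s, 1) for s in p] + [(s, 0) for s in n]
            for p, n in zip(pos_folds, neg_folds)]


def train_test_rounds(folds):
    """Yield (train, test) pairs, holding out each fold exactly once."""
    for i, test in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


# Placeholder identifiers with the sizes of the donor-site benchmark dataset.
positives = [f"pos_{i}" for i in range(2796)]
negatives = [f"neg_{i}" for i in range(2800)]
folds = stratified_folds(positives, negatives)
sizes = [len(f) for f in folds]
print(sizes)
for train, test in train_test_rounds(folds):
    # A real run would train the classifier on `train` and score it on `test`;
    # the final metrics are the averages over the five rounds.
    print(len(train), len(test))
```

Splitting the positive and negative sets separately keeps the class ratio nearly identical across the five folds, which is what Eqs (13) and (14) require.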