iSS-PC: Identifying Splicing Sites via Physical-Chemical Properties Using Deep Sparse Auto-Encoder

Xu, Zhao-Chun; Wang, Peng; Qiu, Wang-Ren; Xiao, Xuan

doi:10.1038/s41598-017-08523-8


Article
Open access
Published: 15 August 2017


Zhao-Chun Xu ORCID: orcid.org/0000-0003-1799-5529¹,
Peng Wang¹,
Wang-Ren Qiu^1,2 &
…
Xuan Xiao^1,3

Scientific Reports volume 7, Article number: 8222 (2017)


Abstract

Gene splicing is one of the most significant biological processes in eukaryotic gene expression. In RNA splicing, for example, a pre-mRNA can yield one or more mature messenger RNAs that carry coded information with multiple biological functions. Identifying splicing sites in DNA/RNA sequences is therefore significant for both biomedical research and the discovery of new drugs. However, identification based on experimental techniques alone is expensive and time consuming, so new computational methods are needed. To identify splice donor sites and splice acceptor sites accurately and quickly, a deep sparse auto-encoder model with two hidden layers, called iSS-PC, was constructed based on the minimum error law. In this model, twelve physical-chemical properties of the dinucleotides within DNA were incorporated into PseDNC to formulate given sequence samples via a battery of cross-covariance and auto-covariance transformations. Five-fold cross-validation test results on the same benchmark data-sets indicated that the new predictor remarkably outperformed the existing prediction methods in this field. It is expected that many other related problems can also be studied with this approach. An easy-to-use web-server for identifying splicing sites has been established for free access at: http://www.jci-bioinfo.cn/iSS-PC.


Introduction

Generally, the pre-mRNA, which includes exons and one or more introns, is transcribed from a eukaryotic gene’s DNA template. There are two forms of splice sites in the pre-mRNA: exon-intron boundaries, i.e. the 5′ ends of the introns, are called splice donor sites or 5′ splice sites, and intron-exon boundaries, i.e. the 3′ ends of the introns, are called splice acceptor sites or 3′ splice sites, as shown in Fig. 1. Before the pre-mRNA becomes a mature messenger RNA (mRNA), it must go through several biological processes (Fig. 1). The final mRNA, which contains only the remaining exons, can be directly involved in protein synthesis. Thus, the biological process of removing each intron from its 5′ splice site to its 3′ splice site in the pre-mRNA and connecting the exons to form mRNA plays an important role in gene regulation and expression. Accurate identification of splice sites is therefore increasingly important.

Although PCR has become one of the most important methods for accurately identifying splice sites as techniques for identifying the functional sites of genes have developed, identification based on experimental techniques alone is very expensive and time consuming. Hence, there is an urgent need for effective computational methods that help researchers identify splice sites in a timely fashion. In this situation, web-based computational splice-site analysis tools emerged, such as NetGene^{1, 2}, SplicePredictor³, GeneSplicer⁴ and SplicePort⁵. Recently, Wei Chen et al.⁶ built a prediction model, “iSS-PseDNC”, which incorporated six DNA local structural properties into pseudo dinucleotide composition to identify splice donor and acceptor sites. In 2016, M Iqbal et al.⁷ used the PseTNC and PseTetraNC methods to propose a hybrid prediction model, called iSS-Hyb-mRMR, for identifying splice sites, and Prabina Kumar Meher⁸ used a hybrid feature extraction approach, which contains positional, dependency and compositional features, to develop a predictor called HSplice for predicting the donor splice sites in eukaryotic genes. These were, on balance, successful.

Although remarkable progress has been made in the identification of splice sites, splice-site predictors can still be improved, both in feature extraction and in the machine learning classification algorithm. In this paper, we present a solution that addresses both aspects to improve the performance of the predictive model.

On the one hand, improving the feature extraction method is of critical importance for classification performance. Since S Wold⁹ proposed the auto-covariance function (ACF) and cross-covariance function (CCF) in 1993 to analyze the relations between biopolymer sequences and chemical processes, this method has been applied to identify nuclear receptors and their subfamilies¹⁰ and N⁶-methyladenosine sites¹¹ by incorporating physical-chemical properties into pseudo amino acid composition (PseAAC) or pseudo dinucleotide composition (PseDNC), respectively. Encouraged by the above successes of introducing this feature extraction approach into computational proteomics, we use twelve physical-chemical properties of the dinucleotides within DNA, via a battery of cross-covariance and auto-covariance transformations, to obtain a mode of PseDNC that formulates given sequence samples.

On the other hand, the classification algorithm is one of the important factors affecting the performance of a classifier, and different classification algorithms generally perform differently. Conventional classification algorithms, such as Support Vector Machine (SVM)^12,13,14,15, random forest¹⁶, hidden Markov model¹⁷, Bayes¹⁸, covariance discriminant (CD)¹⁹, Minimax Probability Machine (MPM)²⁰ and so on, have limitations in processing the original data. Recently, a novel class of classification algorithms, deep learning, has been proposed based on big data, and it has overcome the former limitations. Deep learning algorithms mainly include the convolutional neural network (CNN)²¹, the deep belief network (DBN)²² and the stacked auto-encoder (SAE)^{23, 24}. Remarkable progress has been made in diverse fields such as speech recognition and image recognition. In 2014, L James et al.²⁵ first used SAE to predict the θ and τ angles used to represent the local backbone structure of proteins. In the same year, SP Nguyen et al.²⁶ built a model, “DL-Pro”, that learned an SAE network as a classifier for protein structures. In 2016, J Xu et al.²⁷ used the SAE algorithm for detection on breast cancer histopathology images, and W Xu et al.²⁸ constructed a model for human promoter recognition with SAE. Inspired by these achievements, the predictor called iSS-PC is constructed with a deep sparse auto-encoder in this paper, and its prediction performance has been greatly improved.

Based on a series of recent studies^29,30,31, we can conclude that the five steps³² shown in Fig. 2 should be followed to establish a really effective sequence-based biological predictor. Below, we discuss how to deal with these steps one by one. The order of these steps may be adjusted to suit the format of the journal “Scientific Reports”.

Results and Discussion

Selection of the characteristic parameter

As described in the Methods section later in the article, we can obtain a feature vector containing 144 × τ components to represent the given sample sequence D. Here τ is called the characteristic parameter and takes integer values. The dimension I of the feature vector therefore increases with the characteristic parameter τ, as shown below.

$$I=\begin{cases}288, & \tau =2\\ 432, & \tau =3\\ 576, & \tau =4\\ 720, & \tau =5\\ \ \vdots & \ \vdots \end{cases}$$

(1)

However, an oversized τ value leads to the curse of dimensionality. Thus, experiments were carried out with τ set to 2, 3, 4 and 5 in turn, and the results are listed in Table 1 and Table 2. As can be seen from Table 1, τ = 5 gives the best results, but there is little difference between the results given by τ = 4 and τ = 5, so we fix τ at 4 in order to reduce computation time. As can be seen from Table 2, τ = 4 gives the best results. We can then generate a feature vector containing 144 × 4 = 576 components as the input of the deep sparse auto-encoder for identifying splicing donor sites and splicing acceptor sites.

Table 1 The test results of splice donor site sequences based on different characteristic parameter τ values.


Table 2 The test results of splice acceptor site sequences based on different characteristic parameter τ values.


Comparison with the existing methods

The four metrics, i.e. accuracy (Acc), sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (Mcc), clearly reflect the performance of a predictor. The scores obtained by the new predictor “iSS-PC” via the five-fold cross-validation test on the benchmark dataset composed solely of splice donor site sequences are listed in Table 3, and the results for splice acceptor site sequences are listed in Table 4. For ease of comparison, the results obtained by the iSS-PseDNC predictor constructed by Wei Chen⁶ on the corresponding benchmark dataset are also listed in these tables.

Table 3 The comparison of the 5-fold cross-validation test results on benchmark data-set only containing splice donor site sequences.


Table 4 The comparison of the 5-fold cross-validation test results on benchmark data-set only containing splice acceptor site sequences.


As can be seen from Table 3, the Sn rate of the new predictor “iSS-PC” is only a little higher than that of the iSS-PseDNC predictor, whereas the scores of the other three metrics have improved greatly. For example, the Acc rate of our predictor “iSS-PC” has increased by nearly three percent, the Mcc rate by nearly six percent and the Sp rate also by nearly six percent. This indicates that our predictor is superior to the iSS-PseDNC predictor at identifying splice donor site sequences.

On the other hand, as can be seen from Table 4, although the Sn rate of the new iSS-PC predictor is 4% lower than that of the iSS-PseDNC predictor, the Sp rate of our predictor has increased by over 9 percent. Most importantly, the two most important indicators for ranking different algorithms have both increased: Acc by nearly 2.5 percent and Mcc by nearly 4.5 percent. This indicates that our predictor is also superior to the iSS-PseDNC predictor at identifying splice acceptor site sequences.

From the above analyses, we can conclude that the feature extraction and classification methods designed in this paper are very effective on the splice site sequences. It means that the iSS-PC predictor has higher prediction precision and consumes less time than the existing predictors.

Receiver operating characteristic (ROC) curves

The receiver operating characteristic (ROC) curve³³ is another important gauge of the performance of a predictor, and it presents that performance visually in graphical form. The area under the ROC curve (AUC) is a popular evaluation index for a binary classifier. Studies^{34, 35} indicate that the larger the AUC, the better the predictor’s performance.

In Figs 3 and 4, the blue curve is generated by the new predictor “iSS-PC”, and the green curve by the predictor “iSS-PseDNC” constructed by Wei Chen et al. The corresponding AUC values computed over five-fold cross-validation are also shown in Figs 3 and 4. From Fig. 3 it can be seen that, for splice donor site sequences, the AUC values are 0.9566 for iSS-PC and 0.9239 for iSS-PseDNC. For the splice acceptor site sequences, the AUC value generated by the predictor “iSS-PC” is 0.9628, whereas that generated by the predictor “iSS-PseDNC” is 0.9518, as shown in Fig. 4. The AUC value of the predictor “iSS-PC” is thus higher than that of the predictor “iSS-PseDNC” for both the splice donor and acceptor site sequences. Therefore, we can conclude that our predictor “iSS-PC” is superior to the predictor “iSS-PseDNC”, and the experimental results indicate that the predictor “iSS-PC” is accurate and stable.
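To make the AUC index concrete, the following minimal sketch computes it directly from its probabilistic meaning: the chance that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `auc_from_scores` and the toy labels and scores are illustrative assumptions, not part of the iSS-PC implementation.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 class labels (1 = true splice site).
    scores: iterable of predictor outputs, higher = more likely positive.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three of the four positive/negative pairs are ordered correctly.
toy_auc = auc_from_scores([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a small illustration; a rank-based computation would scale better on datasets of the size used here.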

Comparison with traditional high-effectiveness machine learning algorithms

SVM and random forest (RF) are traditional but efficient classification algorithms. In addition, Dynamic selection and Circulating Combination-based ensemble Clustering, i.e. libD3C^{36, 37}, is a popular tool for binary classification tasks. To perform classification prediction quickly and easily, users can download the libD3C package from the website: http://datamining.xmu.edu.cn/~gjs/LibD3C_1.1/index.html. WEKA, a free and open source software program, should also be downloaded and installed. The ensemble classification model constructed by libD3C can then be created in WEKA. In this paper, we compare the SAE model with these traditional machine learning algorithms to examine the performance of the new predictor. The results are listed in Tables 5 and 6.

Table 5 The 5-fold cross-validation test results obtained from different classification algorithms with the same feature extraction method on benchmark data-set only containing splice donor site sequences.


Table 6 The 5-fold cross-validation test results obtained from different classification algorithms with the same feature extraction method on benchmark data-set only containing splice acceptor site sequences.


The results in Tables 5 and 6 show that the rates of the two most important indicators, Acc and Mcc, obtained by our predictor “iSS-PC” are significantly higher than those of the other algorithms. This indicates that the SAE classification algorithm is more effective at identifying splice sites and that the new predictor “iSS-PC” would be a very useful tool in this regard.

Web server and its user guide

To help researchers identify splicing sites easily and in real time, a simple and practical network predictor called iSS-PC, shown in Fig. 5, has been developed. It is available at http://www.jci-bioinfo.cn/iSS-PC. The details of how to use the network predictor “iSS-PC” are given below.

(a)
If you want to get the information about the network predictor, please click the Read Me button. Then you can obtain a brief introduction of our predictor and the caveats for using it.
(b)
If you want to obtain the benchmark data-set for the iSS-PC predictor training and testing in this paper, please click the Supporting Information button. Here are a few data-sets for download, such as S ₁ only containing splice donor site sequences, S ₂ only containing splice acceptor site sequences.
(c)
If you want to get some important references and resources in establishing the iSS-PC predictor, please click the Citation button.
(d)
Before entering query sequences or uploading a file for batch prediction, you should choose types of splice sites: splice donor site or splice acceptor site.
(e)
The network predictor “iSS-PC” accepts single or multiple sequence queries. But the input sequences must be in FASTA format, or the network predictor may report errors and will request you to re-input your query sequence. Click the Example button on top of the first input box to see the input format.
(f)
To obtain the prediction results, enter the query sequences in the first input box (as in the Example window) and click the Submit button. The progress of the job will be shown on your screen. When the job is over, the results will be displayed in the page as “The number of DNA sequences investigated: X”, and “The DNA #xx is splice donor/acceptor site sequences” or “The DNA #xx is non-splice donor/acceptor site sequences”.
(g)
The lower panel of Fig. 5 offers the option for batch prediction. If you want to submit your batch of multiple sequences in FASTA format for prediction in order to avoid constantly online awaiting, please click the Browse button. The prediction results of each batch job will be sent to your e-mail address. Clicking the Batch-example button, you will see the examples of batch file in FASTA format.
(h)
The number of times the network predictor “iSS-PC” has been run is shown as a figure underneath the above graph. This number reflects the popularity of our predictor to a certain extent.

Conclusions

Feature extraction is a key problem in bioinformatics research. In this article, we incorporated twelve physical-chemical properties of the dinucleotides within DNA into PseDNC to formulate the given sequence samples via a battery of cross-covariance and auto-covariance transformations, and achieved good results. With further research on feature extraction methods and the development of computer technology, more and more web servers have emerged, such as Pse-in-One³⁸, repRNA³⁹, and repDNA⁴⁰. Many features such as pseudo amino acid composition (PseAAC), pseudo dinucleotide composition (PseDNC), pseudo trinucleotide composition (PseTNC), dinucleotide-based auto covariance (DAC) and dinucleotide-based cross covariance (DCC) can be generated by using these web servers. Therefore, in the future, we can try to study other similar genomic problems by using the feature extraction methods based on these web servers.

Classification algorithm design is another important step that can affect the performance of a predictor. In this paper, we used a deep sparse auto-encoder to construct the iSS-PC predictor. By using the same feature extraction method on the benchmark data-sets, we compared the SAE model with traditional machine learning algorithms, and found that the SAE classification algorithm was stable and reliable. Therefore, the new approach could be used to solve many important tasks in bioinformatics, such as those addressed by iRSpot-EL⁴¹, iDHS-EL⁴², and iEnhancer-2L⁴³. This work should be completed in the next phase. In fact, we have constructed a predictor called “iDHSs-PseTNC”⁴⁴ to identify DNase I hypersensitive sites with pseudo trinucleotide composition by deep sparse auto-encoder, and the results of the predictor iDHSs-PseTNC were superior to those of iDHS-EL.

In conclusion, the timely identification of the splicing sites in DNA sequences is significant for the intensive study of DNA function and the development of new drugs. The experimental results by five-fold cross-validation on the same benchmark datasets indicated that the iSS-PC predictor was superior to other predictors in this area. The results were promising enough for our predictor to be used as an analytic solution to more genomic problems, such as DNA-binding protein prediction⁴⁵, detection of tubule boundary⁴⁶, methylation site prediction⁴⁷, phosphorylation site prediction⁴⁸, and protein-protein interaction prediction⁴⁹.

Methods

Benchmark dataset

In this paper, the benchmark dataset is composed of two parts: splice donor site sequences and splice acceptor site sequences. The former can be denoted by S ₁, the latter can be formulated by S ₂, as shown below.

$${S}_{1}={S}_{1}^{+}\cup {S}_{1}^{-};{S}_{2}={S}_{2}^{+}\cup {S}_{2}^{-}$$

(2)

where ${S}_{1}^{+}$ represents the positive dataset containing 2796 true splice donor site sequences, while ${S}_{1}^{-}$ represents the negative dataset consisting of 2800 false splice donor site sequences; ${S}_{2}^{+}$ is the positive dataset composed of 2880 true splice acceptor site sequences, while ${S}_{2}^{-}$ is the negative dataset composed of 2800 false splice acceptor site sequences. The symbol $\cup $ denotes “union” in set theory. Datasets S ₁ and S ₂, provided by Wei Chen⁶, can be downloaded from the website: http://dx.doi.org/10.1155/2014/623149, or obtained from the Supplementary Information.

Feature extraction

Generally, input of nearly all the machine learning based classifiers must be numerical features but not sequences⁵⁰, therefore, splice site sequences should be transformed into numerical feature vectors. Below, let’s describe how to formulate a sample sequence into a discrete vector model.

A sequence sample in the current benchmark dataset can be generally expressed as

$${\rm{D}}={N}_{1}{N}_{2}{N}_{3}{N}_{4}{N}_{5}{N}_{6}{N}_{7}\cdots {N}_{L}$$

(3)

where N _i (i = 1, 2, …, L) represents the ith nucleotide of the sequence sample, which can be any one of the four nucleotides adenine (A), cytosine (C), guanine (G) and thymine (T), and L represents the length of the given sequence sample.

The literature has shown that, among the discrete vector models for a DNA sample, nucleic acid composition (NAC) is the simplest one. According to the NAC-discrete vector model, the given sequence sample D of Eq. (3) can be defined as

$${\rm{D}}={[\begin{array}{cccc}f(A) & f(C) & f(G) & f(T)\end{array}]}^{T}$$

(4)

where ${f}_{i}=f(\cdot )$, (i = 1, 2, 3, 4) is the normalized occurrence frequency of the corresponding nucleotide in the DNA sequence, and T is the transpose operator. In this way, however, all the sequence-order information of sequence D is lost.

As mentioned in the literature⁵¹, in order to incorporate more short-range sequence-order or local information, the k-tuple nucleotide composition or k-mers approach can be used to formulate the given sequence D into a feature vector containing 4^k components, i.e.

$${\rm{D}}={[\begin{array}{cccccc}{f}_{1} & {f}_{2} & {f}_{3} & \cdots & {f}_{{4}^{k}-1} & {f}_{{4}^{k}}\end{array}]}^{T}$$

(5)

where f ₁ is the normalized occurrence frequency of the first k-mer; f ₂, that of the second k-mer, and so on. It should be noted however, that k is usually not more than 4, otherwise it may cause over-fitting problem, “high-dimension disaster”⁵² and increase of computational run-time with the feature vector dimensions increasing.
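The NAC vector of Eq. (4) and the k-mer vector of Eq. (5) can be sketched with one short function, since NAC is simply the case k = 1. This is a minimal illustration, assuming that components are ordered alphabetically over A, C, G, T and that frequencies are normalized by the number of overlapping windows; the function name `kmer_composition` and both conventions are assumptions, as the text does not fix them.

```python
from itertools import product


def kmer_composition(seq, k):
    """Normalized occurrence frequency of every k-mer in a DNA sequence.

    Returns a list of 4**k frequencies, ordered alphabetically over the
    alphabet A, C, G, T (an assumed ordering). Overlapping windows are used.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        raise ValueError("sequence is shorter than k")
    for i in range(n_windows):
        window = seq[i:i + k]
        if window not in counts:
            raise ValueError(f"unexpected symbol in window {window!r}")
        counts[window] += 1
    return [counts[m] / n_windows for m in kmers]


nac = kmer_composition("ACGT", 1)   # Eq. (4): four components
dnc = kmer_composition("AACG", 2)   # Eq. (5) with k = 2: sixteen components
```

The vector length grows as 4^k, which is the dimensionality problem noted above for large k.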

To incorporate long-range or global sequence-order information, the pseudo components were proposed to deal with not only peptide/protein sequences, but also RNA/DNA sequences. As mentioned in the recent paper⁵³, the sequence D of Eq. (3) can be formulated as below by using the pseudo k-tuple nucleotide composition (PseKNC).

$${\rm{D}}={[\begin{array}{ccccccc}{\xi }_{1} & {\xi }_{2} & {\xi }_{3} & \cdots & {\xi }_{\mu } & \cdots & {\xi }_{I}\end{array}]}^{T}$$

(6)

where subscript I, the vector dimension, is an integer. Its value as well as the components in Eq. (6) will depend on how to extract the desired information from the sequence D.

Below, the “physical-chemical property matrix” and “auto-covariance and cross-covariance transformations” will be used to define the value of subscript I in Eq. (6).

Physical-chemical property matrix

DNA physical-chemical (PC) properties are among the most intuitive features of biochemical reactions. Each of the sixteen different dinucleotides or dimers in a DNA sequence (AA, AC, AG, AT, CA, …, TT) has its own PC property values. In this paper, the following twelve PC properties were adopted: (1) PC¹: A-philicity⁵⁴; (2) PC²: base stacking⁵⁵; (3) PC³: B-DNA twist⁵⁶; (4) PC⁴: bendability⁵⁷; (5) PC⁵: DNA bending stiffness⁵⁸; (6) PC⁶: DNA denaturation⁵⁹; (7) PC⁷: duplex disrupt energy⁶⁰; (8) PC⁸: duplex free energy⁶¹; (9) PC⁹: propeller twist⁵⁶; (10) PC¹⁰: protein deformation⁶²; (11) PC¹¹: protein-DNA twist⁶²; (12) PC¹²: Z-DNA⁶³. The original values of the twelve descriptors for each dinucleotide are listed in Table 7. For a sequence of length L, which contains L − 1 overlapping dinucleotides, we can then obtain a 12 × (L − 1) PC property matrix as shown below.

$${\rm{D}}=\left[\begin{array}{cccc}P{C}^{1}({N}_{1}{N}_{2}) & P{C}^{1}({N}_{2}{N}_{3}) & \cdots & P{C}^{1}({N}_{L-1}{N}_{L})\\ P{C}^{2}({N}_{1}{N}_{2}) & P{C}^{2}({N}_{2}{N}_{3}) & \cdots & P{C}^{2}({N}_{L-1}{N}_{L})\\ \vdots & \vdots & \ddots & \vdots \\ P{C}^{12}({N}_{1}{N}_{2}) & P{C}^{12}({N}_{2}{N}_{3}) & \cdots & P{C}^{12}({N}_{L-1}{N}_{L})\end{array}\right]$$

(7)

where PC ⁱ(N _j N _{j + 1}) represents the ith (i = 1, 2, …, 12) PC property value for the dinucleotide N _j N _{j + 1} in Eq. (3). However, the data of Table 7 should be normalized by the following equation before they were substituted into Eq. (7).

$${y}_{k}=({x}_{k}-mean(x))/std(x)$$

(8)

where x _k represents the original value in Table 7 of a given PC property for the kth (k = 1, 2, …, 16) dinucleotide, mean (x) represents the average value of that property over the sixteen dinucleotides, and std (x) the corresponding standard deviation. The converted values y _k will remain unchanged if they go through the same conversion procedure again.
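The standard conversion of Eq. (8), and the property that converted values are unchanged by a second conversion, can be checked with a few lines. This is a minimal sketch: the sixteen input values are hypothetical placeholders rather than entries of Table 7, and the use of the sample standard deviation (divisor n − 1) is an assumption, since the text does not say which form of std is meant.

```python
import statistics


def standardize(values):
    """Apply y_k = (x_k - mean(x)) / std(x) to one PC property (Eq. 8).

    Uses the sample standard deviation (divisor n - 1); this choice is an
    assumption, not something the text specifies.
    """
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    if std == 0:
        raise ValueError("property is constant; cannot standardize")
    return [(v - mean) / std for v in values]


# Hypothetical values for the sixteen dinucleotides (NOT taken from Table 7).
raw = [float(i * i) for i in range(16)]
once = standardize(raw)
twice = standardize(once)   # converting again leaves the values unchanged
```

After one pass the values have zero mean and unit standard deviation, so a second pass subtracts zero and divides by one, which is why the conversion is stable under repetition.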

Table 7 The original values of the twelve PC properties for each dinucleotide.


Auto-covariance and cross covariance

The concept of the auto-covariance function and cross-covariance function was proposed in 1993 for analyzing the relations between biopolymer sequences and chemical processes. According to the description of auto-covariance and cross-covariance transformations in the literature^{10, 11}, these transformations can be expressed by the following mathematical expressions.

$${\rm{AC}}({\rm{\mu }},{\rm{\tau }})=\frac{{\sum }_{j=1}^{L-1-\tau }[P{C}^{\mu }({N}_{j}{N}_{j+1})-\overline{P{C}^{\mu }}][P{C}^{\mu }({N}_{j+\tau }{N}_{j+1+\tau })-\overline{P{C}^{\mu }}]}{L-1-\tau }\,({\rm{\mu }}=1,2,\cdots ,12)$$

(9)

where AC represents the correlation of the same PC property between two sub-sequences separated by τ dinucleotides, τ = 1, 2, …, L − 2, and $\overline{P{C}^{\mu }}=\frac{{\sum }_{j=1}^{L-1}P{C}^{\mu }({N}_{j}{N}_{j+1})}{L-1}$ is the mean of the data along the μth row in the matrix of Eq. (7).

$${\rm{CC}}({n}_{1},{n}_{2},{\rm{\tau }})=\frac{{\sum }_{j=1}^{L-1-\tau }[P{C}^{{n}_{1}}({N}_{j}{N}_{j+1})-\overline{P{C}^{{n}_{1}}}][P{C}^{{n}_{2}}({N}_{j+\tau }{N}_{j+1+\tau })-\overline{P{C}^{{n}_{2}}}]}{L-1-\tau }\,({n}_{1}\ne {n}_{2})$$

(10)

where CC represents the correlation between two subsequences each belonging to a different PC property.

Letting the lag in Eqs (9) and (10) run from 1 up to the characteristic parameter τ, Eq. (9) generates 12 × τ components associated with the PC properties of a sample sequence D in Eq. (3), and Eq. (10) generates 12 × 11 × τ components. We can therefore generate (12 × τ + 12 × 11 × τ) = 144 × τ components by ACF and CCF via the 12 different PC properties, and the sample sequence D can eventually be formulated by

$${\rm{D}}={[\begin{array}{ccccccc}{\xi }_{1} & {\xi }_{2} & {\xi }_{3} & \cdots & {\xi }_{\mu } & \cdots & {\xi }_{144\times \tau }\end{array}]}^{T}$$

(11)

where ξ _μ represents the μth of the 144 × τ components generated by Eqs (9) and (10) as described above.
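The route from a sequence to the 144 × τ feature vector of Eq. (11) can be sketched as follows. This is a minimal illustration under stated assumptions: the property table is filled with random placeholder numbers instead of the normalized Table 7 values, the helper names `pc_matrix` and `acc_features` are hypothetical, and the ordering of components inside the vector (auto- and cross-covariance terms interleaved per lag) is an arbitrary choice that the text does not prescribe.

```python
from itertools import product

import numpy as np

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]  # AA, AC, ..., TT


def pc_matrix(seq, table):
    """Build the n_props x (L - 1) property matrix of Eq. (7).

    table: array of shape (n_props, 16), columns ordered as DINUCLEOTIDES.
    """
    idx = [DINUCLEOTIDES.index(seq[j:j + 2]) for j in range(len(seq) - 1)]
    return table[:, idx]


def acc_features(pc, max_lag):
    """Auto- and cross-covariance features (Eqs. 9 and 10) for lags 1..max_lag."""
    n_props, n_cols = pc.shape            # n_cols = L - 1
    if max_lag >= n_cols:
        raise ValueError("max_lag must be smaller than L - 1")
    centered = pc - pc.mean(axis=1, keepdims=True)   # subtract each row mean
    blocks = []
    for lag in range(1, max_lag + 1):
        head = centered[:, :n_cols - lag]  # terms at positions j
        tail = centered[:, lag:]           # terms at positions j + lag
        # cov[n1, n2] is CC(n1, n2, lag); the diagonal holds AC(n1, lag).
        cov = head @ tail.T / (n_cols - lag)
        blocks.append(cov.ravel())
    return np.concatenate(blocks)


rng = np.random.default_rng(0)
placeholder_table = rng.normal(size=(12, 16))   # stand-in for normalized Table 7
sample = "ACGTTGCAAGGTCCATGA"                    # hypothetical query sequence
matrix = pc_matrix(sample, placeholder_table)
features = acc_features(matrix, max_lag=4)       # 144 * 4 = 576 components
```

Computing all pairs with one matrix product per lag yields the 12 auto-covariance and 132 cross-covariance terms together, which matches the count 12 × τ + 12 × 11 × τ given above.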

Deep sparse auto-encoder

In 1986, DE Rumelhart et al.⁶⁴ first proposed the concept of an auto-encoder to process large, complex, high-dimensional data. In 2006, GE Hinton et al.²² improved the prototype structure of the auto-encoder, giving rise to the deep auto-encoder (DAE). Thereafter, in 2008, Y Bengio et al.⁶⁵ proposed the concept of the sparse auto-encoder, which took the study of DAE much deeper. In 2010, P Vincent²⁴ developed the stacked de-noising auto-encoder to yield significantly lower classification error.

Based on this research²², we constructed a deep sparse auto-encoder model with two hidden layers in this paper, as shown in Fig. 6. To implement classification accurately and quickly based on the minimum error law, deep learning software packages, including the SAE and NN software, can be used; they can be obtained from the website https://github.com/rasmusbergpalm/DeepLearnToolbox. Note that, to optimize the effectiveness of the SAE algorithm, the model parameters should be fine-tuned by loop optimization to obtain the best results.
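To give a feel for what a sparse auto-encoder layer optimizes, the sketch below evaluates one common formulation of its objective: a squared reconstruction error plus a Kullback-Leibler penalty that pushes the mean hidden activation towards a small target. It is not the authors' code or the toolbox's implementation. The KL form of the sparsity term, the sigmoid units, the hidden sizes (200 and 50), the sparsity target `rho`, the weight `beta` and all function names are assumptions chosen for illustration, and only the forward pass is shown, not training.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_layer(n_in, n_hidden, rng):
    """Small random weights for one encoder/decoder pair (hypothetical sizes)."""
    return {
        "w_enc": rng.normal(scale=0.01, size=(n_in, n_hidden)),
        "b_enc": np.zeros(n_hidden),
        "w_dec": rng.normal(scale=0.01, size=(n_hidden, n_in)),
        "b_dec": np.zeros(n_in),
    }


def sparse_ae_loss(x, layer, rho=0.05, beta=3.0):
    """Reconstruction error plus KL sparsity penalty for one auto-encoder layer.

    x: array (n_samples, n_in), assumed scaled to [0, 1].
    Returns (loss, hidden_activations).
    """
    hidden = sigmoid(x @ layer["w_enc"] + layer["b_enc"])
    recon = sigmoid(hidden @ layer["w_dec"] + layer["b_dec"])
    recon_err = 0.5 * np.mean(np.sum((recon - x) ** 2, axis=1))
    # Mean activation of each hidden unit, clipped to keep the logs finite.
    rho_hat = np.clip(hidden.mean(axis=0), 1e-8, 1 - 1e-8)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return recon_err + beta * kl, hidden


rng = np.random.default_rng(0)
x = rng.random((8, 576))               # placeholder batch of 576-dimensional inputs
layer1 = init_layer(576, 200, rng)     # first hidden layer (size is an assumption)
layer2 = init_layer(200, 50, rng)      # second hidden layer (size is an assumption)
loss1, h1 = sparse_ae_loss(x, layer1)
loss2, h2 = sparse_ae_loss(h1, layer2)  # stacking: layer 2 encodes layer 1's output
```

In a stacked model of this kind, each layer is typically pre-trained on the output of the previous one, and the whole network with a classification layer on top is then fine-tuned on the labels.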

The predictor established according to the above-mentioned procedures is called ‘iSS-PC’, where ‘i’ stands for ‘identifying’, ‘SS’ for ‘splicing sites’ and ‘PC’ for ‘physical-chemical property’.

There are two issues to be dealt with: one is ‘what metrics should be used to examine the accuracy of the predictor?’ The other is ‘what validation method should be taken to calculate the metric values?’

A set of metrics for measuring prediction quality

As mentioned in the literature, accuracy (Acc), sensitivity (Sn), specificity (Sp), and the Matthews correlation coefficient (Mcc) introduced by Chou⁶⁶ are the most frequently used metrics to evaluate the performance of a predictor in bioinformatics. To make them easier for researchers to understand, the four metrics can be formulated as below^{30, 67}.

$$\begin{cases}Acc=1-\dfrac{{N}_{-}^{+}+{N}_{+}^{-}}{{N}^{+}+{N}^{-}}\\ Mcc=\dfrac{1-\left(\dfrac{{N}_{-}^{+}}{{N}^{+}}+\dfrac{{N}_{+}^{-}}{{N}^{-}}\right)}{\sqrt{\left(1+\dfrac{{N}_{+}^{-}-{N}_{-}^{+}}{{N}^{+}}\right)\left(1+\dfrac{{N}_{-}^{+}-{N}_{+}^{-}}{{N}^{-}}\right)}}\\ Sn=1-\dfrac{{N}_{-}^{+}}{{N}^{+}}\\ Sp=1-\dfrac{{N}_{+}^{-}}{{N}^{-}}\end{cases}$$

(12)

where N ⁺ is the total number of true splice donor site sequences (or true splice acceptor site sequences) investigated, ${N}_{-}^{+}$ is the number of true splice donor site sequences (true splice acceptor site sequences) misidentified as false splice donor site sequences (false splice acceptor site sequences), N ⁻ is the total number of false splice donor site sequences (false splice acceptor site sequences) investigated, and ${N}_{+}^{-}$ is the number of false splice donor site sequences (false splice acceptor site sequences) mis-predicted as true splice donor site sequences (true splice acceptor site sequences).

However, it should be noted that the four metrics formulated in Eq. (12) are valid only for the single-label systems, but unsuitable for multi-label systems appearing frequently in system biology and system medicine. For the latter, an utterly different set of metrics is needed as elaborated in the literature⁶⁸.
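The four metrics of Eq. (12) translate directly into code. The sketch below is illustrative: the function name `chou_metrics` and the counts in the usage example are hypothetical and are not results from Tables 3 to 6.

```python
import math


def chou_metrics(n_pos, n_neg, pos_missed, neg_misjudged):
    """Sn, Sp, Acc and Mcc as formulated in Eq. (12).

    n_pos:         N+, total number of true splice site sequences
    n_neg:         N-, total number of false splice site sequences
    pos_missed:    N+_-, true sites predicted as false
    neg_misjudged: N-_+, false sites predicted as true
    """
    sn = 1 - pos_missed / n_pos
    sp = 1 - neg_misjudged / n_neg
    acc = 1 - (pos_missed + neg_misjudged) / (n_pos + n_neg)
    denom = math.sqrt(
        (1 + (neg_misjudged - pos_missed) / n_pos)
        * (1 + (pos_missed - neg_misjudged) / n_neg)
    )
    # The denominator is zero only when every sample gets the same prediction.
    mcc = (1 - (pos_missed / n_pos + neg_misjudged / n_neg)) / denom if denom > 0 else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "Mcc": mcc}


# Hypothetical counts: 100 positives with 10 missed, 100 negatives with 20 misjudged.
example = chou_metrics(n_pos=100, n_neg=100, pos_missed=10, neg_misjudged=20)
```

With these counts the formulation agrees with the usual confusion-matrix definition of the Matthews coefficient (TP = 90, TN = 80, FP = 20, FN = 10).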

Cross-validation

With the four well-known metrics above adopted to evaluate the performance of predictors, the next question is which validation method should be used to calculate their values. Generally speaking, there are three popular validation approaches in statistical prediction: the independent dataset test, K-fold cross-validation and the jackknife test. Although the jackknife test, which always yields a unique output for a given benchmark dataset, seems the least arbitrary, K-fold cross-validation requires less computational time. Therefore, in this paper, we adopt five-fold cross-validation to score the four metrics. The specific procedure is described below.

Firstly, for the benchmark dataset S ₁ of Eq. (2), which consists of splice donor site sequences, we randomly divided each of the data-sets ${S}_{1}^{+}$ and ${S}_{1}^{-}$ into five subsets of approximately equal size, as shown below

$$\{\begin{array}{c}{S}_{1}^{+}={S}_{11}^{+}\cup {S}_{12}^{+}\cup {S}_{13}^{+}\cup {S}_{14}^{+}\cup {S}_{15}^{+}\\ {S}_{1}^{-}={S}_{11}^{-}\cup {S}_{12}^{-}\cup {S}_{13}^{-}\cup {S}_{14}^{-}\cup {S}_{15}^{-}\end{array}$$

(13)

where ${S}_{1i}^{+}$ is the subset of ${S}_{1}^{+}$ whose partition label is i (i = 1, 2, …, 5). Similarly, ${S}_{1i}^{-}$ is the subset of ${S}_{1}^{-}$ whose partition label is i. The subsets ${S}_{1i}^{+}$ and ${S}_{1i}^{-}$ satisfy the following conditions.

$$\{\begin{array}{c}|{S}_{11}^{+}|\approx |{S}_{12}^{+}|\approx |{S}_{13}^{+}|\approx |{S}_{14}^{+}|\approx |{S}_{15}^{+}|\\ |{S}_{11}^{-}|\approx |{S}_{12}^{-}|\approx |{S}_{13}^{-}|\approx |{S}_{14}^{-}|\approx |{S}_{15}^{-}|\end{array}$$

(14)

where $|{S}_{11}^{+}|$ denotes the number of elements (samples) in ${S}_{11}^{+}$, and so forth.

Finally, combining the positive and negative subsets that carry the same partition label gives five subsets of the benchmark dataset S ₁, as shown below

$${S}_{1}={S}_{1}^{^{\prime} }\cup {S}_{2}^{^{\prime} }\cup {S}_{3}^{^{\prime} }\cup {S}_{4}^{^{\prime} }\cup {S}_{5}^{^{\prime} }$$

(15)

where ${S}_{1}^{^{\prime} }={S}_{11}^{+}\cup {S}_{11}^{-},{S}_{2}^{^{\prime} }={S}_{12}^{+}\cup {S}_{12}^{-}$, and so forth.

with

$$|{S}_{1}^{^{\prime} }|\approx |{S}_{2}^{^{\prime} }|\approx |{S}_{3}^{^{\prime} }|\approx |{S}_{4}^{^{\prime} }|\approx |{S}_{5}^{^{\prime} }|$$

(16)
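The partition described by Eqs (13)–(16) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the helper names `split_into_folds` and `stratified_folds`, the fixed random seed, and the toy sample identifiers are all hypothetical and do not come from the paper. Positives and negatives are shuffled and split separately, so each merged subset keeps roughly the class ratio of the whole dataset.

```python
import random


def split_into_folds(samples, k=5, seed=0):
    """Shuffle samples and deal them into k subsets whose sizes differ by at most one."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Taking every k-th element gives k near-equal subsets, as in Eq. (14).
    return [shuffled[i::k] for i in range(k)]


def stratified_folds(positives, negatives, k=5, seed=0):
    """Merge the i-th positive and i-th negative subsets, as in Eq. (15).

    Each returned fold is a list of (sample, label) pairs, with label 1 for
    true sites and 0 for false sites (the labels are an assumption).
    """
    pos_folds = split_into_folds(positives, k, seed)
    neg_folds = split_into_folds(negatives, k, seed + 1)
    return [
        [(x, 1) for x in pos_folds[i]] + [(x, 0) for x in neg_folds[i]]
        for i in range(k)
    ]


# Toy identifiers standing in for sequences: 23 positives and 47 negatives.
positives = [f"p{i}" for i in range(23)]
negatives = [f"n{i}" for i in range(47)]
folds = stratified_folds(positives, negatives)
print([len(fold) for fold in folds])
```

Because the positive and negative subsets each differ in size by at most one, the merged subsets differ by at most two samples, which matches the approximate equality stated in Eq. (16).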

Therefore, each of the five subsets of Eq. (15) can be singled out in turn to test the model trained on the remaining four subsets for identifying splice donor site sequences. The cross-validation is carried out five times, and the average of the resulting scores is regarded as the final outcome. Note that the same cross-validation process can be used for the benchmark data-set S ₂ of Eq. (2), which consists of splice acceptor site sequences.
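The hold-one-subset-out loop can be sketched as follows. This is a hedged illustration, not the iSS-PC pipeline: the function `cross_validate`, the single-feature threshold model (`train_threshold`, `predict_threshold`) and the toy folds are hypothetical stand-ins for the deep sparse auto-encoder and the real feature vectors, used only to show how the per-fold scores are averaged.

```python
from statistics import mean


def cross_validate(folds, train, predict, score):
    """Hold out each fold once, train on the others, and average the scores."""
    per_fold = []
    for i, test_fold in enumerate(folds):
        # Training set: every sample from the folds other than fold i.
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train(training)
        predictions = [predict(model, x) for x, _ in test_fold]
        labels = [y for _, y in test_fold]
        per_fold.append(score(labels, predictions))
    return mean(per_fold)


# Hypothetical stand-in model: a threshold halfway between the class means.
def train_threshold(training):
    pos = [x for x, y in training if y == 1]
    neg = [x for x, y in training if y == 0]
    return (mean(pos) + mean(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


def accuracy(labels, predictions):
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


# Toy folds: one numeric feature per sample, positives near 1 and negatives near 0.
toy_folds = [[(1.0 + 0.01 * i, 1), (0.0 + 0.01 * i, 0)] for i in range(5)]
print(cross_validate(toy_folds, train_threshold, predict_threshold, accuracy))
```

Passing `train`, `predict` and `score` as functions keeps the loop independent of the classifier, so the same routine could in principle be reused for the acceptor-site dataset or for any of the four metrics.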

References

Brunak, S., Engelbrecht, J. & Knudsen, S. Prediction of human mRNA donor and acceptor sites from the DNA sequence. Journal of Molecular Biology 220, 49–65 (1991).
Hebsgaard, S. M., Korning, P. G., Tolstrup, N., Engelbrecht, J. & Rouz, P. Splice site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence information. Nucleic Acids Research 24, 3439–3452 (1996).
Brendel, V. & Kleffe, J. Prediction of locally optimal splice sites in plant pre-mRNA with applications to gene identification in Arabidopsis thaliana genomic DNA. Nucleic Acids Research 26, 4748–4757 (1998).
Pertea, M., Lin, X. & Salzberg, S. L. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Research 29, 1185–1190 (2001).
Dogan, R. I., Getoor, L., Wilbur, W. J. & Mount, S. M. SplicePort–an interactive splice-site analysis tool. Nucleic Acids Research 35, W285–291 (2007).
Chen, W., Feng, P. M., Lin, H. & Chou, K. C. iSS-PseDNC: Identifying Splicing Sites Using Pseudo Dinucleotide Composition. Biomed Research International 2014, 623149 (2014).
Iqbal, M. & Hayat, M. “iSS-Hyb-mRMR”: Identification of splicing sites using hybrid space of pseudo trinucleotide and pseudo tetranucleotide composition. Computer Methods & Programs in Biomedicine 128, 1–11 (2016).
Meher, P. K., Sahu, T. K., Rao, A. R. & Wahi, S. D. Identification of donor splice sites using support vector machine: a computational approach based on positional, compositional and dependency features. Algorithms for Molecular Biology 11, 16 (2016).
Wold, S., Jonsson, J., Sjörström, M., Sandberg, M. & Rännar, S. DNA and peptide sequences and chemical processes multivariately modelled by principal component analysis and partial least-squares projections to latent structures. Analytica Chimica Acta 277, 239–253 (1993).
Xiao, X., Wang, P. & Chou, K. C. iNR-PhysChem: a sequence-based predictor for identifying nuclear receptors and their subfamilies via physical-chemical property matrix. Plos One. 7, e30869 (2012).
Liu, Z. et al. pRNAm-PC: Predicting N(6)-methyladenosine sites in RNA sequences via physical-chemical properties. Analytical Biochemistry. 497, 60–67 (2015).
Cai, Y. D., Ricardo, P. W., Jen, C. H. & Chou, K. C. Application of SVM to predict membrane protein types. Journal of Theoretical Biology 226, 373–376 (2004).
Gu, B. & Sheng, V. S. A Robust Regularization Path Algorithm for ν-Support Vector Classification. IEEE Transactions on Neural Networks & Learning Systems 99, 1–8 (2016).
Gu, B. et al. Incremental learning for ν -Support Vector Regression. Neural Networks the Official Journal of the International Neural Network Society 67, 140–150 (2015).
Gu, B., Sheng, V. S. & Li, S. Bi-parameter space partition for cost-sensitive SVM. AAAI Press 1, 3532–3539 (2015).
Kandaswamy, K. K. et al. AFP-Pred: A random forest approach for predicting antifreeze proteins from sequence-derived properties. Journal of Theoretical Biology 270, 56–62 (2011).
Krogh, A., Larsson, B., Von, H. G. & Sonnhammer, E. L. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of Molecular Biology 305, 567–580 (2001).
Yang, Z., Wong, W. S. W. & Nielsen, R. Bayes empirical bayes inference of amino acid sites under positive selection. Molecular Biology & Evolution 22, 1107–1118 (2005).
Chou, K. C. A Key Driving Force in Determination of Protein Structural Classes. Biochemical & Biophysical Research Communications 264, 216–224 (1999).
Gu, B., Sun, X. & Sheng, V. S. Structural Minimax Probability Machine. IEEE Transactions on Neural Networks & Learning Systems 99, 1–11 (2016).
Lawrence, S., Giles, C. L., Tsoi, A. C. & Back, A. D. Face recognition: a convolutional neural-network approach. IEEE Transactions on Neural Networks 8, 98–113 (1997).
Hinton, G. E., Osindero, S. & Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation. 18, 1527–1543 (2006).
Olshausen, B. A. & Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature. 381, 607–609 (1996).
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y. & Manzagol, P. A. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research 11, 3371–3408 (2010).
James, L. et al. Predicting backbone Cα angles and dihedrals from protein sequences by stacked sparse auto‐encoder deep neural network. Journal of Computational Chemistry 35, 2040–2046 (2014).
Nguyen, S. P., Shang, Y. & Xu, D. DL-PRO: A Novel Deep Learning Method for Protein Model Quality Assessment. International Joint Conference on Neural Networks. 2014, 2071–2078 (2014).
Xu, J. et al. Stacked Sparse Autoencoder (SSAE) for Nuclei Detection on Breast Cancer Histopathology images. IEEE Transactions on Medical Imaging 35, 119–130 (2016).
Xu, W., Zhang, L. & Lu, Y. SD-MSAEs: Promoter Recognition in Human Genome based on Deep Feature Extraction. Journal of Biomedical Informatics 61, 55–62 (2016).
Chou, K. C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins-structure Function & Bioinformatics 43, 246–255 (2001).
Chen, W., Feng, P. M., Lin, H. & Chou, K. C. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Research 41, e68 (2013).
Chen, W., Feng, P. M., Deng, E. Z., Lin, H. & Chou, K. C. iTIS-PseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition. Analytical Biochemistry. 462, 76–83 (2014).
Chou, K. C. Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of Theoretical Biology 273, 236–247 (2011).
Lu, Q., Obuchowski, N., Won, S., Zhu, X. & Elston, R. C. Using the optimal robust receiver operating characteristic (ROC) curve for predictive genetic tests. Biometrics. 66, 586–593 (2010).
Fawcett, T. ROC Graphs: Notes and Practical Considerations for Data Mining Researchers. Machine Learning. 31, 1–38 (2004).
Grau, J., Grosse, I. & Keilwagen, J. PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R. Bioinformatics. 31, 2595–2616 (2015).
Zou, Q. et al. An approach for identifying cytokines based on a novel ensemble classifier. Biomed Research International 2013, 1–11 (2013).
Lin, C. et al. LibD3C: Ensemble classifiers with a clustering and dynamic selection strategy. Neurocomputing. 123, 424–435 (2014).
Liu, B. et al. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Research 43, 65–71 (2015).
Liu, B., Liu, F., Wang, X. & Chou, K. C. repRNA: a web server for generating various feature vectors of RNA sequences. Molecular Genetics and Genomics 291, 473–481 (2016).
Liu, B., Liu, F., Fang, L., Wang, X. & Chou, K. C. repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics. 31, 1307–1309 (2015).
Liu, B., Wang, S., Long, R. & Chou, K. C. iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics. 33, 35–41 (2016).
Liu, B., Long, R. & Chou, K. C. iDHS-EL: Identifying DNase I hypersensitive-sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework. Bioinformatics. 32, 2411–2418 (2016).
Liu, B., Fang, L., Ren, L., Lan, X. & Chou, K. C. iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics. 32, 362–270 (2016).
Xu, Z. C., Jiang, S. Y., Qiu, W. R., Liu, Y. C. & Xiao, X. iDHSs-PseTNC: Identifying DNase I Hypersensitive Sites with Pseuo Trinucleotide Component by Deep Sparse Auto-Encoder. Letters in Organic Chemistry. 14, http://www.eurekaselect.com/150033 (2017).
Wei, L., Tang, J. & Zou, Q. Local-DPP: An improved DNA-binding protein prediction method by exploring local evolutionary information. Information Sciences. 384, 135–144 (2016).
Su, R. et al. Detection of tubule boundaries based on circular shortest path and polar‐transformation of arbitrary shapes. Journal of Microscopy 264, 127–142 (2016).
Wei, L., Xing, P., Shi, G., Ji, Z. L. & Zou, Q. Fast prediction of protein methylation sites using a sequence-based feature selection technique. IEEE/ACM Transactions on Computational Biology & Bioinformatics. 99, doi:10.1109/TCBB.2017.2670558 (2017).
Wei, L., Xing, P., Tang, J. & Zou, Q. PhosPred-RF: a novel sequence-based predictor for phosphorylation sites using sequential information only. IEEE Trans Nanobioscience. 99, doi:10.1109/TNB.2017.2661756 (2017).
Wei, L. et al. Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier. Artificial Intelligence in Medicine. doi:10.1016/j.artmed.2017.03.001 (2017).
Chou, K. C. Impacts of bioinformatics to medicinal chemistry. Medicinal Chemistry. 11, 218–234 (2014).
Chen, W., Lei, T. Y., Jin, D. C., Lin, H. & Chou, K. C. PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. Analytical Biochemistry. 456, 53–60 (2014).
Wang, T., Yang, J., Shen, H. B. & Chou, K. C. Predicting membrane protein types by the LLDA algorithm. Protein & Peptide Letters 15, 915–921 (2008).
Wei, C., Hao, L. & Chou, K. C. Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. Molecular Biosystems. 11, 2620–2634 (2015).
Ivanov, V. I. et al. CRP-DNA complexes: inducing the A-like form in the binding sites with an extended central spacer. Journal of Molecular Biology 245, 228–240 (1995).
Ornstein, R. L. & Rein, R. An optimized potential function for the calculation of nucleic acid interaction energies I. Base stacking. Biopolymers. 17, 2341–2360 (1978).
Gorin, A. A., Zhurkin, V. B. & Olson, W. K. B-DNA twisting correlates with base-pair morphology. Journal of Molecular Biology 247, 34–48 (1995).
Vlahoviček, K., Kaján, L. & Pongor, S. DNA analysis servers: plot.it, bend.it, model.it and IS. Nucleic Acids Research 31, 3686–3687 (2003).
Sivolob, A. V. & Khrapunov, S. N. Translational positioning of nucleosomes on DNA: the role of sequence-dependent isotropic DNA bending stiffness. Journal of Molecular Biology 247, 918–931 (1995).
Bram, J. Encyclopedia of molecular biology and molecular medicine. Cell Biochemistry & Function 95, 73–74 (1997).
Breslauer, K. J., Frank, R., Blöcker, H. & Marky, L. A. Predicting DNA duplex stability from the base sequence. Proceedings of the National Academy of Sciences 83, 3746–3750 (1986).
Sugimoto, N., Nakano, S., Yoneyama, M. & Honda, K. Improved Thermodynamic Parameters and Helix Initiation Factor to Predict Stability of DNA Duplexes. Nucleic Acids Research 24, 4501–4505 (1996).
Olson, W. K., Gorin, A. A., Lu, X. J., Hock, L. M. & Zhurkin, V. B. DNA sequence-dependent deformability deduced from protein–DNA crystal complexes. Proceedings of the National Academy of Sciences of the United States of America 95, 11163–11168 (1998).
Ho, P. S., Ellison, M. J., Quigley, G. J. & Rich, A. A computer aided thermodynamic approach for predicting the formation of Z-DNA in naturally occurring sequences. Embo Journal. 5, 2737–2744 (1986).
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature. 323, 533–536 (1986).
Bengio, Y., Lamblin, P., Popovici, D. & Larochelle, H. Advances in Neural Information Processing Systems 19. Chinese Medical Ethics 23, 80–83 (2008).
Chou, K. C. Using subsite coupling to predict signal peptides. Protein Engineering 14, 75–79 (2001).
Xu, Y., Ding, J., Wu, L. Y. & Chou, K. C. iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. Plos One. 8, e55844 (2013).
Chou, K. C. Some Remarks on Predicting Multi-Label Attributes in Molecular Biosystems. Molecular Biosystems. 9, 1092–1100 (2013).

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (No. 31560316, 61261027, 61300139), the China Scholarship Council (No. 201508360047), the Natural Science Foundation of Jiangxi Province, China (No. 20142BAB207013, 20171ACB20023, 20171BAB202020), the Department of Education of Jiangxi Province (GJJ160866, GJJ160909, GJJ160910), and the China Postdoctoral Science Foundation Funded Project (Project No. 2017M612949). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and Affiliations

Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333403, China
Zhao-Chun Xu, Peng Wang, Wang-Ren Qiu & Xuan Xiao
Department of Computer Science and Bond Life Science Center, University of Missouri, Columbia, MO, USA
Wang-Ren Qiu
Gordon Life Science Institute, Boston, Massachusetts, 02478, United States of America
Xuan Xiao

Contributions

Xuan Xiao designed the study. Peng Wang collected data. Wang-Ren Qiu conceived and developed the computational model. Zhao-Chun Xu established the web server iSS-PC and wrote the article. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Zhao-Chun Xu, Wang-Ren Qiu or Xuan Xiao.

Ethics declarations

Competing Interests

The authors declare that they have no competing interests.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Xu, ZC., Wang, P., Qiu, WR. et al. iSS-PC: Identifying Splicing Sites via Physical-Chemical Properties Using Deep Sparse Auto-Encoder. Sci Rep 7, 8222 (2017). https://doi.org/10.1038/s41598-017-08523-8

Received: 15 May 2017
Accepted: 10 July 2017
Published: 15 August 2017
DOI: https://doi.org/10.1038/s41598-017-08523-8
