iMiRNA-SSF: Improving the Identification of MicroRNA Precursors by Combining Negative Sets with Different Distributions

The identification of microRNA precursors (pre-miRNAs) helps in understanding regulatory mechanisms in biological processes. The performance of computational predictors depends on their training sets, in which the negative sets play an important role. We therefore investigated the influence of benchmark datasets on the predictive performance of computational predictors for miRNA identification, and found that the negative samples have a significant impact on the predictive results of various methods. We constructed a new benchmark set whose negative samples follow different data distributions. Trained on this high-quality benchmark dataset, a new computational predictor called iMiRNA-SSF was proposed, which employs various features extracted from RNA sequences. Experimental results showed that iMiRNA-SSF outperforms three state-of-the-art computational methods. For practical applications, a web-server of iMiRNA-SSF was established at http://bioinformatics.hitsz.edu.cn/iMiRNA-SSF/.

Scientific Reports | 6:19062 | DOI: 10.1038/srep19062

... classifiers treat the pre-miRNA identification problem as a binary classification problem. Currently, the widely used classification algorithms include Support Vector Machine (SVM) [17,25], Hidden Markov Model (HMM) [26], Random Forest (RF) [18], and Naive Bayes (NB) [27]. The features widely used to characterize pre-miRNAs include stem-loop hairpin structures [28,29], the minimum free energy (MFE) of the pre-miRNAs, and the P-value of the randomization test [18-20,30]. Because features are important for constructing a predictor, several web-servers and stand-alone tools have recently been proposed to extract features from RNA sequences, such as Pse-in-One [31] and repRNA [32]. MiPred [18] identified human pre-miRNAs by combining Triplet-SS, MFE and P-value, of which MFE and P-value were the two most important features. MiRanalyzer [19] was trained with a variety of features, of which MFE was the second most important. miRNApre was built on Triplet-SS, primary sequence composition features, and MFE. However, according to the feature analysis of miRNApre, the MFE feature did not improve its performance. Therefore, it is interesting to explore why the same feature has different discriminative power in different predictors.
Furthermore, several other challenging problems remain to be solved in this field: (1) Many features have been proposed to characterize pre-miRNAs, but their discriminative power has not been investigated. Some features showed strong discriminative power in some predictors but only limited power in others; for example, MFE played an important role in Triplet-SVM, but contributed almost nothing to the discriminative power of miRNApre. Therefore, the most discriminative features and their combinations for miRNA identification should be investigated. (2) The existing benchmark datasets are too small to reflect the statistical profile. Most of these datasets contain only several hundred real pre-miRNA samples and pseudo pre-miRNA samples. It is necessary to construct an updated benchmark dataset to fairly evaluate the performance of different methods. (3) Most of these methods performed well in cross-validation tests, but showed much lower performance on independent testing sets. This is because the samples in the training set are not representative enough, especially the pseudo pre-miRNA samples (negative samples). There is no gold standard for selecting or constructing negative samples [33,34].
To solve these problems, we investigated the distributions of various benchmark datasets and found that they varied widely, especially the distributions of negative samples. A series of controlled experiments was conducted to find out how performance was affected by different distributions of negative samples. The results showed that the negative samples were not representative enough. Therefore, the key to improving predictive performance was to construct a high-quality benchmark dataset for miRNA identification. In this regard, a new benchmark dataset was constructed, in which the positive samples were extracted from miRBase [35-37], and the negative samples were selected from existing datasets with different data distributions. Finally, we proposed a new computational method for pre-miRNA identification, called iMiRNA-SSF, which employs sequence and structure features and is trained on the updated benchmark dataset. The web-server of iMiRNA-SSF can be accessed at http://bioinformatics.hitsz.edu.cn/iMiRNA-SSF/.

Results
Negative samples have significant impact on the discriminative power of features. As reported in the literature, MFE and P-value were the two most important features in MiPred [18], but they were not so important in miRNApre [20] (outside the top 10 features). The main difference between these two methods was the negative samples in their benchmark datasets. The negative samples of MiPred were collected from the protein coding regions with a parameter filtering method, while the negative samples of miRNApre were collected by a multi-level process. For more details, please refer to [17,20]. Our hypothesis was that the different discriminative power of the same feature was caused by the negative samples. In order to validate this hypothesis, two datasets $S_{xue}$ and $S_{zou}$ were constructed with the same positive set and different negative sets:
$$S_{xue} = S^{+} \cup S^{-}_{xue}, \qquad S_{zou} = S^{+} \cup S^{-}_{zou}$$
where $S^{+}$, $S^{-}_{xue}$ and $S^{-}_{zou}$ are the same as the subsets in Equation (4), the benchmark dataset definition in the Method section. That is, $S_{xue}$ is the union of $S^{+}$ and $S^{-}_{xue}$, and $S_{zou}$ is the union of $S^{+}$ and $S^{-}_{zou}$. We investigated the discriminative power of all features mentioned in the Method section on $S_{xue}$ and $S_{zou}$ by assessing their information gain with respect to the classes. A higher information gain value [38] means the feature is more powerful. The top 20 most important features on the two datasets are shown in Table 1(A,B), respectively.
MFE and P-value were the two most important features on $S_{xue}$. However, P-value was ranked only 20th on $S_{zou}$, and MFE was ranked outside the top 20. We also found that 14 of the top 20 most important features on $S_{xue}$ belonged to the local triplet sequence-structure (Triplet-SS) category, but only 4 belonged to the primary sequence (3-gram) category. In contrast, for $S_{zou}$, only 5 of the top 20 most important features belonged to the Triplet-SS category, and 14 belonged to the 3-gram category. The structure features are more powerful than the sequence features on the $S_{xue}$ dataset, but this is not the case on the $S_{zou}$ dataset. The results showed that the negative samples have a significant impact on the discriminative power of features.
We took MFE and P-value as examples to analyse the reasons. Their distributions on the positive and negative samples of $S_{xue}$ and $S_{zou}$ were calculated, and the results are shown in Fig. 1. The distributions of MFE and P-value are very similar between $S^{-}_{zou}$ and $S^{+}$, but they differ between $S^{-}_{xue}$ and $S^{+}$. A feature is more discriminative when its distribution differs more between the positive and negative sets. This is why MFE and P-value are strongly discriminative on the $S_{xue}$ dataset, but not on the $S_{zou}$ dataset.
The prediction results are listed in Table 2. The cross-validation results were obtained with the leave-one-out strategy on $S_{xue}^{train}$ and $S_{zou}^{train}$, whereas the independent testing results were obtained by testing on $S_{zou}^{test}$ and $S_{xue}^{test}$. According to Table 2, both predictors performed well in the cross-validation test, achieving accuracies of 87.69% and 98.57%, respectively. But they showed much lower performance on the independent testing datasets; in particular, the accuracy of the classifier trained on $S_{zou}^{train}$ and tested on $S_{xue}^{test}$ dropped from 98.57% to 51.17%. An SVM-based method generates a decision boundary that separates the positive samples from the negative ones. The decision boundaries generated from different datasets differ significantly. As shown in Fig. 2(A,B), the two decision boundaries built on two datasets with different distributions are different. When a decision boundary is used to classify samples from another dataset, the majority of samples are not assigned to their correct categories. As shown in Fig. 2 ...
The information gain of features is a feature selection method used in many fields. In general terms, the expected information gain is the change in information entropy H from a prior state to a state that takes some information as given. A higher information gain value means the feature is more discriminative.
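The information-gain ranking described above can be illustrated with a minimal sketch. It assumes a numeric feature is discretized into equal-width bins before the entropy is computed; the binning scheme, the function names and the `n_bins` parameter are illustrative assumptions, not details given in the text.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels, n_bins=10):
    """Information gain of one numeric feature with respect to the class labels.

    Assumption: the feature is discretized into equal-width bins; the text
    does not specify how continuous features were discretized.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant features
    bins = {}
    for v, y in zip(values, labels):
        idx = min(int((v - lo) / width), n_bins - 1)
        bins.setdefault(idx, []).append(y)
    # H(class | feature): entropy within each bin, weighted by bin size.
    conditional = sum(len(ys) / len(labels) * entropy(ys) for ys in bins.values())
    return entropy(labels) - conditional
```

A feature whose values separate the two classes gives a gain equal to the class entropy, while a feature distributed identically in both classes gives a gain of zero. This mirrors the behaviour of MFE and P-value on the two datasets described above.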

A new predictor built on updated benchmark dataset. We constructed a new benchmark set whose negative samples follow different data distributions, with real human pre-miRNAs as the positive set and the Xue pseudo pre-miRNAs $S^{-}_{xue}$ and the Zou pseudo pre-miRNAs $S^{-}_{zou}$ as negative sets. Trained on this high-quality benchmark dataset, a new computational predictor called iMiRNA-SSF was proposed. Four kinds of features, Triplet-SS, MFE, P-value and N-gram, were evaluated to investigate whether combining them could improve the performance of iMiRNA-SSF. The performance was obtained by using the LibSVM algorithm with leave-one-out cross-validation on the updated benchmark dataset. As shown in Table 3, the best performance (ACC = 90.42%, MCC = 0.79) was achieved with the combination of the four kinds of features. Triplet-SS is a local triplet sequence-structure-based feature; MFE and P-value are features based on the minimum free energy of the secondary structure; N-gram is a sequence-based feature that captures local sequence composition. These features describe the characteristics of pre-miRNAs from different aspects. Therefore, the predictive performance of iMiRNA-SSF can be further enhanced by combining all of the features.
Furthermore, the importance of all features was also investigated, and the results are shown in Table 4. The P-value and MFE features are the most discriminative, followed by the local triplet sequence-structure features and the primary sequence based features.
Comparison with other methods. Three state-of-the-art methods, Triplet-SVM [17], MiPred [18] and miRNApre [20], were selected for comparison with the proposed iMiRNA-SSF. MiPred is a classifier using the Random Forest algorithm combined with Triplet-SS, MFE, and P-value features. miRNApre employs the SVM algorithm with Triplet-SS, N-gram, and MFE features. As mentioned in the introduction section, the reported accuracies of these methods were based on small datasets containing only several hundred samples, without removing redundant sequences; thus, their performance might be overestimated. In order to make a fair comparison, all these methods were evaluated on the same updated benchmark dataset via leave-one-out cross-validation. Their predictive results are shown in Table 5. To further illustrate the comparison, receiver operating characteristic (ROC) scores of the different methods are provided in Fig. 3. The ROC scores of Triplet-SVM, MiPred, miRNApre and iMiRNA-SSF are 0.90, 0.92, 0.94 and 0.96, respectively. iMiRNA-SSF outperforms the other three state-of-the-art methods.
Web-server description. For the convenience of the vast majority of experimental scientists, we provide a simple guide on how to use the iMiRNA-SSF web-server. It is available at http://bioinformatics.hitsz.edu.cn/iMiRNA-SSF/.
Step 1: The homepage is shown in Fig. 4. Users can input their test data in two ways. One way is to copy pre-miRNA sequences in FASTA format into the text area. The other way is to upload a test file. Example sequences can be found by clicking on the Example link.
Step 2: Click on the prediction button to submit. iMiRNA-SSF will decide whether the test sequences are real human pre-miRNA sequences or not. Note that computing the P-value feature is expensive, because for each query sequence the secondary structures of 1000 randomly shuffled sequences must be predicted with the Vienna RNA software.
Step 3: An output example is shown in Fig. 5. If the predicted class is Real pre-miRNA, the query is most probably a pre-miRNA. Besides the predicted class, other useful information is output, including the secondary structure, MFE and P-value.

Discussion
By exploring two datasets that were constructed with the same positive set and different negative sets, we found that negative samples have a significant impact on the predictive results of various methods. Therefore, we constructed an updated benchmark set whose negative samples follow different data distributions. A new predictor called iMiRNA-SSF was proposed, which was trained on this high-quality benchmark dataset. Experimental ... The results showed that structure features are more discriminative than sequence features for pre-miRNA identification.

Table 5. The performance comparison of different methods. All the methods were evaluated on the same updated benchmark dataset via leave-one-out cross-validation. Note: since the number of positive samples is not equal to the number of negative samples, the penalty factors were set so that the weight of positive samples is 2 and the weight of negative samples is 1. Columns: Method, Acc (%), Sn (%), Sp (%), MCC, ROC.

As shown in this study, the quality of the training samples is very important for improving the predictive performance of a computational predictor. The proposed framework of combining samples with different distributions can be applied to other important tasks in the field of bioinformatics, such as DNA binding protein identification [39,40], protein remote homology detection [41,42], and the prediction of enhancers and their strength [43]. Therefore, in our future studies, we will focus on applying the proposed framework to improve performance on these problems.

Method
Datasets. Our benchmark dataset for pre-miRNA identification (see the Supplementary information) consists of real human pre-miRNAs as the positive set and two subsets of pseudo pre-miRNAs as the negative set. Pre-miRNAs sharing more than 80% sequence similarity were removed using the CD-HIT software [44] to eliminate redundancy and avoid bias. The benchmark dataset can be formulated as:
$$S = S^{+} \cup S^{-}, \qquad S^{-} = S^{-}_{xue} \cup S^{-}_{zou}$$
where the positive set $S^{+}$ contains 1612 human miRNA precursors, which were selected from the 1872 reported Homo sapiens pre-miRNA entries downloaded from miRBase [36,37]; the negative set $S^{-}$ is the union of $S^{-}_{xue}$ and $S^{-}_{zou}$; $S^{-}_{xue}$ contains 1612 Xue pseudo miRNAs, which were selected from the 8494 pre-miRNA-like hairpins [17]; and $S^{-}_{zou}$ contains 1442 Zou pseudo miRNAs [20]. As miRNAs locate in the untranslated regions or intragenic regions, both $S^{-}_{xue}$ and $S^{-}_{zou}$ were collected from the protein coding regions. The main difference between them is that they were constructed with different techniques: $S^{-}_{xue}$ was collected according to widely accepted characteristics, and $S^{-}_{zou}$ was collected by a multi-level negative sample selection technique. For more information, please refer to [17,20].

Figure 4. Users can input their test data in two ways. One way is to copy the query into the text area, and the other is to upload a test file in FASTA format.

Figure 5. An example of prediction result. If the predicted class is Real pre-miRNA, the query is most probably a pre-miRNA. Some useful information is also provided, including the secondary structure, MFE and P-value.

Features for characterizing microRNA precursors. Various sequence-based features were used in this study, including primary sequence features, the minimum free energy feature, the P-value of the randomization test, and local triplet sequence-structure features, which are described below.
Primary sequence features (N-gram). For a given RNA sequence R:
$$R = S_1 S_2 S_3 \cdots S_L$$
where $S_i \in \{$Adenine (A), Cytosine (C), Guanine (G), Uracil (U)$\}$; $S_1$ denotes the nucleic acid residue at sequence position 1, $S_2$ denotes the nucleic acid residue at position 2, and so on up to position $L$, the length of the sequence. The sequence pattern $S_{i+1} S_{i+2} S_{i+3} \cdots S_{i+N}$ is called an N-gram. N-grams refer to all the possible sub-sequences of length N. The number of different N-grams is $4^n$ (n is the length of the N-gram). Following previous studies [17], we set n to 3, so the number of different 3-grams is $4^3 = 64$.
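As an illustration of the N-gram features described above, the following minimal sketch builds the 64-dimensional 3-gram vector for one sequence. Normalizing counts by the number of windows, mapping T to U, and skipping windows with ambiguous symbols are assumptions made for this example; the function name `ngram_features` is hypothetical.

```python
from itertools import product

ALPHABET = "ACGU"


def ngram_features(sequence, n=3):
    """Return the frequency of every possible n-gram over A, C, G, U.

    Assumptions for this sketch: DNA-style T is mapped to U, windows that
    contain other symbols (e.g. N) are skipped, and counts are normalized
    by the number of windows counted.
    """
    seq = sequence.upper().replace("T", "U")
    keys = ["".join(p) for p in product(ALPHABET, repeat=n)]  # 4**n patterns
    counts = dict.fromkeys(keys, 0)
    windows = 0
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        if gram in counts:
            counts[gram] += 1
            windows += 1
    return [counts[k] / windows if windows else 0.0 for k in keys]
```

The vector always has $4^n$ entries in a fixed order (AAA, AAC, ..., UUU for n = 3), so sequences of different lengths map to vectors of the same length, as an SVM requires.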
Minimum free energy feature (MFE). The MFE describes the stability of an RNA secondary structure. Evidence shows that miRNAs have lower folding free energies than random sequences [45]. The MFE of the secondary structure was predicted by the Vienna RNA software package (release 2.1.6) [46] with default parameters.

P-value of randomization test feature (P-value).
In order to determine whether the MFE value is significantly different from that of random sequences, a Monte Carlo randomization test was used [47]. The process can be summarized as follows:
(1) Infer the MFE value of the original sequence.
(2) Randomize the order of the nucleotides of the original sequence while keeping the dinucleotide distribution (or frequencies) constant [48]. Then infer the MFE value of the shuffled sequence.
Local triplet sequence-structure features (Triplet-SS). In the predicted secondary structure, there are only two statuses for each nucleotide, paired or unpaired, represented as brackets "(" or ")" and dots ".", respectively. The left bracket "(" means that the paired nucleotide is located near the 5′-end, and the right bracket ")" means that the nucleotide is paired with another one near the 3′-end. When the sequences were represented as vectors, we did not distinguish these two situations and used "(" for both.
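The bracket-merging step described above can be sketched as follows. The text here does not spell out the full encoding, so this example assumes the 32-dimensional scheme commonly attributed to Triplet-SVM [17]: each position contributes its nucleotide together with the paired/unpaired status of itself and its two neighbours (4 nucleotides × 8 status patterns). The function name and the normalization are illustrative assumptions.

```python
from itertools import product


def triplet_ss_features(sequence, structure):
    """Frequencies of the 32 (middle nucleotide + 3-position status) patterns.

    Assumption: the encoding follows the common Triplet-SVM convention;
    the exact normalization used by iMiRNA-SSF is not given in the text.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = sequence.upper().replace("T", "U")
    # Merge ")" into "(" so every position is either paired "(" or unpaired ".".
    ss = structure.replace(")", "(")
    keys = [nt + "".join(p) for nt in "ACGU" for p in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    for i in range(1, len(seq) - 1):
        key = seq[i] + ss[i - 1:i + 2]  # e.g. "A(.." for an A flanked by "(" and "."
        if key in counts:
            counts[key] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}
```

For a toy hairpin such as `GGGAAACCC` with structure `(((...)))`, the pattern `C(((` is counted for the cytosine inside the 3′ arm, because right brackets are treated as left brackets.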

Support Vector Machine. Support Vector Machine (SVM) is a supervised machine learning technique based on statistical learning theory for classification tasks [49]. Given a set of fixed-length vectors with positive or negative labels, SVM can learn an optimal hyperplane to discriminate the two classes. New test samples can then be classified based on the learned classification rule. SVM has exhibited excellent performance in practice and has a strong theoretical foundation in statistical learning. In this study, the LibSVM algorithm was employed, which is an integrated software tool for SVM classification and regression. The kernel function was set to the Radial Basis Function (RBF). The two parameters C and τ were set to 11 and -9, respectively, which were optimized by using the grid tool in the LibSVM package [49].
Leave one out cross validation. Three validation methods, the independent dataset test, the sub-sampling (or K-fold cross-validation) test and the leave-one-out test, are often used to evaluate the performance of a predictor. Among these three methods, the leave-one-out test is deemed the least arbitrary and most objective, as elucidated in [49-51]. It has been widely recognized and adopted by investigators to examine the quality of various predictors. In the leave-one-out test, each sequence in the benchmark dataset is in turn singled out as an independent test sample, and all the rule-parameters are calculated from the remaining sequences, without including the one being tested.
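The leave-one-out procedure can be sketched in a few lines. To keep the example self-contained, a simple nearest-centroid classifier stands in for the RBF-kernel SVM used in the study; it is not the authors' classifier, and all function names here are hypothetical. Only the splitting logic is the point of the sketch.

```python
def nearest_centroid_fit(X, y):
    """Stand-in classifier (not an SVM): store the mean vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def nearest_centroid_predict(centroids, x):
    """Predict the label whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


def leave_one_out(X, y, fit=nearest_centroid_fit, predict=nearest_centroid_predict):
    """Return one held-out prediction per sample.

    Each sample is removed in turn, the model is fitted on the remaining
    samples only, and the removed sample is then classified.
    """
    predictions = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)  # the held-out sample is never seen here
        predictions.append(predict(model, X[i]))
    return predictions
```

Passing a different `fit`/`predict` pair, for example thin wrappers around an SVM library, would reuse the same loop. With N samples the model is trained N times, which is why this test is costly on large benchmark datasets.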

Measurement.
For a prediction problem, each prediction made by a classifier falls into one of four categories: false positive (FP), true positive (TP), false negative (FN) and true negative (TN). As shown in previous studies [52,53], the total prediction accuracy (ACC), Specificity (Sp), Sensitivity (Sn) and Matthews correlation coefficient (MCC) for assessment of the prediction system are given by:
$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\mathrm{Sp} = \frac{TN}{TN + FP}$$
$$\mathrm{Sn} = \frac{TP}{TP + FN}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
The receiver operating characteristic (ROC) score [54] was also employed to evaluate the performance of different methods, because it captures the trade-off between specificity and sensitivity. An ROC score is the normalized area under a curve that plots true positives as a function of false positives for varying classification thresholds. An ROC score of 1 indicates a perfect separation of positive samples from negative samples, whereas an ROC score of 0.5 denotes random separation.
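The measures above can be computed directly from the four counts. The sketch below uses the standard definitions of ACC, Sn, Sp and MCC, and computes the ROC score with the rank-based (Mann-Whitney) formulation of the area under the curve. The function names and the convention of returning 0 for an undefined MCC are choices made for this example.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """ACC, Sn, Sp and MCC from the four confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sn = tp / (tp + fn)  # sensitivity: fraction of real pre-miRNAs recovered
    sp = tn / (tn + fp)  # specificity: fraction of pseudo pre-miRNAs rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0.0 when undefined
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


def roc_score(scores, labels):
    """Area under the ROC curve.

    Computed as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as one half.
    Labels are 1 for positive samples and 0 for negative samples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Unlike ACC, the MCC stays informative when the classes are unbalanced, which is relevant here because the benchmark contains more negative than positive samples.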