Adaboost-SVM-based probability algorithm for the prediction of all mature miRNA sites based on structured-sequence features

The significant role of microRNAs (miRNAs) in various biological processes and diseases has been widely studied and reported in recent years. Several computational methods associated with mature miRNA identification suffer various limitations involving canonical biological features extraction, class imbalance, and classifier performance. The proposed classifier, miRFinder, is an accurate alternative for the identification of mature miRNAs. The structured-sequence features were proposed to precisely extract miRNA biological features, and three algorithms were selected to obtain the canonical features based on the classifier performance. Moreover, the center of mass near distance training based on K-means was provided to improve the class imbalance problem. In particular, the AdaBoost-SVM algorithm was used to construct the classifier. The classifier training process focuses on incorrectly classified samples, and the integrated results use the common decision strategies of the weak classifier with different weights. In addition, the all mature miRNA sites were predicted by different classifiers based on the features of different sites. Compared with other methods, the performance of the classifiers has a high degree of efficacy for the identification of mature miRNAs. MiRFinder is freely available at https://github.com/wangying0128/miRFinder.

MicroRNAs (miRNAs) are an ~22 nucleotide (nt) long, conserved class of noncoding RNAs that play key regulatory roles in diverse biological processes and diseases 1,2 , especially cancers, by modulating the gene expression 3 . Therefore, identification of miRNAs has important significance for mining the association of miRNAs and diseases. In mammals, mature miRNAs derive from the hairpin in primary transcripts by two cleavages process: primary miRNA (pri-miRNA) is processed and canonically cleaved to precursor miRNA (pre-miRNA) by Drosha 4 , then pre-miRNA is exported by Exportin 5 5 and cytoplasmically processed into a miRNA:miRNA* duplex by Dicer 6,7 . Afterwards, one strand of the duplex product becomes a mature miRNA and the other degrades [8][9][10] . In individual cases, both strands are entirely functional.
Because pre-miRNAs are conservative and have the typical features involving sequence, structure and free energy, much software has been development to identify the mature miRNAs from their pre-miRNAs based on computational methods. Table 1 displays the main algorithms and software for mature miRNA identification.
MatureBayes 11 identifies the starting sites of mature miRNAs for mice and humans based on Naïve Bayes. As a result, the method finds that 7, 8 and 9 nt from the starting position have the typical biological features to distinguish mature miRNAs. Microprocessor SVM 12 , MiRpara 13 , MaturePred 14 and matPred 15 are developed using the SVM algorithm. Microprocessor SVM is proposed based on 686 features that are associated with sequence and structure, the accuracy of this method is 50%, and 90% of its predictions are within a 2 nt deviation. Due to too many features, the results are not easy to use for biological analysis. In addition, the forecasting ability for the 3′ arm of miRNA is relatively weak. MiRpara is trained for animals and plants, and upon filtering these features based on the greedy algorithm and the SPSS method, its accuracy reached 80%. MaturePred uses 160 features to

Materials and Methods
Data. The training and test datasets were obtained from the miRBase V21 29 . The pre-miRNAs that have a secondary structure including multibranch loops were excluded. The process of dataset construction is shown in Fig. 2.
First, the human pre-miRNAs were downloaded from miRBase V20 and divided into 5_data and 3_data based on the location of the mature miRNAs. If the mature miRNA of pre-miRNA is located in the 5′ arm, then the pre-miRNA is divided into 5_data. Otherwise, if the mature miRNA of pre-miRNA is located in the 3′ arm, then the pre-miRNA is divided into 3_data. In a similar way, the pre-miRNAs that only belong to version 21, named miRBase V21_new, were downloaded from miRBase V21 and divided into 5_newdata and 3_newdata. Second, Test set1 and Test set2 for testing the 5′ arm mature miRNAs models were randomly selected from 5_data and 5_newdata. Between them, Test set2 was randomly selected from the newest dataset that only belongs to version 21. Moreover, Test set3 and Test set4 for testing the 3′ arm mature miRNAs models were randomly selected from 3_data and 3_newdata, and between them, Test set4 was randomly selected from the newest dataset that only belongs to version 21. Finally, except for the Test set, the other dataset was used to construct the training data. As a result, we constructed the 5_for_train and the 3_for_train datasets to train the models for identifying the 5′ arm and 3′ arm mature miRNAs, respectively. The 3_for_train and 5_for_train datasets consist of 1118 and 1071 pre-miRNAs sequences, respectively. Test set1, Test set2, Test set3 and Test set4 consist of 100 pre-miRNAs. The positive and negative datasets were constructed based on the above training and test datasets. Taking hsa-mir-19a as an example, the structure-based dataset construction method is shown in Fig. 3. Figure 1. The schematic of the overall method. MiRFinder consists of five steps, namely, dataset construction, construction of structured-sequence, biological-related features extraction, feature selection, class imbalance problem and classifier construction. For features selection, three algorithms including information gain, chi-square and relief were investigated. For class imbalance problem, the AdaBoost algorithm was adopted. Moreover, for classifier construction, the AdaBoost and SVM were used. SVM is used as the weak classifier, and AdaBoost is used as the strong classifier.  RNAfold 30 was used to predict the secondary structure of each pre-miRNA, and then the stem-loop of each pre-miRNA was constructed based on the sequence and structure. Furthermore, the stem of the pre-miRNA was complemented by a 10 base pair (bp) double strand and a right base pair (bp). The double strand regions, which include 22 nt nucleotides in the 5′ arm, were extracted from the first to the last nucleotide using the sliding window method. Finally, the double strand, which was consistent with the starting position of the true mature miRNA, was selected as the positive example, whereas the other sequences were defined as the negative example.
The extraction of feature set. Taking has-miR-19a as an example, the process of feature extraction is shown in Fig. 4. To extract the features of mature miRNAs, we predicted the secondary structure of each pre-miRNA using ViennaRNA software, and then constructed the structured-sequence based on the sequence, structure and the alignment information. Depending on the requirements of feature extraction, strand regions of different lengths were selected. Based on the sequence, structure, structured-sequence and these strand regions, the features were extracted.
The absence information, '−' , in the structured-sequence is important. It always means the change in free energy, sequence and structure associated with a 'bulge' , 'interloop' and 'multibranch loop' . To compare the performance of the structured-sequence with the absence information, we also extracted features without the absence information.
Feature selection. The feature set of mature miRNA recognition includes hundreds of feature vectors, and such high-dimensional data can cause "dimension disaster". The effective feature selection algorithm is an important part of data preprocessing. Feature selection can remove the uncorrelated redundant features, reduce the data set dimension, optimize and classify the high correlation characteristics, avoid dimension disaster in the machine learning method, and improve the generalization performance of the model. Importantly, by analyzing the characteristics of mature miRNAs, the typical characteristics of true and pseudo-mature miRNAs that differentiate mature miRNAs and other short nucleotide sequences are identified, which has important value and biological significance for miRNA recognition and its functional research.
In this paper, three feature selection algorithms are used to select the feature selection, namely, the information gain (IG), CHI (CHI) and relief algorithms. On the basis of a performance test, we selected the optimal feature selection algorithm. (1) The information gain (IG) selects characteristics by calculating the difference in information entropy. It defines class C i , and the information gain of characteristic t is defined as: where t means t nonexistence. Then, the IG of t is defined as: i i (2) Chi-square statistic (CHI) selects features between the representation variables by calculating the correlation. The larger the statistical value of CHI is, the more important the feature is. For any type of C_i, the CHI value of characteristic t is: The relief algorithm selects the nearest neighbor according to the weight by calculating the distance between the samples. Set = … X {X , X , , X } 1 2 n as the sample dataset, and X i = [X i1 , X i2 , …X iN ] T is the Nth character of the ith sample. The weight of the sample on each characteristic is defined as: where H(x) and M(x) are the nearest neighbors of the same and different class of X, respectively, and diff is defined as: Weak classifier training based on SVM with probability. Mature miRNAs identification identifies the true mature miRNAs from many short sequences that are constructed from one pre-miRNAs. Therefore, for each pre-miRNA, we cannot identify whether a short sequence is a mature miRNA, but the possibility exists that it is a mature miRNA, so here we introduced the SVM method based on probability to train a mature miRNA classifier. The SVM algorithm based on a probabilistic model is as follows: Define mature miRNA training samples = … ….
,(x , y )} C is a penalty parameter, and <x, x i > is the inner product of x and x i . The kernel function is used to calculate the inner product, and it solves the problem of the data mapping from the original space to the high dimensional linear nonseparable problem. In particular, the radial basis function (RBF) is defined as follows: where δ is the conventional control parameter, which determines the weight of the feature. The output of traditional SVM is binary. This means that for each point, it either belongs to a positive or negative class. In this way, for a pre-miRNA, all the candidate mature miRNA sequences on one arm are predicted to be pseudo-sequences, or the predicted mature miRNAs are greater than or equal to two sequences. Obviously, this is not biological. The mature miRNAs prediction needs to find the only mature miRNA in one arm of a pre-miRNA, so the SVM based on probability output was used to solve this problem.
For the sample χ, the posttest probability is: The sum of the probability of the sample belonging to two classes is 1. Therefore, the constraint conditions of equation (8) are: Set r ij to the probability estimate of two types of problems. According to (9), (10) proposes the following solution: is calculated as follows: The matrix Q is a semipositive definite matrix, so equation (11) is a convex quadratic programming problem with linear constraints. If P is the optimal solution to the quadratic programming problem, the following conditions are met: T 0 The solution to equation (11) can be obtained by solving linear equations.
Using the above method, the probability of each short sequence of a pre-miRNA is estimated, and the probability of the true mature miRNA is given. In the process of model training, it is necessary to optimize the two parameters of the planning factor C and the Gaussian width g. For the planning factor, if C → ∞, it is shown that the classification rules for satisfying all the constraint conditions will reduce the generalization ability and improve training complexity, so the C range must be as wide as possible to satisfy the classifier generalization performance. For parameter g, the optimization algorithm is used for adjusting, and the software grid py is used for training.

Strong classifier constructed based on Adaboost.
When SVM is combined with AdaBoost, on the one hand, the SVM algorithm makes up for the error in AdaBoost in processing high-dimensional data. On the other hand, the SVM algorithm follows the structure risk minimization principle, and the parameter optimization of the RBF_SVM classifier can improve the classification performance of a weak classifier. By selecting appropriate parameters C and g, it can avoid overfitting. In addition, the AdaBoost algorithm is also a process of data transfer in the integration process, and the research on the solution of the class imbalance problem is a direction worth exploring.
For the selection of data sets, the AdaBoost algorithm implements the training subset sample selection according to the continuous adjustment of the sample parameters. The method first sets the initial weight value, then adjusts the sample weight through the sample error rate during each progressive training process and adjusts the weight of the weak classifier accordingly. In the entire process, the weight of the wrong subsample is divided into emphasis to improve the recognition rate. Our method is based on AdaBoost, and the weak classifier adopts the SVM algorithm based on the probabilistic mode, and in each round of weak classifier training, the parameter optimization is performed. The Adaboost-SVM algorithm is described as follows: Set the training dataset S = {(x i , y i )|i = 1,2, … n}, where x i ∈ X is the mature miRNA sample, and y i ∈ Y = {+1, −1} is the class of the samples.
Set the weight of x i in the training set, which in the sample of training dataset S is D (i) t in round t, and the first round of sample weight is initialized to: (P , P P , P ), P P P 1/N We use the adjustable parameters of the SVM based on probability as a weak classifier. In the process of training, through the parameter adjustment, the optimal classification plane is selected, and for each of the pre-miRNA, a given probability of each sample that it is the true mature miRNA is given. The class of the candidate that has the largest probability is set as +1, the class of other candidates are set as −1, and the classifier G t (X): Set the training round T. Define the weight distribution of training set S: The classifier weight is: t t t Then, the weight of the sample can be update as: where formula (19)can be shown as, In formula (20), z t is defined as the weight of the next round t of the training set, which is a generalized constant defined as:  with respect to the imbalanced dataset including the positive and negative dataset, we designed the algorithm to solve the class-imbalance problem of mature miRNAs prediction. Set the training dataset S = {S neg , S pos }. The K-means algorithm was used to cluster the positive samples. Given the negative dataset including n samples, define a threshold that is the number of clusters. Because the ratio of the positive and the negative dataset is 1:10, the threshold was set as 10. This means that the negative samples were divided into 10 clusters, namely, K 1 , K 2 , ……, K 10 , where K i ∈ D, K j ∈ D, K i ∩ K j = φ.
For any cluster K i , define k m as the center of mass, which is obtained by calculating the average value of the features, and the distance of any sample k n and the center of mass is defined as dist(k n , k m ). The mass E is defined as the sum of all squared sample features in the cluster and the center of mass, as follows: By calculating the optimal distance distribution, the distance between samples in the cluster is minimized, but the distance between samples and other clusters is maximized and reaches the maximum degree of mutually independence between clusters.
The pseudocode for constructing the sample balance method of the center of mass near distance training based on the K-means algorithm is described as: Algorithm: The sample balance method of the center of mass near distance training based on K-means Input: negative dataset, cluster threshold Output: negative subset The processing flow: Input negative dataset; The threshold of the cluster; Ten cluster masses were selected in the negative dataset. The distances between all samples and the center of mass in the negative dataset were calculated and divided into 10 clusters.
The mean of each cluster was recalculated and was taken as the new center of mass of the cluster. If the centers of mass are the same, the next step is taken. Otherwise, the center of mass is calculated circularly until the centers of mass are the same.
After determining the center of mass, one tenth of the samples that are close to the center of mass was selected as a subset of the negative subset.
The above method is used to determine the initial training subset S 1 . When we trained the first weak classifier, the training dataset was classified, and the misclassified subset S incor1 and S 1 were combined to construct the training subset S 2 , which was used to train the next classifier. Therefore, the training subset S i is defined as: i p os neg1 incorr(i 1) i ncorr(i 2) i ncorr1 Classifier performance estimation. To evaluate the classification performance of the classifier, two indices are adopted, namely, the prediction accuracy and position deviation. The prediction accuracy was defined as the percentage of the correct mature miRNA and total mature miRNA. For N pre-miRNA sequences, there are M mature miRNA candidate sequences in the ith sequence. Assuming that T (s) is the true mature miRNA and P (T) is the true mature miRNA, the prediction accuracy, Acc, can be described as:

Results and Discussion
Comparison of the structured-sequence-based and sequence-based methods. To investigate the classifier performance of the structured-sequence-based characters and sequence-based characters, we designed the mature miRNA identification method based on sequence-based characters. Taking P5_5 as an example, the two methods are compared in Tables 2-5: As shown in Tables 2-5, taking training set 1 as an example, we compared two classifiers that are trained based on sequence-based characters and structured-sequence-based characters: the first candidate prediction accuracy are 17% and 30%, respectively, the latter being 13% higher than the former. The position deviation predicted accuracy with the 5 nt position deviation was 67% and 90%, respectively. The latter is 33% higher than the former.

DS
±0 nt ±1 nt ±2 nt ±3 nt ±4 nt ±5 nt Total   Table 5. The position deviation predicted accuracy of the top five candidates of the structured-sequence-based classifier.     The position deviation predicted accuracy of the top five candidates are 46% and 72% respectively, as the latter increased by 26%. The predicted accuracy with the 5 nt position deviation is 98% and 100%. Therefore, according to various indicators, the performance of the classifier based on the structured-sequence-based characters is greatly improved.
To investigate the efficiency of our tool as it varies in the presence of polynucleotide regions in the bulge of the pre-miRNAs that form the spacer in the pre-miRNA coding frame, we test it based on two test datasets. The components of bulge (absence information, "−") were counted, and the relationship between the amount of absence information and identification accuracy are shown in Fig. 5 and Table 6.
When the amount of absence information is 0-2, the identification accuracy of the first candidate is 51.2%, and it is the highest. When the amount of absence information is 3-5, the identification accuracy of the first candidate is 33.8%, and it is the lowest. Overall, the identification accuracy decreased as the amount of absence information increased. This result can be analyzed in two ways. On the one hand, it illustrates that the identification of our tool can be affected by the amount of absence information, and with the amount of absence information increasing, the identification accuracy decreased. On the other hand, it indicates that the capture of biological characters were affected by the amount of absence information, and with the amount of absence information increasing, biological characteristics become less typical.
Comparison of the feature selection methods. We examined three feature selection algorithms: the information gain algorithm, chi-square statistics and the relief method. First, we used all feature sets to train the classifier to obtain the accuracy of the position offset prediction of the first candidate. Then, we used these three algorithms to filter feature sets. The information gain method sorts features according to the information gain, whereas the chi-square statistic method provides a measure of the correlation between features and the categories of measurements. The relief method sorts the features according to the sample weight. According to the results of the algorithms, features that contribute a value of less than zero are deleted, and then the feature selection algorithm and the feature set are confirmed.
The position deviation predicted accuracy of the first candidate based on all features is shown in Table 7, and the position deviation predicted accuracy of the first candidate based on the chi-square and relief algorithms is shown in Tables 8 and 9. The position deviation predicted accuracy of the first candidate based on IG is shown in Table 10.    The position deviation predicted accuracy of the first candidate of the all features, chi-square Relief and IG algorithms were 24%, 12%, 13%, and 30%, respectively. The position deviation predicted accuracy of the first candidate with 5 nt deviation was 79%, 71%, 66% and 90%, respectively. The information gain method obtained the highest prediction performance, and when the feature subset selected 110 features, we obtained the maximum prediction precision. Compared with the all features method, the position deviation predicted accuracy of the first candidate of the two test sets were 30% and 59%, and improved 6% and 11%, respectively.
Comparison of methods before and after using the balance algorithm. To investigate the effect of balance algorithm on the classifier performance, taking Test1 as an example, we compared the methods before and after using the balance algorithm. The position deviation predicted accuracy and APD of the first candidate and the top five candidates are shown in Tables 11 and 12, respectively.
The position deviation predicted accuracy of the first candidate is 31% and 33% before and after using the balance algorithm, respectively, and the position deviation predicted accuracy with the 5 nt position deviation are 90% and 100%. The position deviation predicted accuracy of the top five candidates is 66% and 68%, and the   Comparison with other methods. Taking Test1 as an example, four methods that were developed to identify the mature miRNAs were compared with our method. The position deviation predicted accuracy of MiRPara, MatureByes, MiRdup, MatPred and MatFinder is shown in Fig. 6 and Table 13. The prediction accuracy of MatFinder, MiRPara, MatureByes, MiRdup and MatPred, for the first candidate was 4%, 9%, 26%, 30% and 33%, respectively. Our method, MatFinder was 29%, 24%, 7% and 3% higher than that of the other three methods. With the 5 nt position deviation, the prediction accuracy was 37%, 84%, 81%, 90% and 100%, and MatFinder was higher than the other three methods. In addition, the average position deviation is 5.43 nt, 4.65 nt, 2.67 nt, 2.45 nt and 2.05 nt, respectively. Above all, the MatFinder method is significantly superior to the other methods in the various indices. the performance of classifiers on all sites identification. Due to the excellent performance of MatFinder for the P5_5 site, other sites identification methods were trained based on different datasets to accomplish all sites identification. The performance of the P5_3 classifier was test based on Test1 and Test2, and the performance of the P3_3 and P3_5 classifiers were investigated using Test2 and Test3.
The position deviation predicted accuracy of the top five candidates of mature miRNA for all sites is shown in Table 14. The position deviation predicted accuracy of the P5_3, P3_5 and P3_3 classifiers was 66%, 55% and 67%, respectively. With the increase in the deviation nucleotide distance, the accuracy improved greatly. Within a 1 nt deviation, the accuracy was 92%, 89% and 84%, respectively. Within a 5 nt deviation, the recognition accuracy reached 100%, 100% and 98%, respectively.

Discussion and Conclusion
MiRBase data sets consists computational predictions as well, but the computational predictions had the canonical biological characters, they will not affect our prediction results. Therefore, in our train and test dataset, we choose all the pre-miRNAs from miRBase, the experimental and computational annotations in miRBase are not separate out in our data processing.
The results show that our method, Matfinder, is superior to the other methods in identification accuracy and average position deviation. These results can be explained as follows: The feature extraction methods based on sequence and structured-sequence were designed to investigate the biological significant of absence information. On basis of the comparison results, the structured-sequence-based method obtained the better classification results. This shows that structured-sequence-based features can represent mature miRNAs biological characteristics. It also illustrates the importance of the miRNAs structural biological characteristics. The secondary structure of miRNAs is divided into two parts, one part is the base complement of each other, the other part is isolated from the double helix region without base pairs, namely, the loop, which primarily includes: hairpin loops, the inner loops and multibranch loops. The extracted-features capture these characteristics of loops to a greater extent, so the identification accuracy is improved to a certain extent.
MatFinder is proposed based on the integration method, which is a strong classifier using AdaBoost and a weak classifier using the adjustable parameter SVM algorithm. Aiming to improve the data imbalance problem, the K-means algorithm is used to construct the center of mass of closer samples. To achieve the balanced training subsets of the data, in the process of the integrated classifier, and focus on incorrectly classified samples, the integrated results use common decision strategies of the weak classifier with different weights. This method not   only solves the problem of monotone sequence diversity in mature miRNA identification but also improves the performance of the classifier. The weak classifier is constructed using the SVM algorithm based on adjustable probability parameters. First, the SVM algorithm solves the problem of "overlearning" of the AdaBoost algorithm. Second, the SVM algorithm is based on the probability model, and the results provided five mature miRNA candidates. Importantly, in the process of integration, multiple weak classifiers based on SVM can be regulated by adjusting C and g to learn the complexity and classifier performance, which overcomes the effect of the static parameters on the performance of classifiers. The weak classifiers were trained alone with independent parameters.
In addition, we trained the P5_3, P3_5 and P3_3 classifiers based on the different biological characters. These classifiers are more accurate in identifying the first five candidates.