Introduction

Since their discovery about two decades ago, microRNAs (miRNAs) have been detected in a large number of organisms including microbes1, sponges2, metazoans3, plants4, and viruses5. Today, miRNAs are considered important factors in many human diseases and are believed to be good candidates for disease markers and therapeutics6. In plant breeding, miRNAs are applied to control agronomic traits such as tolerance to biotic and abiotic stress factors, to increase yield, to modify fruit development, and to influence crop quality7, 8.

Dysregulation of miRNAs is a hallmark of many diseases, among them cancer9, which makes miRNAs interesting as biomarkers10, even more so since they are often detectable in bodily fluids11 and thus accessible with minimally invasive methods. Modulation of miRNA abundance may lead to therapeutics12 that could prove especially useful in personalized medicine13, particularly when taking into account miRNAs' possible roles in cell-cell (circulating miRNAs)14 and pathogen-host communication15. Experimental detection of miRNAs can be achieved using miRNA profiling approaches, such as microarray, quantitative real-time PCR (qPCR), and deep sequencing technologies16,17,18. There are several challenges in employing such experimental methods19. For instance, qPCR-based and microarray miRNA experiments suffer from low specificity and require extensive normalization16, 18. Neither approach can detect novel miRNAs19, since either primer (qPCR) or target sequences (microarray) must be pre-determined. This need for a priori knowledge does not pose a problem for sequencing-based approaches, but these are hampered by the need for extensive downstream computational analyses, such as by the tools covered and discussed here19. Finally, a miRNA's effect is ideally established at the protein level, and therefore all approaches need to be complemented with Western blot or mass spectrometry analyses, adding significantly to the experimental complexity20.

Considering the massive impact of miRNAs especially in human disease and plant breeding (i.e., human nutrition) and taking into account our inability to experimentally determine all pre-miRNAs, it is crucial for the community to be able to rely on computational methods for pre-miRNA detection.

Although many tools for the detection of pre-miRNAs have been developed (Table 1), a number of key issues remain unaddressed. First, many of the tools do not provide a readily working implementation (Table 1), which makes it impossible for researchers to select the right tool for their data. Moreover, the most popular tool according to Google Scholar citations turns out to be, in general, the least effective one according to our findings (see below). It is thus evident that guidelines for tool selection are needed to enable researchers to make an informed choice. In addition, performance cannot be compared among tools based on their published metrics, since these rest on different data sets, on different approaches to establishing pre-miRNA detection models, and on diverse performance measures. A unified, unbiased evaluation was lacking. Lastly, previous evaluations covered only a subset of the state of the art, were performed only when a new tool was to be published, and used different data as well as varying parameters, which obfuscates comparative evaluation.

Table 1 Available pre-miRNA detection tools

Here, we introduce the first neutral, comprehensive, and quantitative evaluation of the state of the art in ab initio pre-miRNA detection. To cope with the combinatorial complexity, we developed izMiR, a freely available platform (http://jlab.iyte.edu.tr/software/izmir), which allowed the exhaustive application of the 13 tools compared in this study. The izMiR framework further enables the streamlined development of new approaches and the repurposing of the provided methods for specific scenarios.

In the following, we present an in-depth comparison of the 13 most popular tools (Table 1). We employed all published positive and negative data sets and added eight additional ones for tool comparison (Table 2). Part of the data was used in training and testing and provides intrinsic performance measures, whereas the remaining data sets provide extrinsic ones. For both kinds, we record ten statistics, but in the following we only discuss accuracy and the area under the receiver operating characteristic curve (AUC), although each of the measures could be used to rank the 13 approaches. Using our izMiR framework, for each of the 13 tools we picked the best models from 3000 trials to represent it in further analysis. These models were applied to all known pre-miRNAs and to thousands of sequences that likely host no miRNAs, leading to millions of calculations. All izMiR results are stored within the framework and can be directly inspected at every step of the calculation. Thus, integration and comparison with future results is seamlessly supported without the need for computationally expensive recalculation.

Table 2 Data sets

We condense the results of more than 20 million computations into a final summary and guideline. First, we observe that no tool significantly outperforms all other tools on all data sets. We therefore consider ensemble methods, which unify all 13 tools into 6 different predictors. These ensemble predictors, when analyzed in the same manner as the single tools, provide a significant boost in prediction performance. In general, the AverageDT ensemble classifier works best.

Results

Comparison of available tools

Uniform implementations of all tools evaluated in this study had to be created, since few of the original tools were available and functional (Table 1). For our analysis, we used three machine learning algorithms: decision trees (DT), support vector machines (SVM), and naive Bayes classifiers (NB) (Online Methodology; Fig. 1). Figure 1 provides an overview of the accuracy distribution using 1000-fold Monte Carlo cross-validation (MCCV)21 for the averaged performance of all three classifiers (per-classifier distributions are also available: DT: Supplementary Fig. 1, NB: Supplementary Fig. 2, and SVM: Supplementary Fig. 3; Supplementary Table 1).

Fig. 1
figure 1

Classifier accuracy distribution. Box-whisker plots showing the accuracy distribution among selected studies for 1000-fold MCCV. The individual accuracy measures of the DT, NB, and SVM classifiers were merged to create this plot. Per-classifier results can be found in Supplementary Figs. 1–3

It can be directly observed from Fig. 1 that there is no universally best model; this is further supported by Supplementary Figs. 1–3 and Supplementary Tables 2, 3.

Figure 1 indicates that although Chenavg has the highest absolute performance, Ngavg and Dingavg displayed better overall performance, since their accuracy distributions are much less data- and/or classifier-dependent than Chenavg's (Supplementary Tables 2, 3) and since the differences among their maximum accuracies are not very large (ChenSVM: 0.91, NgDT: 0.90, and DingNB: 0.89). The results further show that all models perform much better than random guessing (0.5), but also that none significantly outperforms all others. Therefore, we designed six consensus models integrating the best models from all studies (refer to the “Methods” section and http://www.nature.com/protocolexchange/protocols/4919). These consensus models were compared to the individual studies regarding their receiver operating characteristic curves (Supplementary Figs. 4, 5). The AverageDT model performed best, reaching an AUC of 0.99 and thereby clearly exceeding the next best models, DingDT (AUC: 0.93) and ChenDT (AUC: 0.91).

Model performances

The performance of the generated models is data dependent; therefore, we applied the NB and DT models to all published positive and negative data sets as well as to eight novel ones (Table 2). Positive data sets derive from miRBase and MirGeneDB, where the latter is indicated with a “+” suffix. The mmu* data set represents a filtered version of the available mouse data in miRBase (mmu+ represents mouse data from MirGeneDB), and the novel synthetic data sets we created contain random feature values constrained to specific ranges (see “Methods” section).

The generated models were trained using human examples from miRBase and may not be applicable to other species. To test how well the models generalize, they were used to predict all available pre-miRNAs from all 223 species in miRBase (http://jlab.iyte.edu.tr/software/izmir) using the mirbase data set, data from MirGeneDB, and all negative data (Table 2). The BentwichNB model performed best with 97.53% correct predictions, followed by the ConsensusDT model with 96.71% (Supplementary Table 3). This analysis established the positive prediction rate of the trained models. Additionally, prediction on subsets of the positive data such as hsa, mmu, and mmu* was tested, and ConsensusDT performed well in all cases (Table 3).

Table 3 Model performance summary

It is important to assign positive examples correctly, but it is equally important to reject negative ones. To establish how efficiently non-miRNAs are rejected, nine data sets containing putative negative examples were acquired or established (Table 2). XuNB performed best for the combination of all negative data sets, followed by XueNB and JiangNB. Interestingly, BentwichNB achieved very poor results on the negative data (last rank), and the models performing well on negative data failed on positive data, ranking at the bottom of the list (Supplementary Table 3). These results make clear that positive and negative prediction rates need to be considered simultaneously. Evaluating the tools by the summed rank of both measures, the consensus models showed the highest performance, taking the first three ranks (Table 3). Overall, AverageDT performed best, which is consistent with the receiver operating characteristic (ROC) analysis (Supplementary Fig. 4).

Using izMiR on eukaryote genomes

Whether computational models can be applied to the analysis of large eukaryotic genomes is a question that needs careful evaluation. To test this, we analyzed the Drosophila melanogaster (dme) genome using izMiR models generated from human (hsa model) and Drosophila (dme model) hairpins. AverageDT was used as the representative human model to establish whether the known dme hairpins can be found. Applying confidence thresholds of 0.96 (dme model) and 0.84 (hsa model) to the 256 dme hairpins from miRBase, 183 hairpins were identified with the dme model and 144 with the hsa model. As should be clear from previous work22 and from the filtered mouse results, it is unlikely that all dme hairpins in miRBase are true miRNAs.

From the genome-wide miRNA search on the dme chromosome 2L, we could not extract 16 of the 56 Drosophila hairpins mapped to this chromosome in miRBase. Manual inspection revealed that the secondary structures predicted for the affected genomic regions did not include structures suitable for pre-miRNAs. We discuss the validity of these 16 dme examples on the izMiR website (http://jlab.iyte.edu.tr/files/izmir/HairpinAssessment.pdf). We note that the dme model retrieved all of the hairpins we reject, while the hsa model detected only 12 (missed: dme-mir-4943 (score: 0.33), dme-mir-288 (score: 0.24), dme-mir-1004 (score: 0.18), and dme-mir-4914 (score: 0.02)). This analysis confirmed that both models can be used for the detection of pre-miRNAs on the dme chromosome 2L. For the remaining 40, we analyzed the prediction scores from AverageDT and set thresholds based on lower-quartile values: 0.96 for the dme model and 0.84 for the hsa model (Supplementary Fig. 9). With these cutoff scores, 25 of the miRNAs on chromosome 2L were found among our extracted hairpins by both models, while the dme-mir-275, dme-mir-9378, dme-mir-1006, dme-mir-966, and dme-mir-967 hairpins passed the hsa model and the dme-mir-125, dme-mir-275, dme-mir-9374, dme-mir-960, dme-mir-962, and dme-mir-9c hairpins passed the dme model. Overall, we predicted 17,455 candidate miRNAs with the dme model and 43,103 with the hsa model out of 581,883 hairpins on chromosome 2L (Supplementary Fig. 8). These sequences likely include redundant ones, as we confirmed by clustering the ~290,000 sequences with USEARCH, which led to the elimination of ~65,000 highly similar sequences from the pool. The threshold of the dme model is already very high, but the hsa model's threshold can be adjusted further to reach a number of predictions suitable for experimental validation. For example, choosing a threshold of 0.99 for the hsa model leaves a mere 585 hairpins to be examined (Supplementary Table 6).

Forming consensus is better than individual effort

Deciding which of the 13 methods to use can be a daunting task. First, tools need to be acquired, but some may not be available or functional (Table 1). Then the tools need to be compared and their settings optimized for the given problem, a complicated and time-consuming process. With the izMiR framework, we solve all of these problems by providing a working implementation of all 13 tools. Furthermore, for all of them we analyzed their performance on all known pre-miRNAs in miRBase, all previously used negative data sets, and novel ones. Across the models generated with different classifiers (DT, NB, and SVM), Ding, Ng, and Chen perform best (Supplementary Table 1), but only Xue and Lopes can be readily discarded as alternatives; among the others, none significantly outperforms all others (Fig. 1). To sidestep this decision-making process, we developed ensemble methods, and notably AverageDT outperforms all other tools (Table 3). Applying this methodology to all available data in miRBase shows that it performs extremely well for most species, including plants (Fig. 2).

Fig. 2
figure 2

Generalization performance of izMiR. Line plot showing the true prediction rates (y axis) of hairpins from different organisms. Mmu* stands for mouse hairpins from miRBase filtered for a minimum RPM value of 100, mmu refers to all mouse hairpins without filtering, and mmu+ indicates mouse hairpins from MirGeneDB. Only organisms with a minimum of 200 hairpins in miRBase were selected for this plot. Results for all organisms in miRBase are available for download from our web page: http://jlab.iyte.edu.tr/izmir. The lines do not indicate a mathematical relationship and are only added to simplify visual tracking

We were not able to determine a relationship between the true prediction rate (TPR) and any parameter such as evolutionary distance. However, spikes in the graph may indicate the presence of sequences erroneously annotated as pre-miRNAs. This conclusion is further supported by the difference between mmu and mmu* performance (Fig. 2), by the fact that dme is in the lower accuracy range and we showed that at least 1% of its pre-miRNAs are unlikely to be correct, and finally by the spike for human, for which we previously showed that some of the annotated pre-miRNAs are unlikely to be true22.

Discussion

MiRNAs are of major interest as they can serve as disease markers, therapeutics, or agents to increase agricultural productivity. Since detecting novel miRNAs experimentally is challenging, the implementation and use of computational tools for miRNA analysis has gained importance. All available pre-miRNA detection tools discussed here (Table 1) employ machine learning, for which training data quality is of crucial importance.

Unfortunately, it is currently impossible to establish a true negative data set and, therefore, the quality of available negative data sets is hard to assess23, 24. The Zou, NotBestFold, and pseudoFR data sets are difficult for most algorithms (Table 3). This may stem from the use of coding sequences in the construction of Zou, which may in part contain pre-miRNAs24. Our NotBestFold data set consists of pre-miRNAs from miRBase, but suboptimal folds were used in its construction, so that approaches focusing on sequence will still label them pre-miRNAs, whereas those using structure-related features may not. Finally, pseudoFR was generated using random numbers drawn within the maximum range of the pseudo data set, which apparently shifts the distribution such that the data set becomes much harder to solve; this is supported by the better performance on pseudoBQ and pseudoAM, which were generated using smaller intervals (see “Methods” section). The minimum free energy and p value distributions show that those of Zou and Chen are quite similar to the positive examples (hsa, mmu), which may explain the low performance of the models and may warrant further scrutiny (Supplementary Figs. 6, 7).

Positive training data are generally taken from miRBase, and while it is clear that true negative data are not currently attainable, our previous analysis23 and our results for mouse (Table 3, Fig. 2) indicate that positive data may also need further scrutiny. A simple filtering approach, removing mouse hairpins with low transcriptomic evidence (reads per million (RPM) counts below 100), led to a 10% difference in correct prediction rate (Fig. 2). Using only mouse hairpins from MirGeneDB further improved the results but led to a wider spread among model performances. Considering that the proteins taking part in miRNA biogenesis are conserved in most eukaryotes23, 25, increasing positive data quality may help produce better models that can be applied to a wide range of organisms, as izMiR exemplifies (Fig. 2). To enable future studies in this regard, we combined the building blocks for a machine learning approach into a unified, comprehensive, yet adjustable and extensible data analytics workflow, publicly available at http://jlab.iyte.edu.tr/software/izmir.

All relevant studies in the field of ab initio pre-miRNA detection were compared impartially, establishing that BentwichNB classified positive data most correctly, achieving 97.53% TPR, closely followed by ConsensusDT with a TPR of 96.71% (Supplementary Table 5). For negative data, XuNB performed best with a true negative rate (TNR) of 94.66%, followed by XueNB and JiangNB with TNRs of 94.66% and 93.68%, respectively. More important than these individual achievements is whether a model performs well for TPR and TNR at the same time. The best average correct prediction rate was achieved by AverageNB (90.72%) followed by GaoNB (86.20%), but these numbers may be affected by outliers, so we base our decisions on the lowest sum of ranks over all data sets. Among the 13 studies evaluated, BurgtDT and XueDT had the lowest sum of ranks (tie), but most of our ensemble methods ranked in the top ten. When not considering ensemble methods, we advise using either JiangNB or DingNB, which on average perform similarly well. Taking ensemble methods into account, AverageDT had a lower sum of ranks than ConsensusNB, and considering the trained models (Supplementary Figs. 4, 5), AverageDT performed best, reaching an area under the ROC curve of 0.99, while AverageNB only achieved 0.93 and ranked fifth (Table 3). Therefore, we suggest AverageDT as the default for pre-miRNA detection. Homology-based pre-miRNA detection is not discussed in this work, since the automatic selection of homologs at a suitable evolutionary distance is currently not possible in a framework like izMiR. However, homology-based features can seamlessly be integrated into izMiR, allowing specialists to extend the framework accordingly. We suggest, however, separating these tasks: first perform pre-miRNA detection using izMiR, followed by an evolutionary assessment of the detected hairpins using tools like RNAmicro26.

The izMiR framework contains all 13 individual models and the 6 ensemble methods. For the consensus models we assigned equal weights to all studies, although they did not perform equally well; this may be improved upon in the future. Moreover, our model selection is based on the highest accuracy score, which may not be the most reliable criterion in all cases. To address this, we provide other scores like the F-measure and Youden's index to aid alternative model selection strategies. We present the state of the art in ab initio miRNA detection, introduce methods to combine available models synergistically, and provide an implementation of all analyzed studies as well as of our consensus methods. The framework further simplifies the generation of new classifiers and enables their comparison to the state of the art, thus accelerating future developments.

How well do models trained on one species generalize to all other species available in miRBase? This question cannot be answered directly by simply applying the models to all available data from all species (Fig. 2; http://jlab.iyte.edu.tr/software/izmir). We conclude that the quality of the data available in miRBase determines the TPR and that other factors are likely of less importance (Fig. 2). Thus, AverageDT should be used for the detection of pre-miRNAs from any species. The izMiR framework allows the generation of new models that can be more effective for particular scenarios, for example a selected species. It should be noted that, although the number of different types of pre-miRNAs is unknown, we experienced difficulties when training models with fewer than 1000 positive examples. These 1000 examples further need to be of very high quality, for which a high RPM seems to be one indicator (e.g., RPM > 100); ideally, all examples would be manually reviewed and unlikely ones removed. Therefore, we suggest training new models only for very specific scenarios. The izMiR framework can also be used to develop new approaches and ensemble methods and to compare them to the state of the art. We believe there is still room for improvement and encourage the use of izMiR for the development of new approaches.

Millions of candidate hairpins exist in eukaryotic genomes, but the unequaled performance of the AverageDT model facilitates the computational detection of pre-miRNAs even in large eukaryotic genomes, which is of great interest since such pre-miRNAs are hard to detect experimentally. The application of our AverageDT model to the dme genome showed that, although our models were trained on human pre-miRNAs, they perform well in other organisms (Table 3, Fig. 2). Through manual inspection of the Drosophila hairpins from miRBase that were not classified as pre-miRNAs by our consensus models, such as dme-mir-4914, dme-mir-4912, and dme-mir-9382, we found that they have questionable secondary structures that do not conform to expectations for pre-miRNAs recognizable by the proteins of the miRNA biogenesis pathway (detailed assessments of all missed pre-miRNAs are available on the izMiR website). Evaluating the 56 dme pre-miRNAs of chromosome 2L available in miRBase reveals that the human-trained izMiR model generally gave lower scores to the dubious hairpins than the Drosophila-trained izMiR model. Applying both models to the candidate hairpins, 32 passed the hsa model at a confidence of 0.84 while 43 passed the dme model at a confidence threshold of 0.96. All but two of the dubious dme hairpins are excluded by the hsa model, whereas the dme model includes six of them. Further adjustment of the hsa model threshold can reduce the number of predictions such that experimental confirmation of all of them becomes possible. However, we suggest instead including targeting information27 and expression information15, 28, which further reduces the number of candidates and allows experimental validation of the filtered candidates. Pre-miRNAs are not the final mature form of miRNAs; izMiR does not predict mature miRNAs, but after detecting high-quality pre-miRNAs with izMiR, mature miRNAs can be located within them using existing tools like MatureBayes29 and MaturePred30.

izMiR is also instrumental for pre-miRNA detection from next-generation sequencing (NGS) data. With the advent of NGS, small RNAs like miRNAs have been successfully detected by mapping reads to a reference genome or to a de novo assembled transcriptome. Computational pipelines for the detection of pre-miRNAs from NGS data have been developed, and all contain a module that checks whether the mapped mature sequences are part of a viable pre-miRNA31,32,33. These pre-miRNA tests (some of which are quite basic) employ features describing pre-miRNAs and/or the read mapping, but have not been rigorously tested. Separating such tests into pre-miRNA detection using izMiR and filtering by read-mapping statistics would ensure that at least the pre-miRNA detection functionality has been thoroughly assessed.

In conclusion, izMiR allowed the impartial comparison of existing ab initio pre-miRNA detection tools, enabled the development of new and the integration of existing tools, was easily trained with novel data, was applicable to a wide range of species using AverageDT, and facilitated the detection of pre-miRNAs in large eukaryotic genomes.

Methods

Data sets for machine learning

Positive examples for machine learning were retrieved from miRBase34 (release 21), the de facto standard for positive training data in ab initio pre-miRNA prediction. We performed filtering operations, such as removing hairpins with identical sequences, which reduced the positive examples to 1828 human pre-miRNAs for the human training data set. For prediction, however, unfiltered miRBase data were used. For the Drosophila melanogaster analysis, we also generated models using the 256 dme hairpins from miRBase as positive data. Moreover, all miRNAs listed in MirGeneDB (v1.1)35 were included for prediction, both per species and as one combined data set.

Similar to an idea we outlined in a previous paper in the field of proteomics36, it is vital that training and testing data sets become more challenging as the accuracy of trained models increases. Therefore, we used a variety of negative data sets to enable comparison among detection methods and to establish the current state of the art:

  • Pseudo: previously published by Ng37, used for learning and prediction since the data set is challenging but not unsolvable38 and, therefore, a good basis for creation of robust models (negative data; 8492 hairpins)

  • Shuffled: derived from shuffling sequences of human positive data from miRBase, used for prediction (negative data; 1423 hairpins)

  • NotBestFold: created by not using the best fold proposed by RNAfold39 for human hairpins from miRBase, used for prediction (negative data; 1881 hairpins)

  • NegHsa: previously published by Gudys40, used for prediction (negative data; 68,048 hairpins*)

  • Zou: previously published by Zou et al.24, used for prediction (negative data; 14,246 hairpins)

  • Chen: previously published by Chen et al.41, composed of samples from Pseudo and Zou, used for prediction (negative data; 3054 hairpins)

  • mirgenedb: all miRNAs available in MirGeneDB (v1.1)35, used for prediction (positive data; 1434 hairpins)

  • hsa+: Homo sapiens miRNAs available in MirGeneDB (v1.1)35, used for prediction (positive data; 523 hairpins)

  • mmu+: Mus musculus miRNAs available in MirGeneDB (v1.1)35, used for prediction (positive data; 395 hairpins)

  • gga+: Gallus gallus miRNAs available in MirGeneDB (v1.1)35, used for prediction (positive data; 229 hairpins)

  • dre+: Danio rerio miRNAs available in MirGeneDB (v1.1)35, used for prediction (positive data; 287 hairpins)

  • hsaFR: created by generating random numbers between the minimum and maximum values of each feature in the human miRNA data set based on miRBase, used for prediction (positive data; 5000 hairpins); see the sketch after this list

  • hsaBQ: created by generating random numbers between the lower and upper quartile values of each feature in the human miRNA data set based on miRBase, used for prediction (positive data; 5000 hairpins)

  • hsaAM: created by generating random numbers between the 40th and 60th quantile values of each feature in the human miRNA data set based on miRBase, used for prediction (positive data; 5000 hairpins)

  • pseudoFR: created by generating random numbers between the minimum and maximum values of each feature in the pseudo data set, used for prediction (negative data; 5000 hairpins)

  • pseudoBQ: created by generating random numbers between the lower and upper quartile values of each feature in the pseudo data set, used for prediction (negative data; 5000 hairpins)

  • pseudoAM: created by generating random numbers between the 40th and 60th quantile values of each feature in the pseudo data set, used for prediction (negative data; 5000 hairpins)

*Note that the original NegHsa data set (http://adaa.polsl.pl/agudys/huntmi/huntmi.htm) contains many duplicate identifiers; we forced them to be unique, thereby reducing the amount of data from 87,000 to ~68,000 examples.
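As an illustration of how the FR, BQ, and AM sets were built, the following Python sketch samples each feature uniformly between two quantile bounds of a reference feature matrix. This is a minimal reimplementation under our own naming (synthetic_feature_set; the input feature matrix is a hypothetical NumPy array), not the original izMiR code.

```python
import numpy as np

def synthetic_feature_set(features, lo_q, hi_q, n=5000, seed=42):
    """Sample n synthetic hairpins in feature space: each feature is drawn
    uniformly between its lo_q and hi_q quantiles in the reference matrix.

    lo_q/hi_q = 0.00/1.00 -> FR (full range)
              = 0.25/0.75 -> BQ (between quartiles)
              = 0.40/0.60 -> AM (around the median)
    """
    rng = np.random.default_rng(seed)
    lo = np.quantile(features, lo_q, axis=0)  # per-feature lower bound
    hi = np.quantile(features, hi_q, axis=0)  # per-feature upper bound
    return rng.uniform(lo, hi, size=(n, features.shape[1]))

# e.g., a pseudoFR-style set from a (hypothetical) pseudo feature matrix:
# pseudo_fr = synthetic_feature_set(pseudo_features, 0.0, 1.0)
```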

Hairpin extraction from genome data

In order to extract hairpins from a genome, it was first divided into 500 nt fragments with 250 nt overlaps; each fragment was converted to RNA (T→U), and the reverse complement was generated for the template strand. All secondary structures were predicted using RNAfold39, and regular expressions were used to extract all structures resembling a hairpin (a stem with at least three consecutive base pairs and a terminal loop of at least three nucleotides). The resulting hairpins were filtered according to the human hairpin length distribution in miRBase, and duplicate sequences were removed. All features for pre-miRNA detection were calculated for the remaining hairpins, which were then analyzed with the trained human models according to the protocol we deposited on Nature Protocol Exchange (http://www.nature.com/protocolexchange/protocols/4919).
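A minimal sketch of this extraction pipeline follows, assuming the ViennaRNA Python bindings (RNA.fold) are installed. The hairpin regular expression is a simplified stand-in for the actual patterns used (it only matches perfect stems without bulges), and all function names are ours.

```python
import re
import RNA  # ViennaRNA Python bindings (assumed installed)

# Simplified hairpin pattern in dot-bracket notation: a stem of at least
# three consecutive base pairs enclosing a terminal loop of >= 3 nt.
HAIRPIN_RE = re.compile(r"\({3,}\.{3,}\){3,}")

def revcomp(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def candidate_hairpins(genome, frag=500, step=250, min_len=36, max_len=180):
    """Yield unique hairpin-like subsequences from both strands of a genome,
    length-filtered to the human stem-loop range quoted in the text."""
    seen = set()
    for strand in (genome, revcomp(genome)):
        for i in range(0, max(len(strand) - frag + 1, 1), step):
            rna = strand[i:i + frag].replace("T", "U")   # DNA -> RNA
            structure, _mfe = RNA.fold(rna)              # dot-bracket prediction
            for m in HAIRPIN_RE.finditer(structure):     # hairpin-shaped regions
                sub = rna[m.start():m.end()]
                if min_len <= len(sub) <= max_len and sub not in seen:
                    seen.add(sub)                        # drop duplicate sequences
                    yield sub
```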

The human genome (GRCh38, DNA, primary assembly) contains 12,399,093 fragments, from which 108,788,895 putative hairpins for one strand and 108,276,240 hairpins for the other were extracted and filtered based on hairpin length (between 36 and 180 nt, representing the shortest and longest human stem-loops in miRBase). After removing duplicate sequences from the 34,856,229 length-filtered hairpins, 27,932,492 putative pre-miRNA sequences remained. The same filtering approach resulted in 28,074,667 hairpins for the other strand.

The dme genome (BDGP6 genome assembly) was fragmented into overlapping (250 nt) fragments of 500 nt in length (575,896 fragments). RNAfold was used to predict the secondary structure of all fragments, and regular expressions were used to extract all structures remotely similar to hairpins, leading to about 5 million hairpins per strand. Hairpins with fewer than 30 nucleotides were filtered out, leaving ~2 million hairpins per strand. Chromosome 2L contained about 360,000 hairpins per strand; after removing duplicates, all hairpin features were calculated for this subset of putative pre-miRNAs (about 290,000 per strand).

For human, about 700 features would have to be calculated for all putative hairpins, which could take several months on a high-end personal computer. Processing such a large data set was beyond the scope of this study, so we selected the dme chromosome 2L to exemplify the effectiveness of our pre-miRNA detection method even for an evolutionarily distant species, employing a human-trained and a Drosophila-trained izMiR model.

Features for pre-miRNA parameterization

For machine learning, pre-miRNAs need to be parameterized, and many features have been described in the literature (Table 1). The features used for pre-miRNA prediction can be divided into four categories: sequence-based, structural, probability-based, and thermodynamic, although some features fall into more than one category. For this study, we implemented ~900 features, covering the features used or proposed in the 13 studies evaluated here (Table 1), features published in other studies, and features designed by us. Some of the proposed features were ambiguously defined, which left room for interpretation; we implemented them to the best of our understanding. There are various methods to calculate these features42; we define the features used in pre-miRNA analyses in another work25 and on our web page: http://jlab.iyte.edu.tr/software/izmir.
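To make the feature categories concrete, the following sketch computes one illustrative feature from three of the four categories; the feature names are ours, not the published feature definitions, and the ViennaRNA Python bindings are again assumed.

```python
import RNA  # ViennaRNA Python bindings (assumed, as above)

def example_features(hairpin):
    """One illustrative feature each for the sequence-based, structural,
    and thermodynamic categories (names are ours)."""
    structure, mfe = RNA.fold(hairpin)  # dot-bracket structure and MFE
    n = len(hairpin)
    return {
        "gc_content": (hairpin.count("G") + hairpin.count("C")) / n,  # sequence-based
        "paired_fraction": 2 * structure.count("(") / n,              # structural
        "mfe_per_nt": mfe / n,                                        # thermodynamic
    }
```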

Training models for pre-miRNA detection

In machine learning, a model is first trained on examples (here: positive and negative examples). Many training and testing schemes exist, and the different studies performing ab initio pre-miRNA detection used different approaches. We are concerned that high ratios of training (≥90%) to testing data are only useful when the amount of data is severely limiting, and that such schemes otherwise overestimate the actual model performance. Since the availability of data is indeed limited, we settled for a 70–30% training-testing scheme (Fig. 3). Classification for model generation and prediction on the data sets were performed using KNIME43, a workflow management and data analytics platform.

Fig. 3
figure 3

Model training workflow. Filtered human miRNA hairpins from miRBase served as positive data and pseudo hairpins as negative data. Each data set was randomly sampled individually; 70% of the positive data and the same number of negative examples were used during 1000-fold Monte Carlo cross-validation (MCCV)21. The remaining 30% of the positive data and the same number of negative examples were used for testing the model. In the end, the best naive Bayes and decision tree models were stored for prediction in PMML format, while SVM models were not stored as PMML due to limitations of the available SVM implementation

For classification (Fig. 3), there was a large class imbalance in the data, which may significantly influence overall performance25; it was thus essential to design an efficient learning strategy. Popular approaches like k-fold cross-validation and leave-one-out have shortcomings44, 45, so we decided to sample positive and negative data separately. After randomly sampling equal numbers of examples from the positive and negative data pools, the examples were randomly divided into training (70%) and testing (30%) groups. The input data were used to train three classifiers (NB, DT, and SVM), and their performance scores and PMML models (best only) were stored in each iteration. Through 1000 iterations of this sampling and learning procedure, we obtained the best PMML models for NB and DT. To evaluate model performance, the following measures were recorded: recall, precision, sensitivity, specificity, F-measure, accuracy, Cohen's kappa, and Youden's index. For SVM, Weka LibSVM (3.7) was used since it was fast compared to the other SVM implementations available in KNIME. However, since the Weka models' PMML output was not compatible with our system, we could not save SVM models, but we recorded scores during learning and testing to enable comparison with the NB and DT classifiers. The training workflow guarantees that each study and classifier receives identical data in each iteration, ensuring a fair comparison. The training workflow is publicly available on our web page (http://jlab.iyte.edu.tr/software/izmir) and is further explained in detail on Nature Protocol Exchange (http://www.nature.com/protocolexchange/protocols/4919).
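As an illustration, the following Python sketch mimics this balanced 70–30% MCCV scheme using scikit-learn decision tree and naive Bayes classifiers; the SVM step and the PMML export are omitted, and the function and variable names are ours, not the KNIME workflow's.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def balanced_mccv(pos, neg, iterations=1000, seed=0):
    """Balanced 70/30 Monte Carlo cross-validation over the feature matrices
    pos (positive hairpins) and neg (negative hairpins); keeps the
    highest-accuracy model per classifier, as described for izMiR."""
    rng = np.random.default_rng(seed)
    n = min(len(pos), len(neg))          # equal class sizes per iteration
    split = int(0.7 * n)                 # 70% training, 30% testing
    best = {"DT": (0.0, None), "NB": (0.0, None)}
    for _ in range(iterations):
        p = pos[rng.permutation(len(pos))[:n]]   # fresh random sample per class
        q = neg[rng.permutation(len(neg))[:n]]
        X_train = np.vstack([p[:split], q[:split]])
        y_train = np.array([1] * split + [0] * split)
        X_test = np.vstack([p[split:], q[split:]])
        y_test = np.array([1] * (n - split) + [0] * (n - split))
        for name, clf in (("DT", DecisionTreeClassifier()), ("NB", GaussianNB())):
            acc = accuracy_score(y_test, clf.fit(X_train, y_train).predict(X_test))
            if acc > best[name][0]:
                best[name] = (acc, clf)  # retain the best model so far
    return best
```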

Predicting pre-miRNAs with izMiR

Many studies present their model training results but fail to provide the actual model so that it can be used for prediction. In this study, we provide NB and DT models for each study as well as consensus models combining all studies.

For prediction, the best DT and NB models of each study were loaded into another workflow designed to apply these models to input data and to associate scores with the predictions, allowing detection of pre-miRNAs. The individual PMML models for each classifier produced during training can be used for pre-miRNA prediction in our KNIME-based system (http://jlab.iyte.edu.tr/software/izmir). Additionally, several consensus schemes were devised that combine the power of the individual studies to improve classification performance. These consensus approaches are also available in the izMiR framework.

To obtain a consensus result, equal weights were given to each model, and a given sequence was labeled “miRNA” by the ConsensusDT and/or ConsensusNB models if it was predicted as a miRNA by seven or more studies (majority vote).
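A sketch of this majority vote, assuming the 13 per-study predictions are collected into a binary matrix (a representation we chose for illustration):

```python
import numpy as np

def consensus_vote(study_labels, min_votes=7):
    """Majority vote over the 13 per-study predictions.

    study_labels: (n_sequences, 13) 0/1 matrix, 1 = study predicts 'miRNA'.
    Sequences reaching min_votes agreements are labeled 'miRNA'; all others
    are labeled 'negative' here, an assumption, since the text leaves the
    alternative label unspecified."""
    return np.where(study_labels.sum(axis=1) >= min_votes, "miRNA", "negative")
```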

For ConsensusRule-based prediction, the average DT and NB prediction scores of each putative pre-miRNA were taken into consideration: if the average DT score or the average NB score was larger than 0.89, the sequence was labeled “miRNA”; conversely, if the average DT score or the average NB score was less than 0.5, it was labeled “negative”; the remainder were labeled “candidate” pre-miRNAs.

AverageDT- and AverageNB-based predictions were performed in a similar manner to ConsensusRule: the average of the DT (or NB) prediction scores of each putative pre-miRNA was taken into consideration; if the average value was smaller than 0.5, the sequence was labeled “negative”, otherwise it was labeled “miRNA”.
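Both rule-based schemes can be sketched as follows; how to resolve the (rare) case where one averaged score exceeds 0.89 while the other falls below 0.5 is our assumption, since the text does not specify an order of precedence.

```python
import numpy as np

def consensus_rule(avg_dt, avg_nb, hi=0.89, lo=0.5):
    """ConsensusRule: three-way labels from per-sequence averaged scores
    (avg_dt, avg_nb are 1D arrays of the averaged DT and NB scores)."""
    labels = np.full(len(avg_dt), "candidate", dtype=object)
    labels[(avg_dt > hi) | (avg_nb > hi)] = "miRNA"
    # 'negative' is applied last, so conflicting cases resolve to 'negative'
    # here; this ordering is an assumption.
    labels[(avg_dt < lo) | (avg_nb < lo)] = "negative"
    return labels

def average_label(study_scores, cutoff=0.5):
    """AverageDT / AverageNB: mean of the 13 per-study scores per sequence."""
    return np.where(np.mean(study_scores, axis=1) < cutoff, "negative", "miRNA")
```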

For ConsensusModel prediction, scores were obtained for the human miRNA and pseudo data sets using the models of each study. These scores, ranging between 0 and 1, were then used to train a multi-layer perceptron classifier (following the same procedure as described for learning). The model with the highest accuracy and F-measure was stored for later use.

The input data for prediction were applied to all individual and consensus models described above, and the numbers of entries predicted as “miRNA”, “negative”, and “candidate” were returned for all of them. TPR and TNR were calculated as performance measures according to the following expressions:

TPR = (number of hairpins correctly classified as “pre-miRNA”/number of overall hairpins) × 100

TNR = (number of hairpins correctly classified as “negative”/number of overall hairpins) × 100
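In code, both rates reduce to a single helper; the example numbers in the comment are illustrative only.

```python
def correct_rate(n_correct, n_total):
    """TPR (on positive-only sets) or TNR (on negative-only sets):
    percentage of the set's hairpins receiving the correct label."""
    return 100.0 * n_correct / n_total

# Illustrative only: 1768 correctly labeled of 1828 hairpins -> ~96.7%
# print(correct_rate(1768, 1828))
```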

ROC curves to estimate model performance

The DT and NB models with the highest accuracy scores were applied to the human pre-miRNAs from miRBase and to the pseudo negative data set to estimate true and false positive rates and to construct ROC curves. Along with the study-based models, AverageDT and AverageNB, two of our consensus methods, were analyzed in this manner (Supplementary Figs. 4, 5).
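A sketch of this ROC construction from stored prediction scores, using scikit-learn (an assumption on our part; the original analysis was performed within KNIME):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def roc_from_scores(pos_scores, neg_scores):
    """ROC from model scores on positive (miRBase) and negative (pseudo)
    hairpins; returns false positive rates, true positive rates, and AUC."""
    y_true = np.concatenate([np.ones(len(pos_scores)), np.zeros(len(neg_scores))])
    y_score = np.concatenate([pos_scores, neg_scores])
    fpr, tpr, _thresholds = roc_curve(y_true, y_score)
    return fpr, tpr, auc(fpr, tpr)
```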

Data availability

The data sets generated during and/or analyzed during the current study are available in the izMiR repository, http://jlab.iyte.edu.tr/software/izmir.