Introduction

MicroRNAs (miRNAs) represent a class of short (18–28 nucleotide [nt]) non-coding RNA molecules. They achieve post-transcriptional and translational regulation of protein expression via base-pairing with complementary sequences of messenger RNA (mRNA) molecules.

Gene regulation by miRNAs does not adhere to a simple one miRNA–one target gene mapping. Rather, the distribution of predicted targets reflect commensurately more complicated miRNA–mRNA combinatorics: miRNAs exhibit target multiplicity wherein more than one mRNA is targeted per miRNA, and mRNAs exhibit signal integration wherein more than one miRNA may target an mRNA. Consequently, the multi-valency of miRNAs enable their targeting of multiple genes, thus regulating the expression of several proteins. These miRNAs play key roles in gene regulation and their dysregulation is associated with several diseases1,2. Studies have revealed miRNAs involved in disease pathogenesis3, biological development4, stress response5, and cell cycle control6. The elucidation of miRNAs within genomes is, thus, critical to understanding the underlying mechanisms of organismal biology and cellular function.

While wet-laboratory experimentation are traditionally used to identify miRNA target interactions and gene regulation, these methods are resource-intensive as compared to complimentary computational approaches. Common examples of wet-laboratory experiments used for miRNA target interactions include low-throughput methods such as quantitative polymerase chain reaction (qPCR), western blotting, and reporter gene assays; numerous computational approaches leverage the higher throughput methods which are based on cross-linking and immunoprecipitation (CLIP). Here, we provide a brief overview of contemporary wet-laboratory miRNA–mRNA interaction detection methods.

The qPCR and western blot methodologies are used to determine the change in mRNA or protein concentrations and are, therefore, considered indirect methods for the detection of physical interactions7. Reporter gene assays, such as the luciferase reporter assay, transfect miRNAs into a cell line that stably expresses a luciferase reporter containing the 3‘-UTR of the target miRNA being investigated in order to quantify the degree of interaction based on the change in reporter gene expression8. The reporter gene methods are generally more informative to elucidate specific miRNA–mRNA binding and are, therefore, considered a high-confidence measure of interaction9. The high-throughput sequencing of RNA isolated by CLIP (HITS-CLIP) is also used to investigate miRNA–mRNA interactions by probing for Argonaute-miRNA and Argonaute-mRNA interactions and overlaying the results to identify putative interactions10. Similar to the HITS-CLIP and CLIP-Seq methods is the photoactivatable-ribonucleoside-enhanced CLIP (PAR-CLIP) procedure which leverages a more efficient cross-linking to stabilize the protein-RNA complexes and identified RNA-binding proteins sites on the target RNAs11. Finally, the cross-linking ligation and sequencing of hybrids (CLASH) method, as its name indicates, is an experimental procedure to identify miRNA–mRNA interaction sites using cross-linking, ligation, and sequencing of hybrids12. Taken together, the low- and high-throughput wet-laboratory techniques provide an ensemble of methods to identify interacting miRNA–mRNA pairs amenable to developing learning algorithms capable of further exploring the space of possible pairs to identify putative interactions.

Current state of miRNA–target prediction

Within humans alone, the number of known miRNA exceeds 2,30013 and the human mRNA population is estimated to exceed 25,00014; current wet laboratory technology is unable to feasibly test each possible miRNA–gene pair. Consequently, there is significant interest in using computational approaches for miRNA target prediction. These predictors are used to narrow the scope of potentially interesting interactions, functioning as a guide to wet-laboratory validation experiments to more rapidly elucidate gene regulation networks.

Many ancestral and popularized computational methods follow an ab initio approach, including miRanda15, TargetScan16, and PITA17. These methods compare the nucleotide sequences of the miRNA against a \(\sim 20-30\)nt region of the mRNA in a search for matching sub-sequences. These methods also incorporate other string-matching rules based on the observed deviations from Watson-Crick pairing rules, such as G:U wobble pairing18, originally discovered in early miRNA–mRNA experimental validations. miRanda uses a scoring matrix based on the complementarity of each nucleotide pair, where a set of weighted heuristic rules for each nucleotide pairing contribute to a summative score for a given interaction15. PITA leverages both structure- and sequenced-based information by computing fusing target-site accessibility (the energy required to access target binding sites within mRNA secondary structures) with sequence-based matching to determine an improved overall score17. TargetScan leverages a series of stepwise linear regression models to identify the most informative features from 74 datasets culminating into what the authors denote the context++ model, demonstrated to outperform preceding methods as of 201516.

In the time since, increasingly accurate miRNA target prediction algorithms have emerged, many leveraging machine learning as reviewed in19. Notable examples of classical machine learning include TarPMir20, RFMirTarget21, and MirTarget22 with recent models leveraging deep learning, including MiRTDL23, DeepMirTar24, and miRAW25. MirTarget is a Support Vector Machine (SVM) trained on CLIP experimentally validated interactions and miRNA overexpression data. The miRNA overexpression data provides a complimentary view to understanding functional targets as the elucidation of target interaction does not necessarily result in gene down-regulation22. The miRTDL method implemented a Convolutional Neural Network (CNN) with selected features obtained from the convolved feature maps23. DeepMirTar used Stacked denoising Autoencoders (SdA) to learn a lower-dimensional representation of latent features24 while miRAW leveraged autoencoders without the denoising step25. TarPMir used a Random Forest (RF) classifier trained on an experimentally validated dataset20. RFMirTarget also used a Random Forest classifier, however it was trained on data originally pre-computed by miRanda, thus acting as a cascaded refinement of ab initio predictions21. While all methods reviewed here relate to miRNA target prediction, a subset formulate the problem for the identification of the binding site as distinctly different from scoring the likelihood of interaction between a given miRNA and target; this work focuses on the latter problem.

Pre-computed prediction databases facilitate access to predictions without having to execute predictive models. Conveniently, the predictions from multiple predictors have been aggregated in databases to generate a quantitative measure of confidence in a given miRNA–gene pair. The largest of such databases, both in number of integrated sources and total number of pairs, is the mirDIP pre-computed miRNA–target interaction database9. This dataset is a boon, not only to wet-laboratory experimentalists, but also to researchers seeking to develop new methods to further improve miRNA target prediction. Figure 1 provides a conceptual overview of how the mirDIP dataset was used to develop one such cascaded machine learning method.

Figure 1
figure 1

Conceptual overview of the Reciprocal Perspective algorithm applied to miRNA Target Prediction. Panel (A) depicts a high-level overview of cascading the Reciprocal Perspective learning algorithm based on an initial set of predictions from an initial learning algorithm. Panel (B) depicts an in-depth illustration of the combination of numerous individual miRNA targeting predictors to produce an ensembled mirDIP score. These scores are used to generate One-to-All score curves from which the Reciprocal Perspective features are extracted to train the cascaded learning algorithm and produce the final predictions for evaluation.

Reciprocal Perspective cascaded learning algorithm

The Reciprocal Perspective (RP) method was originally introduced as a cascaded, semi-supervised learning algorithm to improve the pair-wise predictive performance of existing learning algorithms within the context of protein-protein interaction (PPI) prediction26. Figure 1A provides a conceptual overview of how the scores produced by an initial predictor can be used to train a cascaded learning algorithm to provide refined prediction. Leveraging the output scores generated by an initial learning algorithm as input to the RP method, those scores are cast into a new rank-order domain denoted a One-to-All score curve (O2A; Fig. 2) which, in the case of miRNA targeting prediction, provides two complimentary views, an miRNA-based “perspective” and a gene-based “perspective” (Fig. 1B).

For a given query pair (XY), RP examines the pair’s predicted score in the context of all predicted scores for all pairs involving either \((X,*)\) or \((*,Y)\). By leveraging predictions made on pairs not definitively known to be positive or negative (i.e. unlabelled pairs), this method can be described as semi-supervised machine learning. Both labeled and unlabeled scored pairs are available for feature extraction and use as part of a cascaded learning algorithm. For example, by examining the O2A curve for a given miRNA, X, it is relatively straightforward to determine a suitable local decision threshold for all putative targets of that miRNA. By repeating this analysis for the given gene, Y, a local threshold can be determined for all miRNA that may interact with Y. In combination with the local threshold, these O2As enable the extraction of several additional context-based features from and between these two views for each (XY) to train the RP cascaded learning algorithm and refine the original predictions. Applied to two state-of-the-art PPI predictions over five organisms, the RP-augmented models produced a statistically significant improvement over all conditions26.

Figure 2
figure 2

Example one-to-all score curves. In both perspectives, each point represents the mirDIP predicted score for one gene-miRNA pair. In the miRNA perspective (left), each curve represents all miRNA–gene pairs involving a specific miRNA (distinguished by shade), sorted by prediction score. The complimentary perspective (right) contains the rank order distribution of pairs involving individual genes (again distinguished by shade), where each data point is a specific miRNA. For each curve, the point identified as the knee is highlighted with a horizontal dashed line. The pair involving the hsa-miR-548ar-3p miRNA and ZNF674 gene is highlighted in each perspective with a larger marker and the corresponding colour. Since it falls above the knee in both perspectives, it will likely be scored highly by RPmirDIP.

Cascaded prediction refinement techniques in miRNA–target prediction is common. For example, the RFMirTarget method21, previously discussed, improves the predictions produced by miRanda15 using an additional 34 sequence-based features in an RF model to refine predictions. The RP method differentiates itself from other cascaded predictors in that it is domain-agnostic; it derives features from a domain entirely independent of the context of the original problem, functioning as an error-correction process. This flexibility enables it to be applied broadly to pair-wise prediction tasks in combination with a given initial predictor.

However, in its original implementation, the RP algorithm required an initial predictor capable of generating a complete-graph of prediction scores; that is, generate a score for every possible pair producing a comprehensive prediction matrix (CPM). This is considered computationally intractable for the majority of predictors (typically too slow) or for the majority of tasks (typically too large). In this work, we not only seek to improve the state-of-the-art in miRNA prediction, but additionally demonstrate the applicability of RP to predictions tasks where the assumption of a complete CPM is relaxed. To the best of our knowledge, this work presents the first instance where the predicted outputs of miRNA–gene pairs in a (near) all-to-all context has been reported.

Methods

The following section describes the acquisition of pre-computed miRNA–target prediction data (i.e. initial scores); the collection of experimentally validated miRNA–target interactions to be used for training and testing (ground truth labels that were experimentally validated only after the generation of initial scores); the adaptations to the RP learning algorithm for this task; and the implementations of the RP cascaded model. Figure 1 depicts the conceptual overview of the prediction pipeline including data acquisition, RP feature extraction, training of a cascaded learning algorithm, and performance evaluation. Briefly, the prototypical prediction pipeline will use a data source to train and evaluate a predictor and generate a set predictions. The RP method leverages these predicted scores as a data source (i.e. in a cascade) to train and evaluate the RP model to generate a final set of predictions (Fig. 1A).

Acquisition of the miRNA–target prediction data

The miRNA–target prediction data was acquired from the mirDIP database. This database contains 41 million unique miRNA–gene interactions, involving 2,585 and 27,591 unique miRNA and genes, respectively. Each interaction is assigned an integrative score, which is statistically inferred using predictions obtained from 30 independent resources9. The mirDIP database contains predicted scores for approximately 60% of all possible interactions for this set of miRNA and genes. Notably, the size of the predicted interaction set for a given miRNA varies within the range [44, 17596] with an average of 15,055 and median of 15,118. The size of the gene set varies within the range [5, 2375] with an average of 2,207 and median of 2,375 unique miRNAs targeting a typical gene.

Experimentally validated miRNA–target interactions

Two sets of experimentally validated miRNA–gene interactions were obtained from DIANA-TarBase v87 and mirTarBase v.8.0. The number of validated interactions between these databases is 684,107, consisting of 2,585 miRNAs and 17,629 genes. We eliminated any genes lacking at least one validated interaction; the O2A, by definition, requires at least one validated data point to extract the context-based features.

These validated interactions were then split into a training set for the cascaded classifier and a hold-out test set for independent performance evaluation. In order to mitigate potential bias by using test data that may have also been used to train mirDIP or its sources, the test set was curated by selecting only recently validated interactions. Recency was enforced by collecting a test set comprised only of interactions new to TarBase v8, which was published after the acceptance of the mirDIP 4.1 publication9. The resulting test set comprised 31,131 positively validated interactions; the remaining 652,976 positive samples were used for training the RP model.

To train and test the RP model, a set of negative pairs is also required. This poses a challenge, as there is a lack of validated non-interactions. Among published non-interactions, the reliability of these findings is considered by some to be questionable9. Furthermore, Helwak and Tollervey emphasized that pair non-interactivity may be due to properties of the specific detection method used rather than a property of the pair interactivity itself12. To overcome this, we conservatively assumed that pairings that are not validated are non-interactions. A uniform, random sample of these pairs was obtained to match the number of validated interactions in order to create a balanced training and testing set. Since the vast majority of miRNA–gene pairs are not expected to interact, selecting our negative set in this way was reasonable and follows the methodology used in several previous studies9,27.

Adapting the baseline estimation for variable and sparse one-to-all curves

The O2A curve plots the rank-order distribution of scores involving a given miRNA or a given gene. That is, if a given miRNA has a predicted score for 200 distinct genes, its O2A is a monotonically decreasing curve with 200 points; the highest scoring gene is plotted with rank 1, the second highest as rank 2, and so on. In its original formulation, the RP method would require that every miRNA be scored against all possible genes, and conversely, all genes be scored against all possible miRNAs26.

A consequence to relaxing the complete-graph constraint is a variable number of data points within the rank-order O2A for a given miRNA or gene. For example, a given gene that has only been predicted to interact with ten miRNAs will have a sparse O2A as compared to an miRNA that has been predicted to target hundreds of genes. The original baseline estimation method used within RP leveraged LOESS to first fit a smooth curve to the O2A data. This approach assumed a sufficiently large number of data points to reliably identify the knee of the curve. Therefore, for curves with a sufficient number of points (i.e. \(n > 100\)), the Kneedle algorithm28 was used to locate the knee of the curve. To accommodate those curves with sparse number of points (i.e. \(1 \le n \le 100\)), which only represented less than 1% of the curves, we leveraged the median of the distribution of scores in the O2A to define the baseline.

Table 1 Extracted RP features for use in the cascaded model.

Adaptation of RP feature calculation

For each miRNA–gene interaction, a set of features adapted from the original RP implementation were computed based on the identified baseline. These features leverage the distribution of non-validated pairs to extract “context”-based information available from all the scored interactions involving a specific miRNA and gene.

Modifications to the original RP implementation enabled its application to the miRNA–gene task. To accommodate a variable number of data points in the O2A, we substituted the absolute rank order of a scored pair in favour of its percentile rank. This ensured that rank-type features could be compared across different sized miRNA and gene perspectives. The binary RP metric “Above-Global-Threshold” was excluded as it was determined that RP models could learn the optimal global threshold using the raw mirDIP score. Similarly, the binary features for indicating if a score is above a local threshold were excluded as the Fold-Difference metrics provide similar information to the model. The resulting features used for training the cascaded learning algorithm are shown in Table 1.

Training the cascaded machine learning model

Two learning algorithms were independently considered for the training and evaluation of the RP cascaded model. From its reported successes as part of the cascaded RFMirTarget method, we considered a Random Forest model. The model, denoted RPmirDIP*, was trained on the RP features derived from the training set (Table 1). Ten-fold cross validation was performed for hyperparameter tuning. This produced a model where the maximum size of the feature subset considered at each split was four, the forest comprised 100 trees, each with a maximum tree depth of 19. For its widespread application in various machine learning tasks and reported success within Kaggle competitions, the eXtreme Gradient Boosting (XGBoost) model was also considered. The model, denoted RPmirDIP, was trained on the RP features derived from the training set (Table 1). The learning task used logistic regression as the objective function and ROC AUC as the evaluation metric. The default parameters for the tree booster were used: the learning rate was set to 0.3, the \(\gamma\) parameter was set to 0, and the max depth was set to 6. The model was trained over 200 iterations with an early stopping of five rounds.

Evaluating RP performance improvement using independent miRNA targeting predictors

To comprehensively evaluate the utility of RP for improving the predictive performance for independent miRNA targeting predictors, we compared the performance of the predictor alone with the RP-augmented predictions of that predictor, mirDIP, and RPmirDIP. Twenty-six independent methods were augmenting using the RP cascaded model. For each method and its dataset of pre-computed predictions, RP features were calculated (Table 1). Since each individual dataset did not contain predictions for the entirety of the training and testing set, only the overlapping subset was used (Supplementary Figure 2). We note that the training datasets used to produce the predicted scores for each individual method in the mirDIP database are unknown which risks the possible inclusion of training samples among these datasets. This may result in the possible overstatement of results, however, this bias would be consistent across the four compared models which makes for a fair comparison. For each method, the XGBoost algorithm, using the same hyperparameters as RPmirDIP, was trained using the corresponding training set and features. To evaluate performance, the testing set was used to calculate the ROC AUC and PR AUC.

Results and Discussion

Of the two learning algorithms independently trained on the RP features, Random Forest and XGBoost, the most performant was selected as the official RPmirDIP model (XGBoost) while the less performant was denoted the RPmirDIP* model (Random Forest). Here, we report and discuss the results of the RPmirDIP model and leave the RPmirDIP* results to the Supplementary Materials. Similarly, the experiments applying RP to 26 individual miRNA targeting predicitors are sumarized with supportive material in the Suppmentary Materials.

The evaluation of the performance difference between the original mirDIP method and the RP-augmented scores was achieved using bootstrap testing (\(n = 1,000\)) on the test set choosing both the area under the resulting precision-recall and receiver operating characteristic curves (PR AUC and ROC AUC, respectively). Considering the null hypothesis, (\(H_0\): no significant difference in AUC between Original mirDIP and RP-augmented mirDIP), p values were computed using Welch’s unequal variances t-test and the observed differences in AUC were significant at the \(p < 0.001\) level (Table 2; Fig. 3A,B). Interestingly, the RPmirDIP model benefited considerably from an increasingly large training set, with sharp improvements in AUC observed as the training set size approached 100K samples with relatively diminishing, though marked, improvement thereafter (Fig. 3C). The distributions of mirDIP, RPmirDIP, and RPmirDIP* scores are depicted in Fig. 4.

Table 2 Performance following 1,000 bootstrap iterations. Bold identifies the highest score achieved by a model for a given metric.
Figure 3
figure 3

Results comparing the predictive performance of RPmirDIP and RPmirDIP* models against the original mirDIP predictions. (A) and (B) depict the ROC and PR curves, respectively. (C) illustrates the change in predictive performance as a function of the size of the training set. (D) is a kernel density estimation plot of the pair-wise difference in scores between RPmirDIP and mirDIP.

Figure 4
figure 4

Comparison of the distribution of scores between mirDIP (A), RPmirDIP (B), and RPmirDIP* (C).

A further analysis of the relative feature importance from the resultant RPmirDIP model revealed a heavy reliance upon the raw mirDIP score, which is intuitive given that RPmirDIP needed to compensate for the removal of the binary “Above-Global-Threshold” feature which originally captured global-level information (Fig. 5). Notably, the RP-derived features, particularly those from the miRNA persepctive (sxtrxtpdxfdx), each contributed complimentary information to the model, exemplified both by their low correlation with the mirDIP score and relatively large information gain. Furthermore, while the RPmirDIP model does not place considerable emphasis on the ARRO feature which has the lowest correlation with the mirDIP score, the RPmirDIP* model considers it as the third highest feature by importance (Supplementary Fig. 1A). This strongly suggests that the ARRO feature, which encodes the reciprocal, context-based information provided from both perspectives, is independent of the predicted score and provides complimentary information useful to distinguishing interacting pairs from non-interacting pairs.

Figure 5
figure 5

Exploration of RPmirDIP Features. (A) is a heatmap of feature correlation. Notably, the ARRO metric is the least correlated with other features and, as expected, rank (rxy, ryx) are strongly correlated with percent difference (pdx, pdy), respectively. (B) Plots the gain of a given feature as indicative of the relative contribution of that feature to the model based upon the feature’s contribution to each tree in the model. Note the break in the y-axis due to the large relative information gain on the initial score. Formally, gain is the average reduction in training loss when selecting that feature for splitting.

We further analyzed the RP contribution to increased performance by computing the score augmentation when applying RPmirDIP and plotting the distribution of the score difference with mirDIP for the validated interactions and non-interactions from the test set (Fig. 3D). Interestingly, we observe a consistently positive increase in score for both classes with only a rare few receiving a decreased score. Consistent with the strong results observed for RMmirDIP, the validated interactions generally received a larger score augmentation than the non-interacting samples. These findings suggest that the use of interaction-specific context-based RP features increase overall predictive performance by augmenting the scores of true interactions for which the initial predictors had originally assigned a lower score that failed to exceed the globally defined threshold. This further suggests that the application of RP leverages previously underutilised information within pair-wise data sets to increase discriminability of the classes.

Reciprocal Perspective improves performance of 26 miRNA targeting predictors

To comprehensively determine whether RP would consistently improve the predictive performance of independent miRNA targeting predictors regardless of their unique architectures and implementations, we compared the predictor’s performance with an RP-augmented model as well as the mirDIP and RPmirDIP models (Fig. 6). Promisingly, in all cases the cascaded application of RP to each predictor resulted in a notable increase in ROC AUC (between [0.0606–0.4999]) and PR AUC (between [0.0045–0.4333]). Comparing the RPmirDIP model trained on the subset of data available to each method (see Supplementary Materials for details), we observed further increases in ROC AUC (between [0.0198–0.2321]) as well as PR AUC (between [0.0021–0.1048]). Table 3 summarizes the various sizes of the datasets used for each experiment and supporting information can be found in the Supplementary Materials.

We note that where the initial predictor performs particularly well, there is little additional performance gain to be made (e.g. PR AUC of TargetRank and TargetScan). Conversely, where the initial model performs relatively poorly, there are substantial gains in performance observed, both in the application of RP to that model as well as with the use of the ensembled mirDIP score in conjunction with RP (i.e. RPmirDIP). From these experiments, we arrive to the following conclusions:

  1. 1.

    The cascaded application of RP to an initial predictor results in improved predictive performance (i.e. RP + Predictor).

  2. 2.

    The cascaded application of RP to an ensemble-based predictor (e.g. mirDIP) may result in further improvement in predictive performance (i.e. RPmirDIP).

  3. 3.

    RP compliments miRNA targeting predictor reliant only on the scores produced by that predictor and no other information.

Figure 6
figure 6

Increase in individual predictor performance using RP and the RPmirDIP model. The 26 prediction methods are ordered by the size of the available training set and each is notably improved with the combined use of RP and the use of the ensembled mirDIP score leveraged by RPmirDIP.

Table 3 Comparison of the predictive performance of 26 individual miRNA targeting predictors.

Discovering novel putative interactions

The accurate prediction of miRNA–target interactions is critical to our understanding of dynamic biological regulation networks. In miRNA–target prediction, RPmirDIP represents a novel method to discover new and possibly unexpected interactions that current predictors missed. Analyses and experimental validation are warranted for the set of interaction which mirDIP originally assigned low confidence scores, in contradiction to RPmirDIP’s assignment of a substantially higher score. These putative interactions promise potentially unique information that may improve our overall understanding of miRNA targeting and gene expression networks. For the benefit of the broader scientific community, we make available the sorted list of these putative interactions, available at https://doi.org/10.5683/SP2/LD8JKJ54.

We re-scored all \(\sim \! 30\) million pairs listed in the mirDIP database using both the RPmirDIP and RPmirDIP* models. We then used the RPmirDIP scores to compute the Difference of Scores (DoS) with the original mirDIP prediction, defined as the RPmirDIP score minus the mirDIP score. With a noticeable peak in non-interactors around a DoS of 0.5 in Fig. 3D, we posit that these pertain to putative false negatives. Of the \(\sim \! 30\) million predictions, we extracted those with a \(DoS \ge 0.5\) and then sorted this set in two ways, one by DoS and another by RPmirDIP score; the top-10 interactions for each set are tabulated in Table 4. An analogous table of the top-10 interactions stored by mirDIP score is available in the Supplementary Materials. Following the mirDIP dataset convention, each of these sets were split into top-1%, top-5%, top-10%, top-33%, and bottom-66% sets and were released to the scientific community. The data are available at cu-bic.ca/RPmirDIP and https://doi.org/10.5683/SP2/LD8JKJ,54.

Table 4 Top-10 predictions when sorted by difference of score (top) and by RPmirDIP Score (bottom). Bold emphasizes the column that is sorted.

The two sorting methods are meant to capture differing and complimentary rankings of interactions. The results sorted by RPmirDIP score are analogous to the results presented in the mirDIP database; they correspond to those interactions for which the trained model places the greatest confidence in being a true interaction. Sorted in this way, the ranking doesn’t account for the magnitude of the DoS; pairs high-scoring in mirDIP can be expected to be generally high-scoring in RPmirDIP.

The results sorted by DoS present pairs that are of a possibly more serendipitous nature. That is, these results are those for which mirDIP assigned very little confidence (possibly considered to be non-interacting) yet RPmirDIP assigned great confidence. Per the KDE plot in Fig. 3D, these are the pairs with the greatest likelihood of being novel discoveries to elucidate new gene regulatory dynamics. By definition, the predicted pairs with the highest DoS also tend to have very high RPmirDIP scores. While not included within this work, there also exists the possibility to generate a new ranking of pairs by combining scores. For example, generating a ranking for which both mirDIP and RPmirDIP are most confident, the mirDIP score, \(m_i\), of a given pair i can be multiplied by the RPmirDIP score, \(r_i\). Applied to all pairs, the resulting set of scores, \({\varvec{s}}\), can be sorted into rank order (and optionally min-max normalized to reside in [0,1]):

$$\begin{aligned} s_i = m_ir_i \end{aligned}$$
(1)

This assumes an equal weight (i.e. 0.5) is given to both the mirDIP and RPmirDIP scores. A generalization of this transformation would incorporate a weighting factor, \(\lambda \in [0,1]\), for the RPmirDIP score such that the mirDIP score is weighted by a factor \(1-\lambda\):

$$\begin{aligned} s^{\lambda }_{i} = (1-\lambda )m_i \times \lambda r_i \end{aligned}$$
(2)

For example, more heavily weighting the RPmirDIP score using \(\lambda =0.9\) yields the set:

$$\begin{aligned} \varvec{s^{0.9}} ~\leftarrow ~ s^{0.9}_{i} = 0.1 m_i \times 0.9 r_i,~~\forall ~i \end{aligned}$$
(3)

Moreover, this form of exploratory re-ranking can be applied between any of the three available values, RPmirDIP score, mirDIP score, and the DoS. However, the multiplication of either RPmirDIP and DoS or mirDIP and DoS with a \(\lambda =0.5\) will produce the identical ranking as the ranking by DoS, by definition. The investigation of such rankings is left to the users of the RPmirDIP datasets and future work.

Review of corroborating literature

Considering the twenty interactions resulting from the top-10 pairs sorted by RPmirDIP score and DoS, we reviewed the existing literature for corroborating evidence. We summarized relevant details of the four miRNAs appearing within this set in Table 5 along with selected references. Interestingly, each of the four appear to have critical oncogenic roles, typically in the suppression of factors leading to cell proliferation and metastasis; three of the four miRNAs are involved in the suppression of non-small cell lung cancer proliferation and metastasis. Specifics to each miRNA and their predicted mRNA partners are discussed in the following sections. The corresponding Table 4 results produced from the RPmirDIP* model are available in Supplementary Table 1.

Table 5 Details of the miRNAs identified among the Top-10 RPmirDIP predictions.

hsa-miR-522-5p

Among the set of top-10 predicted pairs sorted by DoS, hsa-miR-522-5p is involved in nine. While the mirDIP dataset assigns a nearly negligible score to each of these pairs, the RPmirDIP score approaches the maximal value producing the highest possible DoSs. Interestingly the mirDIP dataset lists 18,537 scores involving hsa-miR-522-5p however only four genes are ranked within the “Very High” confidence class:

  1. 1.

    Gene: NRCAM, Accession: Q92823, mirDIP: 0.4338

  2. 2.

    Gene: CBX4, Accession: O00257, mirDIP: 0.4047

  3. 3.

    Gene: CBX8, Accession: Q9HC52, mirDIP: 0.3941

  4. 4.

    Gene: GIPC3, Accession: Q8TF64, mirDIP: 0.3863

Each of the nine genes fell within the bottom 66% or “Low” confidence class. Furthermore, the RNACentral entry for this miRNA lists 2,964 interacting target genes and their proteins, however, none of which are the nine among these top-10, suggestive that these nine interactions may present novel interactions. RNACentral does list targeted genes related to those predicted by RPmirDIP: the Annexin protein ANXA5 is targeted, but not the predicted ANXA4; the F-Box proteins FBXO9, FBXO25, FBXO32, and FBXO45 are each targeted, but not the predicted FBXO33; the Eukaryotic transition initiation factor 4 (EIF4) proteins EIF4G1, EIF4G2, EIF4B, EIF4EBP2, and EIF4EBP3 are each targeted, but not the predicted EIF4A2; finally, the Mitogen-Activated Protein Kinase (MAPK) proteins MAPK1IP1L, MAPKAPK3, MAPK14, MAPKAPK2, and MAPK9 proteins are each targeted, but not the predicted MAPK6. No proteins related to any of AGO2, TSC22D2, NCKAP1, CASTOR2, or MARCKS were listed in RNACentral.

Of the nine candidate interactors, AGO2 is an effector of small RNA mediated gene silencing and, therefore, a promising candidate interactor with critical roles in oncogene progression64. FBXO33 is a prognostic marker in both renal and lung cancers, while CASTOR2 is disease-associated with spinal cord glioma, spinal cancer, and its RNA is overexpressed in testis cancer. The miRBase dataset lists 16 open access papers that mention hsa-mir-522, many relating to the proliferation of tumor cells. The miRNA, when downregulated, suppresses tumorigenesis by directly regulating the DENN Domain Containing 2D (DENND2D) tumor suppression gene for non-small cell lung cancer cells55. Additionally, hsa-miR-522-5p regulates cell proliferation, detachment, migration, and the epithelial-mesenchymal transition56. This curated evidence, while preliminary and circumstantial, suggests that hsa-miR-522-5p may have a more extensive role in cancer dynamics, warranting further investigation.

hsa-miR-34a-5p

The hsa-miR-34a miRNA is a regulator of tumor suppression and its use as part of an miRNA-based oncosuppressor replacement therapy is an effective strategy against tumor heterogeneity59. Among the set of top-10 predicted pairs sorted by RPmirDIP score, hsa-miR-34a-5p is involved in eight, with a ninth among the top-10 by DoS. The mirDIP database lists 19,748 pairs involving hsa-miR-34a-5p of which 1,330 are within the “Very High” confidence class, containing six of the nine genes (excluding TUBB2A, CALM3, and SLC10A6). Therefore, these predictions may may simply reiterate previous findings from an miRNA having been previously extensively investigated, reinforcing previous findings. To that point, the miRBase database lists 927 open access papers that mention hsa-mir-34a. Numerous studies demonstrate hsa-mir-34a’s involvement in the initiation and progression of cancers. It has been demonstrated to inhibit the proliferation and metastasis of osteosarcoma cells both in vitro and in vivo57. The under-expression of hsa-mir-34a led to the development and progression of human malignancy via Notch158. Moreover, in combination with p53, hsa-miR-34a-5p has been demonstrated to suppress colorectal cancer metastasis by inhibiting cell proliferation, migration, and invasion65.

While not directly implicated in the development or progression of human cancers, we here briefly introduce the major functions associated to the predicted pairs. To appreciate the diversity of genes and functions highly predicted with this miRNA, the analyses should be expanded beyond the top-10. The Solute Carrier Family 10 Member 7 (SLC10A7) protein was targeted in the RNACentral entry of hsa-miR-34a-5p, but the predicted SLC10A6 was not. This family of proteins is responsible for the sodium-dependent transport of sulfoconjugated steroid hormones, taurolithocholic acid-3-sulfates, sulfoconjugated pyrenes66. The PKNOX1 is a homeobox gene that is disease-associated with Down Syndrome67. Members of the WD repeat protein family are involved in a variety of cellular processes such as apoptosis, cell cycle progression, gene regulation, and signal transduction68. The G-rich sequence factor 1 (GRSF1) protein regulates post-transcriptional mitochondrial gene expression and is necessary to recruit mRNAs to mitochondial ribosomes. The OTU domain-containing protein 3 (OTUD3) hydrolyzes Lys-6- and Lys-11-linked polyubiquitin. The Calmodulin-3 (CALM3) protein mediates the activity of a diverse array of proteins via calcium-binding. The Sorting nexin-33 (SNX30) protein is required to eficiently progress through mitosis and cytokinesis (e.g. necessary for normal formation of the cleavage furrow at the end of mitosis). Finally, the Paralemmin-2 (PALM2) protein is disease-associated with Hypertrichosis Universalis Congenita, Ambras Type and Kallmann Syndrome.

hsa-miR-675-5p

The hsa-miR-675-5p miRNA is another that has been extensively studied for its role in suppressing cancers, it has been found overexpressed in many cancers. The downregulation of miR-675-5p has previously been demonstrated to suppress lung cancer progression and metastasis through the regulation of the G protein-coupled receptor 55 (GPR55)60. Another study evidences its oncogenic role in esophageal squamous cell carcinoma (ESCC) by inhibiting RALBP1 Associated Eps Domain Containing 2 (RESP2) via the RalBP1/RAC1/CDC42 signaling pathway; among ESCC patients, hsa-miR-675-5p is a valuable prognostic biomarker and therapeutic target61. The miRBase database lists 106 open access papers that mention hsa-mir-675.

One pair involving this miRNA appeared among the top-10, involving the Leucine-Rich Repeats and Immunoglobulin-Like Domains 2 (LRIG2) gene. The mirDIP database lists LRIG2 as the 5th gene among the top-28 “Very High” confidence interactions within the mirDIP database. The encoded LRIG2 protein is known to promote epidermal growth factor signalling leading to increased cell proliferation. Promisingly, this function suggests that miR-676-5p regulation of the LRIG2 gene would suppress proliferation, as seen in the miRNA’s related activity. Our findings suggest that wet laboratory investigations into hsa-miR-675-5p and LRIG2 interactions are warranted.

hsa-miR-16-5p

The hsa-miR-16-5p miRNA is the last among the top-10 predicted pairs. It is another tumor suppressor and recently identified as a promising biomarker or therapeutic target for cholangiocarcinoma through its direct targeting of the yes-associated protein 1 (YAP1) transcriptional regulator62. Within breast cancer tumors, it is a stably-expressed housekeeping miRNA, found to be the most consistently expressed among other housekeeper candidate subtypes63. The miRBase database lists 730 open access papers that mention hsa-mir-16-1, which comprises hsa-miR-16-5p and hsa-miR-16-1-3p.

One pair involving this miRNA appeared among the top-10 predicted pairs, involving the Aly/REF export factor gene (ALYREF) that encodes the THO complex subunit 4 (THOC4) protein. RNACentral lists 4,821 target proteins although the THOC4 protein is not listed among them. The mirDIP classifies this pair among the “High” confidence class. Considering THOC4’s role as a prognostic marker in liver cancer and its detected expression in several cancers, this predicted pair is a likely candidate to play a more extensive role in oncogenic regulatory dynamics and warrants further wet laboratory investigation.

Conclusion

This work demonstrates the successful application of the RP method to miRNA–gene prediction resulting in significantly improved predictive performance over mirDIP (\(p < 0.001\)). We present a pragmatic implementation of RP which relaxes the constraint for a complete CPM by leveraging pre-computed scores. Future work will investigate the impact of CPM (in)completeness as a function of improved predictive performance and score augmentation. We anticipate that the RPmirDIP method might be applied widely to miRNA–gene prediction and yield promising putative interactions which may form the basis of testable hypotheses. We made publicly available the set of the most likely candidates, available at https://doi.org/10.5683/SP2/LD8JKJ,54.