EMCMDA: predicting miRNA-disease associations via efficient matrix completion

Numerous studies have demonstrated the crucial role of microRNAs (miRNAs) in a wide array of essential biological processes. Furthermore, miRNAs have been validated as promising therapeutic targets for complex diseases. Because traditional experimental validation is costly and time-consuming, computational prediction methods are needed. In this work, we developed a novel approach based on efficient matrix completion (EMCMDA) for predicting miRNA-disease associations. First, we calculated similarities from multiple sources for miRNAs and for diseases and combined them into a holistic miRNA/disease similarity measure. Second, we used this biological information to build a heterogeneous network and derived a target matrix from it. Lastly, we framed miRNA-disease association prediction as a low-rank matrix completion problem, which we solved by minimizing the matrix truncated Schatten p-norm. Notably, we improved the conventional singular value contraction algorithm by using a weighted singular value contraction technique. This technique dynamically adjusts the degree of contraction based on the significance of each singular value, ensuring that the physical meaning of these singular values is fully considered. We evaluated EMCMDA with two distinct cross-validation experiments on two databases, and the outcomes were statistically significant. In addition, we carried out comprehensive case studies on two prevalent human diseases, namely lung cancer and breast cancer. Following prediction and multiple validations, EMCMDA proved able to identify previously unreported disease-related miRNAs. These results underscore the robustness and efficacy of EMCMDA in miRNA-disease association prediction.


Materials and methods
The EMCMDA model is structured around three key phases, as illustrated in Fig. 1. First, we calculated similarities from multiple sources for miRNAs and for diseases and combined them into a holistic miRNA/disease similarity measure. Second, we used the preprocessed data to build a heterogeneous network and derived a target matrix from this network. Third, we completed the missing values of the association matrix by minimizing the matrix truncated Schatten p-norm.

Disease semantic similarity
In this research, we integrated two methods for computing disease semantic similarity to improve accuracy. First, we computed the semantic correlation of each disease node with two different methods, which yields Disease Semantic Similarity 1 and Disease Semantic Similarity 2. Subsequently, by integrating these two semantic metrics through weighted averaging, we obtained a comprehensive disease similarity metric. This integrated approach not only enriches the computational process but also improves the accuracy of the disease similarity assessment.
Disease semantic similarity 1
Wang et al. 29 introduced an approach for assessing the semantic similarity of diseases using Medical Subject Headings (MeSH). For a disease $d$, they built a directed acyclic graph denoted $DAG_d$. The graph consists of three parts: the ancestor nodes of $d$, $d$ itself, and the direct edges connecting each parent node to its children.
In $DAG_d$, the semantic contribution of a disease term $t$ to $d$ is defined recursively: $d$ contributes 1 to itself, and every other term $t$ contributes $\phi$ times the largest contribution among its children in $DAG_d$. Here $\phi$ is the semantic contribution factor, which we set to 0.5 following Wang et al. 29. The semantic score of disease $d$ is the sum of the contributions of all terms in $DAG_d$. Building on the premise that diseases whose DAGs overlap more are likely to be more similar, the semantic similarity score between diseases $d_i$ and $d_j$ is the sum of the contributions of their shared terms to both diseases, divided by the sum of their two semantic scores.
Disease semantic similarity 2
Because of the shortcomings of the semantic similarity measure presented by Wang et al. 29, Chen et al. 26 introduced an alternative measure. Specifically, the second semantic contribution score $W_{2d}$ of a term in $DAG_d$ is the negative logarithm of the ratio between the number of DAGs that include the term and the total number of diseases. We used $W_{2d}$ to compute the disease semantic score $S_2$ and the semantic similarity $DS_2$ between $d_i$ and $d_j$ in the same manner as for the first measure.
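The first semantic similarity measure can be illustrated with a minimal Python sketch. It assumes the standard formulation of Wang et al. (contribution 1 for the disease itself, decay by the factor 0.5 per layer, maximum over child paths). The function names and the four-term hierarchy are hypothetical and serve only as an illustration; they are not the authors' implementation or real MeSH data.

```python
from typing import Dict, List

PHI = 0.5  # semantic contribution factor, set to 0.5 in the text


def semantic_contributions(disease: str, parents: Dict[str, List[str]]) -> Dict[str, float]:
    """Contribution of every term in DAG_d (d and all its ancestors) to d."""
    contrib = {disease: 1.0}  # a disease contributes 1 to itself
    stack = [disease]
    while stack:
        term = stack.pop()
        for parent in parents.get(term, []):
            candidate = PHI * contrib[term]
            # Keep the largest value over all child paths reaching this ancestor.
            if candidate > contrib.get(parent, 0.0):
                contrib[parent] = candidate
                stack.append(parent)
    return contrib


def semantic_similarity(d_i: str, d_j: str, parents: Dict[str, List[str]]) -> float:
    """Shared contributions divided by the sum of the two semantic scores."""
    c_i = semantic_contributions(d_i, parents)
    c_j = semantic_contributions(d_j, parents)
    shared = c_i.keys() & c_j.keys()
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))


# Hypothetical toy hierarchy (child -> list of parents), not real MeSH data.
TOY_PARENTS = {"A": ["C"], "B": ["C"], "C": ["R"]}
sim_ab = semantic_similarity("A", "B", TOY_PARENTS)
sim_aa = semantic_similarity("A", "A", TOY_PARENTS)
```

In this toy hierarchy the sibling diseases A and B share the ancestors C and R, so their similarity lies strictly between 0 and 1, while a disease compared with itself scores 1.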

Integrated semantic similarity of disease
Based on these two measures, we applied a weighted average strategy to obtain the integrated disease semantic similarity.

GIPK similarity for miRNA and disease
To enrich the similarity measures, we employed Gaussian kernels to compute the Gaussian interaction profile kernel (GIPK) similarity of miRNAs and diseases. We used the vector $MD(m_i)$ to represent the interaction profile of miRNA $m_i$, that is, its associations with the various diseases. Similarly, the vector $MD(d_i)$ represents the interaction profile of disease $d_i$. Applying a Gaussian kernel to pairs of these profiles gives $MGKS(m_i, m_j)$, the GIPK similarity between miRNAs $m_i$ and $m_j$, and $DGKS(d_i, d_j)$, the GIPK similarity between diseases $d_i$ and $d_j$. Each kernel has an adjustable bandwidth parameter, one for miRNAs and one for diseases, which is computed from the corresponding interaction profiles.
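The GIPK computation can be sketched as follows. The sketch assumes the formulation commonly used in this literature: a Gaussian kernel on the squared distance between interaction profiles, with the bandwidth normalized by the mean squared profile norm and a base bandwidth of 1. The function name `gipk_similarity`, the parameter `gamma_prime` and the toy association matrix are assumptions made for illustration.

```python
import numpy as np


def gipk_similarity(profiles: np.ndarray, gamma_prime: float = 1.0) -> np.ndarray:
    """Gaussian interaction profile kernel; each row is one interaction profile."""
    sq_norms = (profiles ** 2).sum(axis=1)
    # Assumed normalization: base bandwidth divided by the mean squared norm.
    gamma = gamma_prime / sq_norms.mean()
    # Squared Euclidean distances between all pairs of rows.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against round-off negatives
    return np.exp(-gamma * sq_dists)


# Hypothetical toy association matrix: 4 miRNAs (rows) x 3 diseases (columns).
A_MD = np.array([[1, 0, 1],
                 [1, 0, 0],
                 [0, 1, 0],
                 [0, 1, 1]], dtype=float)

MGKS = gipk_similarity(A_MD)    # miRNA-miRNA similarity, shape (4, 4)
DGKS = gipk_similarity(A_MD.T)  # disease-disease similarity, shape (3, 3)
```

Rows of `A_MD` play the role of $MD(m_i)$ and columns the role of $MD(d_i)$, so the disease kernel is obtained simply by transposing the matrix.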

Heterogeneous network construction
To use the available prior knowledge efficiently, we constructed a heterogeneous network. First, we introduced the miRNA similarity matrix MM and the disease similarity matrix DD into the heterogeneous network to improve the overall performance of EMCMDA. Second, we used the association matrix $A_{MD}$ to complete this miRNA-disease heterogeneous network. Finally, we defined the target matrix H from this heterogeneous network.
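One way to assemble such a target matrix is sketched below. The block layout (similarities on the diagonal, associations off the diagonal) is an assumption; it is chosen because it gives the $(nm+nd)\times(nm+nd)$ shape used later in Lemma 1. The function name `build_target_matrix` and the toy inputs are hypothetical.

```python
import numpy as np


def build_target_matrix(MM: np.ndarray, DD: np.ndarray, A_MD: np.ndarray) -> np.ndarray:
    """Stack miRNA similarity, disease similarity and associations into H."""
    nm, nd = A_MD.shape
    if MM.shape != (nm, nm) or DD.shape != (nd, nd):
        raise ValueError("similarity matrices do not match the association matrix")
    # Assumed layout: [[MM, A_MD], [A_MD^T, DD]]
    return np.block([[MM, A_MD],
                     [A_MD.T, DD]])


# Toy inputs: 2 miRNAs, 3 diseases, identity similarities for brevity.
MM_toy = np.eye(2)
DD_toy = np.eye(3)
A_toy = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0]])
H_toy = build_target_matrix(MM_toy, DD_toy, A_toy)
```

With symmetric similarity matrices the resulting matrix is symmetric, and the predicted association scores can later be read back from its upper-right block.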

EMCMDA
The known MDA matrix is inherently sparse and low-rank and contains a substantial amount of redundant information that can be leveraged for data recovery and feature extraction. Nuclear norm minimization is often employed to solve low-rank matrix completion problems. The nuclear norm is defined as the sum of the singular values of a matrix. It is used to enforce the low-rank constraint on the matrix, thereby facilitating dimensionality reduction. Let the target matrix H be a given low-rank or approximately low-rank matrix, and let X be the low-rank matrix we aim to recover. Nuclear norm minimization for X can be stated as follows:

$$\min_{X} \|X\|_* \quad \text{s.t.} \quad P_\Omega(X) = P_\Omega(H),$$

where $\|X\|_* = \sum_{i=1}^{\min(nm,nd)} \sigma_i(X)$ denotes the nuclear norm of X. Because miRNA and disease datasets may contain a substantial amount of noisy data, MDA prediction models must tolerate potential noise. A noise-tolerant matrix completion model is therefore used:

$$\min_{X} \|X\|_* \quad \text{s.t.} \quad \|P_\Omega(X) - P_\Omega(H)\|_F \le \varepsilon,$$

where $\varepsilon \ge 0$ is the noise parameter, $\Omega$ denotes the set of all known association index pairs $(i, j)$ in H, and $P_\Omega$ is the projection operator onto $\Omega$.
Although nuclear norm minimization is a viable method for predicting MDAs, it has certain limitations. The magnitude of a singular value reflects how much information of the matrix it carries: larger singular values carry the main information, whereas smaller ones capture minor variations or noise. The standard nuclear norm treats every singular value identically, which greatly limits its ability to handle practical problems. Therefore, we proposed matrix truncated Schatten p-norm minimization for MDA prediction. The truncated Schatten p-norm treats singular values differently: it leaves the r largest singular values unpenalized and sums the p-th power of the remaining ones. Mathematically, it can be expressed as $\|X\|_r^p = \sum_{i=r+1}^{nm+nd} \sigma_i^p(X)$. This takes the physical significance of the singular values into account and yields a superior solution. Consequently, the truncated Schatten p-norm approximates the rank more closely than other rank relaxation norms.
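The difference between the two norms can be seen in a short numerical sketch that follows the definition above (sum of the p-th powers of all singular values after the r largest). The function names and the diagonal test matrix are illustrative assumptions.

```python
import numpy as np


def nuclear_norm(X: np.ndarray) -> float:
    """Sum of all singular values."""
    return float(np.linalg.svd(X, compute_uv=False).sum())


def truncated_schatten_p(X: np.ndarray, r: int, p: float) -> float:
    """Sum of sigma_i^p over i > r; the r largest singular values are skipped."""
    sigma = np.linalg.svd(X, compute_uv=False)  # returned in descending order
    return float((sigma[r:] ** p).sum())


# Toy matrix whose singular values are 5, 3, 1 and 0.5.
X_toy = np.diag([5.0, 3.0, 1.0, 0.5])
nn = nuclear_norm(X_toy)                         # penalizes all four values
tsp_1 = truncated_schatten_p(X_toy, r=2, p=1.0)  # penalizes only 1 and 0.5
tsp_half = truncated_schatten_p(X_toy, r=2, p=0.5)
```

The two dominant singular values, which carry most of the structure, do not contribute to the truncated norm, so minimizing it shrinks only the small, noise-like components.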
Next, we introduce an important lemma on the truncated Schatten p-norm to facilitate the solution.
Lemma 1 (see 30 and 31) Consider a matrix $X \in \mathbb{R}^{(nm+nd)\times(nm+nd)}$ of rank $s$ ($s \le nm + nd$) with singular value decomposition $X = U\Sigma V^{\top}$, where $U, V \in \mathbb{R}^{(nm+nd)\times(nm+nd)}$. When $A \in \mathbb{R}^{r\times(nm+nd)}$, $B \in \mathbb{R}^{r\times(nm+nd)}$ and $0 < p \le 1$, the associated optimization problem has an optimal solution.
Thanks to Lemma 1, we enhanced the initial nuclear norm minimization model [Eq. (17)] and developed a new model, Eq. (20). Equation (20) is non-convex and provides a more accurate approximation than the convex nuclear norm. However, solving it is challenging because conventional methods cannot handle this non-convexity. For this reason, we first transformed the model [Eq. (20)].
We wrote the objective of Eq. (20) as a function $Q(\sigma(X))$ of the singular values of X. Subsequently, we computed the derivative of $Q$ with respect to $\sigma(X)$.
Then, we obtained the first-order Taylor expansion of $Q(\sigma(X))$ and collected its coefficients into a weight sequence $W := \{\omega_i\}_{i=1}^{nm+nd}$. After processing, we acquired a solvable convex optimization model. However, solving models with inequality constraints presents numerous challenges, so it is common to replace the constrained model with a regularized counterpart. Soft regularization not only accommodates unforeseen noise but also makes the problem considerably more efficient to solve. Furthermore, we constrained all matrix values to the range [0, 1] to ensure their practical significance 32,33. The resulting model is Eq. (24), in which $\alpha$ is an equilibrium coefficient and $0 \le X_{i,j} \le 1$ (for $1 \le i, j \le nm + nd$) requires every element of X to lie in [0, 1].
We formulated a framework based on the alternating direction method of multipliers (ADMM) 34 to solve the optimization problem, as described below.
Step 1: Initialize $X_1 = H$ and, for the (l+1)-th iteration, compute the singular value decomposition $X_l = U_l \Sigma_l V_l^{\top}$. Next, determine $A_l$ and $B_l$ from $U_l$ and $V_l$. Experimental validation shows that $l \in [1, 4]$ gives the optimal result.
Step 2: Calculate the k-th iteration of $W = \{\omega_i\}_{i=1}^{nm+nd}$. Following this, the ADMM-based framework is employed to solve Eq. (24). Experimental validation shows that k = 1 produces the best result.
To facilitate the computation, we introduce an auxiliary matrix T and rewrite the model as Eq. (25), where $AA^{\top} = I_{r\times r}$, $BB^{\top} = I_{r\times r}$, and $0 < p \le 1$. The augmented Lagrangian of Eq. (25) is given by Eq. (26), where E denotes the Lagrange multiplier and $\beta$ denotes the penalty parameter. Minimizing Eq. (26) is an iterative process: in the k-th iteration, $T_{k+1}$, $X_{k+1}$, and $E_{k+1}$ are computed in sequence. The detailed procedure is as follows.
Update $T_{k+1}$: Fix $X_k$ and $E_k$ and update $T_{k+1}$ by minimizing the function $\ell(T, X, E, \alpha, \beta)$.
The optimal solution $T_{k+1}$ of Eq. (27) is attained exactly when the derivative of Eq. (27) equals 0. Here $P_\Omega^*$ represents the adjoint operator of $P_\Omega$, which satisfies $P_\Omega^* P_\Omega = P_\Omega$, and $I$ denotes the identity operator. Based on reference 35, it is known that $\left(I + \frac{\alpha}{\beta} P_\Omega^* P_\Omega\right)^{-1} = I - \frac{\alpha}{\alpha+\beta} P_\Omega^* P_\Omega$. To ensure that the predictions are meaningful, we restrict the elements of $T_{k+1}$ to the range [0, 1].
Update $X_{k+1}$: Fix $T_{k+1}$ and $E_k$ and update $X_{k+1}$ by minimizing the function $\ell(T, X, E, \alpha, \beta)$.
The solution is given by the weighted singular value contraction operator with weight sequence $W = \{\omega_i\}_{i=1}^{nm+nd}$ (refer to 36).
Update $E_{k+1}$: Fix $T_{k+1}$ and $X_{k+1}$ and update $E_{k+1}$. Here, the values of $\varepsilon_1$ and $\varepsilon_2$ follow the paper by Yang et al. 37.
We extracted the completed MDA matrix $A^*_{MD}$ from the completed adjacency matrix $H^*$. Specifically, we replaced all the unrecorded values in $A^*_{MD}$ with predicted scores within the [0, 1] range, which indicate the probability of potential MDAs. The solution procedure is summarized in Algorithm 1.
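The weighted contraction step can be illustrated with a small sketch. It assumes the usual closed form in which each singular value is reduced by its own weight times a threshold and clipped at zero. The weight rule (zero for the r largest singular values, $p\,\sigma_i^{p-1}$ for the rest), the threshold `tau` and all function names are assumptions made for illustration and may differ from the authors' exact update.

```python
import numpy as np


def tsp_weights(sigma: np.ndarray, r: int, p: float, eps: float = 1e-8) -> np.ndarray:
    """Assumed weights: 0 for the r largest values, p * sigma^(p-1) otherwise."""
    w = p * (sigma + eps) ** (p - 1.0)  # eps avoids division by zero when p < 1
    w[:r] = 0.0                         # leading singular values are not shrunk
    return w


def weighted_svt(Y: np.ndarray, weights: np.ndarray, tau: float) -> np.ndarray:
    """Shrink each singular value of Y by tau * weight and clip at zero."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - tau * weights, 0.0)
    return (U * s_shrunk) @ Vt


# Toy example: singular values 5, 3, 1, 0.5 with r = 2 and p = 1.
Y_toy = np.diag([5.0, 3.0, 1.0, 0.5])
sigma_toy = np.linalg.svd(Y_toy, compute_uv=False)
w_toy = tsp_weights(sigma_toy, r=2, p=1.0)
X_new = weighted_svt(Y_toy, w_toy, tau=0.8)
```

In this toy case the two dominant singular values are left untouched, the third drops from 1 to 0.2, and the fourth is removed entirely, which is the behaviour the weighted technique is meant to provide compared with uniform shrinkage.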

Performance evaluation
In this study, the predictive capability of EMCMDA was assessed through Global LOOCV and 5-fold CV on the benchmark dataset. To assess the proposed model, we compared its predictions with those generated by HGCLAMIR 16, BNNRMDA 24, WBNPMD 19, KATZBNRA 18, PMFMDA 25, and IMCMDA 26.

Global LOOCV
To make the most of the existing biological data, we applied Global LOOCV to the benchmark dataset. In Global LOOCV, we treated each of the 5430 known MDAs in turn as the test sample, while the remaining known associations were used for training. All unknown miRNA-disease pairs served as the candidate set.
After EMCMDA computed all relevant prediction scores, we ranked the scores of the test and candidate samples in descending order. Finally, we varied the threshold to compute the AUC. As depicted in Fig. 2a, EMCMDA achieved the highest AUC (0.9640), outperforming the other methods in the comparison.
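The ranking-based AUC can be computed without sweeping thresholds explicitly, because the area under the ROC curve equals the probability that a test (positive) sample is scored above a candidate (unlabeled) sample, with ties counted as one half. The sketch below uses this equivalence; the function name and the scores are made up for illustration and are not results from the paper.

```python
from typing import Sequence


def auc_from_scores(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up prediction scores: held-out known associations vs. candidate pairs.
test_scores = [0.9, 0.8, 0.4]
candidate_scores = [0.7, 0.3, 0.2, 0.1]
toy_auc = auc_from_scores(test_scores, candidate_scores)
```

Here 11 of the 12 test/candidate pairs are ordered correctly, so the toy AUC is 11/12. The double loop is quadratic and is meant only to make the definition explicit; a sort-based rank computation would be preferable for the full dataset.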

5-fold CV
The 5-fold CV was implemented to further validate EMCMDA's prediction performance. In 5-fold CV, all known MDAs were split into five equal-sized subsets. For each fold, one subset was designated as the test set, and the other four subsets were used for training. We applied the same procedure to the comparison models. As with Global LOOCV, we used AUC values to compare the models' performance. As depicted in Fig. 2b, EMCMDA obtained the highest AUC (0.9615). This also demonstrates the superior ability of our model to predict potential MDAs.
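The fold construction described above can be sketched as follows, assuming the known associations are stored as (miRNA index, disease index) pairs. The function name, the fixed random seed and the toy pairs are illustrative assumptions. In a full pipeline, the test pairs of each fold would be set to zero in the association matrix before the similarities and the target matrix are recomputed.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Pair = Tuple[int, int]


def five_fold_splits(known_pairs: Sequence[Pair], k: int = 5,
                     seed: int = 0) -> Iterator[Tuple[List[Pair], List[Pair]]]:
    """Yield (train, test) lists of known associations for each of k folds."""
    pairs = list(known_pairs)
    random.Random(seed).shuffle(pairs)       # reproducible shuffle
    folds = [pairs[i::k] for i in range(k)]  # k near-equal subsets
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


# Toy list of ten known (miRNA, disease) index pairs.
toy_pairs = [(i, i % 3) for i in range(10)]
toy_splits = list(five_fold_splits(toy_pairs))
```

Every known association appears in exactly one test fold, so each is predicted once while hidden from training.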

Parametric sensitivity analysis
A sensitivity analysis of the important parameters of the model was performed to ensure that EMCMDA achieved better prediction. We focused on the following parameters: the equilibrium coefficient α, the penalty parameter β, the power of the singular values p, and the truncation position r of the target matrix rank. We implemented 5-fold CV on the benchmark dataset to determine the optimal parameters of EMCMDA. The results are depicted in Fig. 3.
The AUC value was used as the indicator for evaluating the parameters. We first optimized the values of α and β and subsequently held them constant while determining the optimal values of p and r. As illustrated in Fig. 3, the model achieved the highest AUC (0.9612) when α=20, β=5, p=1 and r=5. We therefore set α=20, β=5, p=1 and r=5.

Experimental results on HMDD v3.0
To assess EMCMDA's applicability to different datasets, we conducted Global LOOCV and 5-fold CV on the HMDD v3.0 database 38. We acquired 1062 miRNAs, 893 diseases and 35362 known MDAs from the HMDD v3.0 database. In this context, we set the parameters α=2, β=2, p=1 and r=3. Table 1 lists the AUC scores for both the HMDD v2.0 and HMDD v3.0 datasets. In Global LOOCV, EMCMDA achieves AUC scores of 0.9640 for HMDD v2.0 and 0.9725 for HMDD v3.0. In 5-fold CV, EMCMDA achieves AUC scores of 0.9615 for HMDD v2.0 and 0.9706 for HMDD v3.0. The table shows that EMCMDA continues to perform well on the newly collected dataset, reaffirming its robustness and effectiveness in diverse data settings.

Ablation experiment
To verify the importance of GIPK similarity, we built a variant of EMCMDA that does not use GIPK similarity (EMCMDA-W). Using 5-fold CV on the benchmark dataset, we compared the performance of the two models with the AUC and AUPR metrics. As shown in Table 2, EMCMDA attains an AUC of 0.9615 and an AUPR of 0.3279, while EMCMDA-W achieves an AUC of 0.9036 and an AUPR of 0.2095. EMCMDA scores higher than EMCMDA-W on both metrics. Therefore, GIPK similarity plays a substantial role in enhancing the predictive power of EMCMDA.

Sensitivity analysis with known number of associations
To examine the impact of the number of known associations on the model's performance, we randomly selected 10% and 50% of the original 5430 known associations to construct new association matrices. We executed Global LOOCV and 5-fold CV to assess EMCMDA on the benchmark dataset. The results are depicted in Fig. 4. In Global LOOCV, EMCMDA achieves AUC scores of 0.8760, 0.9470, and 0.9640 with 10%, 50%, and 100% of the original known associations, respectively. In 5-fold CV, EMCMDA achieves AUC scores of 0.8668, 0.9315, and 0.9615, respectively. Figure 4 shows that the AUC of EMCMDA increases as the number of known associations grows. Therefore, the predictive capability of EMCMDA is positively correlated with the number of known associations.

Hypothesis testing
We employed hypothesis testing to analyze the difference in predictive capability between EMCMDA and previously proposed models. Our null hypothesis was that the Global LOOCV and 5-fold CV results of EMCMDA and the comparison models were equivalent. We then conducted t-tests separately on the two CV results for EMCMDA and the comparison models. The resulting p-values are presented in Table 3. Significant differences between EMCMDA and the comparison methods (BNNRMDA, WBNPMD, KATZBNRA, PMFMDA and IMCMDA) can be observed. Given that the p-values between our method and the compared models are substantially less than 0.05, we conclude that EMCMDA differs significantly from, and outperforms, the comparison models.

Performance evaluation of multiple metrics
To adequately assess EMCMDA's reliability, we conducted 10-fold CV on the HMDD v2.0 and HMDD v3.0 datasets. As depicted in Fig. 5, EMCMDA obtained AUC values of 0.9635 and 0.9715 on the respective datasets, underscoring its reliability in MDA prediction. Additionally, we introduced five supplementary metrics to assess EMCMDA's performance comprehensively. To balance positive and negative samples, we randomly selected negative samples from the unknown MDAs at a 1:1 ratio to the positive samples. Subsequently, these metrics were computed at three thresholds that optimize Accuracy, F1 score, and MCC, respectively. Table 4 shows that EMCMDA achieved an Accuracy of 0.9341, a Precision of 0.8229, a Recall of 0.8155, an F1 score of 0.7961, and an MCC of 0.7576, affirming that EMCMDA is an excellent MDA prediction model.
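For reference, the five metrics can be computed from the confusion-matrix counts obtained at a given threshold using their standard definitions. The function name and the counts below are made up for illustration and do not reproduce Table 4.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, Precision, Recall, F1 score and MCC from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Made-up counts for a balanced sample of 50 positives and 50 negatives.
toy_metrics = classification_metrics(tp=40, fp=10, tn=40, fn=10)
```

Because each threshold yields its own confusion matrix, metric values reported at different thresholds need not satisfy the identities that hold within a single matrix, such as F1 being the harmonic mean of Precision and Recall.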

Case studies
We tested two common human diseases (lung tumors and breast tumors) to demonstrate the practical applicability of EMCMDA. The EMCMDA model was trained using data from the HMDD v2.0 database. For both lung and breast tumors, we set the known disease-associated miRNAs as unknown associations, effectively treating them as novel diseases. For each disease under investigation, candidate miRNAs were sorted according to their predicted association scores. The top 50 candidates were subsequently verified using two other well-established MDA datasets, namely dbDEMC 39 and miR2Cancer 40. In all case studies, a large number of the predicted disease-associated miRNAs were confirmed by experimental evidence, underscoring the reliability of EMCMDA's predictions.
Lung tumors are widely recognized as one of the deadliest and most challenging cancers to treat because they tend to spread or metastasize early in their development. The lungs are also particularly vulnerable to metastasis from tumors in other parts of the body 41. Recent biological experiments have provided strong evidence of miRNAs related to lung tumors. For example, miR-718 has been shown to hinder the progression of non-small cell lung cancer (NSCLC) by targeting CCNB1 mRNA 42. Moreover, a notable increase in miR-522 expression was observed in human NSCLC tissues, and inhibiting miR-522 has been shown to restrain NSCLC cell proliferation and induce apoptosis 43. In addition, the introduction of exogenous miR-202 has been demonstrated to reduce NSCLC cell viability, migration, and invasion 44. Notably, 46 of the top 50 predicted miRNAs linked to lung tumors were validated in either the dbDEMC or miR2Cancer datasets (see Table 5).
Breast tumors are among the most common cancers affecting women. However, the rates of cure and prognosis can be significantly improved through early detection, regular screening, and timely treatment 45. An increasing number of biological experiments have confirmed the role of miRNAs in breast tumors. For example, miR-132 plays a crucial role in restraining the proliferation, invasion, migration, and metastasis of breast cancer through direct inhibition of HN1 46. Additionally, miR-34a suppresses the proliferation of breast cancer by specifically targeting LMTK3 and holds promise as an anti-ER (estrogen receptor) agent in breast cancer therapy 47. Moreover, upregulation of miR-101 effectively suppresses the development of breast cancer cells 48. Notably, all of the top 50 predicted miRNAs linked to breast tumors were confirmed in either the dbDEMC or miR2Cancer datasets (refer to Table 6).
Furthermore, we acquired miRNA-seq data associated with lung and breast cancers, which enabled a comparative analysis of the differential expression patterns of the top 10 miRNAs predicted by EMCMDA for these diseases. Notably, EMCMDA's predictions for these miRNAs were supported by the expression changes observed in the corresponding disease contexts. This supplementary evidence further validates the efficacy of our model. Figure 6 shows the detailed outcomes of the differential expression analysis.
Table 5. We predicted the top 50 miRNAs for lung tumors (i and ii refer to dbDEMC and miRCancer, respectively).

Discussion and conclusion
As our comprehension of the fundamental biological mechanisms underlying various diseases continues to grow, the implications of MDA prediction are poised to be both extensive and profound. This endeavor is expected not only to significantly enhance our ability to detect diseases in their early stages but also to advance our strategies for addressing complex diseases. In the last few years, more and more computational models have been developed. HGCLAMIR 16 combines the view-aware attention mechanisms of hypergraph contrastive learning with multi-view representation techniques to predict MDAs. Its advantage lies in proposing a multi-view representation integration approach that enriches the embedded representation information. However, it lacks interpretability. BNNRMDA 24 employs bounded nuclear norm regularization to predict potential MDAs. Its innovation lies in constraining the predicted values to the interval from 0 to 1, which ensures the interpretability of the predictions. Nonetheless, the model's solution is suboptimal. PMFMDA 25 uses probabilistic matrix factorization to predict unknown MDAs. However, it relies on a single similarity measure and its solution is suboptimal. Current MDA prediction models fail to capture the miRNA/disease similarities sufficiently. While matrix completion is effective for association prediction, existing models fall short of delivering optimal solutions.
To address these challenges, we introduced the EMCMDA model, which recovers missing MDAs by minimizing the matrix truncated Schatten p-norm. The key contributions of the EMCMDA model are outlined below: (i) We calculated similarities from multiple sources for miRNAs and for diseases and combined them into a holistic miRNA/disease similarity measure. This enriches the similarity types, reduces the bias caused by a single similarity, and improves the accuracy of the miRNA/disease similarities. (ii) We completed the predicted values of the unknown MDAs by minimizing the matrix truncated Schatten p-norm. This norm approximates the rank more accurately than other rank relaxation norms and therefore yields more accurate solutions. (iii) We improved the conventional singular value contraction algorithm by using a weighted singular value contraction technique. This technique dynamically adjusts the degree of contraction according to the significance of each singular value, ensuring that the physical meaning of these singular values is fully considered.
We conducted Global LOOCV and 5-fold CV on the benchmark dataset, and EMCMDA consistently achieved the highest AUC values among all compared methods. When applied to the HMDD v3.0 dataset, EMCMDA yielded AUCs of 0.9756 and 0.9706 for Global LOOCV and 5-fold CV, respectively. These results demonstrate the robust generalization capability of EMCMDA across different datasets. To further illustrate the practical utility of EMCMDA, we conducted two case studies that highlight its effectiveness in real-world applications.
Table 6. We predicted the top 50 miRNAs for breast tumors (i and ii refer to dbDEMC and miRCancer, respectively).
While EMCMDA demonstrates strong predictive performance, it has certain limitations. First, the model's parameters may not always be optimal, which can affect prediction accuracy. Second, the weighted average strategy used to merge multi-source data on miRNAs and diseases may not be the best fusion method. Third, the available association information remains limited, which constrains the predictive capacity of the model. Lastly, although our model can predict potential MDAs, it cannot pinpoint the specific mechanisms through which miRNAs contribute to disease onset. The study of gene/protein signaling networks using ODE-based theoretical models is not only crucial for identifying potential therapeutic targets for diseases but also helps to explore the mechanisms of gene/protein signaling networks in disease treatment 49,50. Therefore, we can achieve a more comprehensive prediction by integrating the miRNA

Figure 1. The framework of EMCMDA. Step 1: computing and integrating multi-source miRNA/disease similarities to obtain a comprehensive miRNA/disease similarity; Step 2: building a heterogeneous network and creating a target matrix derived from that network; Step 3: minimizing the matrix truncated Schatten p-norm with a weighted singular value contraction algorithm to yield the predicted score matrix.
https://doi.org/10.1038/s41598-024-63582-y

Figure 2. Global LOOCV and 5-fold CV were employed on the benchmark dataset to compare the predictive capabilities of various models.

Figure 3. The results of the parametric sensitivity analysis.

Figure 6. Outcomes of the expression analysis for miRNAs.

Table 1. Performance comparison of EMCMDA using AUC values on two datasets.

Table 2. The result of the ablation experiment.

Table 3. P-values derived from hypothesis testing between EMCMDA and other comparative methods.

Table 4. Five additional metrics were incorporated to validate EMCMDA's efficacy. Notes: $T_1$, $T_2$, and $T_3$ denote the threshold values that maximize the accuracy, F1 score, and MCC, respectively.