Novel miRNA signature for predicting the stage of hepatocellular carcinoma

Hepatocellular carcinoma (HCC) is one of the leading causes of cancer deaths worldwide. Recently, microRNAs (miRNAs) have been reported to be altered in various cancers and to act as potential biomarkers. However, few miRNA biomarkers for predicting the stage of HCC have been discovered. Hence, we sought to identify a novel miRNA signature associated with cancer stage in HCC. We proposed a support vector machine (SVM)-based cancer stage prediction method, SVM-HCC, which uses an inheritable bi-objective combinatorial genetic algorithm to select a minimal set of miRNA biomarkers while maximizing the accuracy of predicting the early and advanced stages of HCC. SVM-HCC identified a 23-miRNA signature that is associated with cancer stages in patients with HCC and achieved a 10-fold cross-validation accuracy, sensitivity, specificity, Matthews correlation coefficient, and area under the receiver operating characteristic curve (AUC) of 92.59%, 0.98, 0.74, 0.80, and 0.86, respectively, and a test accuracy and test AUC of 74.28% and 0.73, respectively. We prioritized the miRNAs in the signature based on their contributions to predictive performance, and validated the prognostic power of the prioritized miRNAs using Kaplan–Meier survival curves. The results showed that seven miRNAs were significantly associated with prognosis in HCC patients. Correlation analysis of the miRNA signature and its co-expressed miRNAs revealed that hsa-let-7i and its 13 co-expressed miRNAs are significantly involved in the hepatitis B pathway. In clinical practice, a prediction model using the identified 23-miRNA signature could be valuable for early-stage detection, and could also help to develop miRNA-based therapeutic strategies for HCC.

Scientific Reports | (2020) 10:14452 | https://doi.org/10.1038/s41598-020-71324-z | www.nature.com/scientificreports/

profiling of 89 HCC patients, followed by unsupervised hierarchical clustering, to categorize HCC into three subclasses [20]. Machine learning models have been used to predict the treatment response of trans-arterial chemoembolization in patients with HCC [21]. The random forest method and multiple urine DNA biomarkers have been used for HCC screening [22]. A. Nagy et al. identified 223 miRNAs as prognostic biomarkers based on previous literature and validated their prognostic power using independent datasets, in which 55 individual miRNAs were significantly associated with the overall survival of HCC [23]. Previously developed methods and studies have mainly focused on identifying differentially expressed genes and survival variants in HCC. Early-stage detection and diagnosis of cancer remains a challenge for clinicians. MiRNAs are considered potential tumor markers due to their tissue specificity and capability to predict clinicopathological parameters [24]. Several studies have demonstrated that miRNAs have the potential to be new biomarkers for early detection in various cancers [25][26][27][28]. Moreover, miRNAs are detectable not only in tissue samples but also in a wide range of biological samples, such as urine, blood plasma, and serum. However, few studies have attempted to predict the stage of HCC using genomic profiling. Therefore, this study aims to identify a miRNA signature consisting of a small set of miRNA biomarkers that can predict the cancer stage of patients with HCC, so that this miRNA signature can be useful for developing gene-based target therapies in HCC.
In this study, we proposed a method for predicting the early and advanced stages of HCC using miRNA expression profiles. We retrieved expression profiles of 540 miRNAs from 348 HCC patients (a 348 × 540 matrix) from The Cancer Genome Atlas (TCGA) database. Our dataset includes 258 patients with early-stage disease and 90 patients with advanced HCC. We utilized a support vector machine (SVM)-based classifier [29], SVM-HCC, which incorporates an inheritable bi-objective combinatorial genetic algorithm (IBCGA) [30], to identify a miRNA signature capable of distinguishing early-stage from advanced-stage HCC. Although the optimization technique of SVM-HCC was adopted from our previous study [31], the identified miRNA signature is novel in HCC stage prediction. The main purpose of this study is to identify a miRNA signature associated with cancer stage of patients with HCC. We ranked the miRNAs in the signature based on their contributions to predictive performance, and subjected the 10 top-ranked miRNAs to further analysis. Next, to investigate the prognostic power of the identified miRNA signature among the patients with HCC, Kaplan-Meier (KM) survival analysis was performed. The expression difference of the 10 top-ranked miRNAs was compared between cancer and normal samples. The biological significance of the identified miRNA signature was analyzed using Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway and Gene Ontology (GO) annotations. Finally, we identified miRNAs co-expressed with the miRNA signature to provide more information on its overall impact on HCC.

Results and discussion
The proposed method, SVM-HCC, classified patients with HCC into early-stage and advanced-stage groups based on their miRNA expression profiles. We used a dataset containing expression profiles of 540 miRNAs from 348 HCC patients, of whom 258 had early-stage and 90 had advanced-stage HCC. SVM-HCC used a feature selection algorithm (IBCGA) to select a significant miRNA signature associated with early and advanced stages of HCC. The system flowchart of the overall process is depicted in Fig. 1.
We compared SVM-HCC with standard machine learning methods including sequential minimal optimization (SMO), multilayer perceptron (MLP), naïve Bayes, LibSVM, and random forest. For the feature selection, we used the Ranker search and correlation attribute evaluation method of Waikato Environment for Knowledge Analysis (Weka) to select 30 features to classify early-stage and advanced-stage HCC patients. SVM-HCC performed well relative to these machine learning methods in terms of training accuracy. Using the training set (n = 348), SVM-HCC achieved a mean training accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC) of 89.56 ± 1.27%, 0.94 ± 0.01, 0.73 ± 0.03, and 0.71 ± 0.03, respectively. SVM-HCC achieved a best training accuracy, sensitivity, specificity, MCC, and area under the receiver operating characteristic curve (AUC) of 92.24%, 0.96, 0.81, 0.79, and 0.90, respectively. The comparison results are shown in Table 1. Next, to observe how the prediction performance of the standard machine learning methods varies with feature set size, we used the greedy stepwise search and Cfs Subset Evaluator attribute evaluation method of Weka to select 19 features to classify early and advanced stages of patients with HCC. The results showed only a slight difference in prediction accuracy for the SMO, MLP, naïve Bayes, LibSVM, and random forest methods when compared with Table 1. The SMO, MLP, naïve Bayes, LibSVM, and random forest methods showed accuracy differences of 1.15%, 1.43%, 0.86%, 2.87%, and 1.15%, respectively. However, no large difference was observed in the AUCs among these methods, as shown in Supplementary Table S1.

The comparison of AUCs.
Further, statistical analysis was performed to compare the prediction ability of SVM-HCC with some machine learning methods using the AUC comparison method proposed by Hanley and McNeil [32].
This analysis provides a statistical test comparing the AUC of SVM-HCC with the AUCs of the other machine learning methods. When compared with the AUC of SVM-HCC (AUC = 0.9), SMO obtained a standard error (SE), AUC difference (AUCd), Z value, and p value of 0.03, 0.40, 10.29, and < 0.001, respectively; MLP obtained an SE, AUCd, Z value, and p value of 0.033, 0.3, 8.09, and < 0.001, respectively; naïve Bayes obtained an SE, AUCd, Z value, and p value of 0.029, 0.19, 5.71, and < 0.001, respectively; LibSVM obtained an SE, AUCd, Z value, and p value of 0.034, 0.36, 9.34, and < 0.001, respectively; and random forest obtained an SE, AUCd, Z value, and p value of 0.028, 0.18, 5.48, and < 0.001, respectively. The statistical analysis shows that the SVM-HCC method is significantly (p < 0.001) different from, and performed better than, the other standard machine learning methods, as shown in Supplementary Table S2.
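As an illustration of this kind of AUC comparison, the following is a minimal sketch of the standard error formula of Hanley and McNeil and the resulting Z test. The function names, the group sizes passed in the usage lines, and the correlation term `r` (set to 0 here, i.e. the two AUCs are treated as independent) are assumptions made for illustration; the sketch is not expected to reproduce the values reported in Supplementary Table S2.

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC (Hanley and McNeil approximation)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

def auc_z_test(auc1, auc2, se1, se2, r=0.0):
    """Z statistic and two-sided p value for the difference of two AUCs.

    r is the correlation between the two AUC estimates; r = 0 is an
    assumption that treats them as independent.
    """
    z = (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * r * se1 * se2)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p value
    return z, p

# Illustrative usage with hypothetical inputs (90 vs. 258 samples per class).
se_a = hanley_mcneil_se(0.90, n_pos=90, n_neg=258)
se_b = hanley_mcneil_se(0.50, n_pos=90, n_neg=258)
z, p = auc_z_test(0.90, 0.50, se_a, se_b)
print(f"SE_a={se_a:.4f} SE_b={se_b:.4f} Z={z:.2f} p={p:.3g}")
```

When both models are scored on the same patients, the two AUC estimates are correlated, and a non-zero `r` would normally be estimated from the data rather than fixed at 0.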
We performed 30 independent runs of SVM-HCC to select a robust miRNA signature using the appearance score [33], which was calculated based on the frequency of each feature over the independent runs. The most robust miRNA signature had an appearance score of 6.17. SVM-HCC identified a 23-miRNA signature associated with the early and advanced stages of HCC, and achieved a tenfold cross-validation (10-CV) accuracy, sensitivity, specificity, MCC, and AUC of 92.59%, 0.98, 0.74, 0.80, and 0.86, respectively, and a test accuracy and test AUC of 74.28% and 0.73, respectively. The predictive performance of SVM-HCC was evaluated using a receiver operating characteristic (ROC) curve, and is shown in Fig. 2. Additionally, to investigate the effect of clinical characteristics on the prediction performance, we added some of the clinical characteristics of patients with HCC, such as gender, risk factors, race, hepatitis serology, and vital status, to the miRNA signature for the stage prediction. However, the addition of these clinical features did not improve the prediction performance of SVM-HCC.

Expression difference of top ranked miRNAs in tumor vs normal.
We then compared the expression levels of the 10 top-ranked miRNAs in tumor and normal samples using the UALCAN web portal [36], and observed a significant difference in miRNA expression levels between the two groups. Of the 10 top-ranked miRNAs, eight (hsa-miR-550a, hsa-miR-518b, hsa-miR-512, hsa-miR-574, hsa-miR-424, hsa-miR-4286, hsa-let-7i, and hsa-miR-320a) are significantly differentially expressed between tumor and normal samples; a p value < 0.05 was used as the threshold for statistical significance. Of the remaining two top-ranked miRNAs (hsa-miR-549 and hsa-miR-1179), the role of hsa-miR-549 has not previously been reported in HCC, and hsa-miR-1179 was not significantly differentially expressed between the tumor and adjacent normal tissues of the TCGA cohort. However, hsa-miR-549 contributed strongly to predicting the stage of HCC (MED rank 2) and plays a significant role in other cancers [37][38][39]. Although the expression of hsa-miR-1179 did not differ significantly between the tumor and adjacent normal tissues of the TCGA cohort, a quantitative real-time polymerase chain reaction study on 40 HCC samples reported that hsa-miR-1179 was significantly differentially expressed between HCC and matched normal tissues, and plays an important role in HCC progression and metastasis [40]. The expression levels of the 10 top-ranked miRNAs in the tumor and normal samples are listed in Supplementary Table S3. A box plot of the relative expression difference of the top-ranked miRNAs in tumor and normal samples is given in Supplementary Fig. S2. The individual data points of the expression analysis can be accessed from the UALCAN web portal.
Further, we attempted to distinguish the tumor and normal samples using the identified miRNA signature. We used a dataset consisting of 32 normal samples and 32 randomly selected tumor samples, and applied LibSVM in Weka to distinguish the two groups. LibSVM achieved a leave-one-out accuracy of 100% in distinguishing tumor and normal samples using the 23-miRNA signature.

Significance of top-ranked miRNAs in cancer.
Nine of the ten top-ranked miRNAs are involved in HCC and various other cancers; we summarize their functions in HCC, based on reports in the experimentally validated literature, in Supplementary Table S4.
Hsa-miR-549, the second-ranked miRNA, is differentially expressed in cancer cells relative to normal cells. For example, hsa-miR-549 is highly expressed in colon cancer [37], colorectal cancer [38], and breast cancers [39], with log-fold changes of 0.51, 1.75, and 0.66, respectively, relative to normal cells. However, the role of hsa-miR-549 in HCC has not been reported previously. Our results suggest that hsa-miR-549 is significantly associated with overall survival in HCC patients, and that it actively participates in other major cancers. Hence, it is a worthy subject of further investigation.
We constructed a miRNA target interaction network using Cytoscape [51] to investigate regulatory interactions compiled in the miRTarBase database. The top-ranked miRNAs were annotated with miRBase accession numbers, and the number of predicted miRNA interactions obtained from miRTarBase was 2,274. The predicted miRNA target interaction network is shown in Supplementary Fig. S3.
KEGG pathway and gene ontology enrichment analysis.
Next, we investigated the biological significance of the top-ranked miRNAs using KEGG pathway and GO annotation analysis. First, we used the DIANA-miRPath web tool [52] to examine their functional annotations. Fisher's exact test was used for the enrichment analysis. The 10 top-ranked miRNAs are involved in several pathways, the most significant of which are fatty acid metabolism, fatty acid biosynthesis, fatty acid elongation, endocytosis, fatty acid degradation, pathways in cancer, lysine degradation, viral carcinogenesis, glioma, and the Hippo signaling pathway. The top-ranked miRNAs, along with the numbers of predicted target genes in each pathway, are listed in Table 3. The heatmap of the 10 top-ranked miRNAs enriched in KEGG pathways is shown in Fig. 4(A) and the number of target genes involved in pathways is shown in Fig. 4(B). The 23-miRNA signature enriched in KEGG pathways is shown in Supplementary Fig. S4.
Second, we analyzed the involvement of the top-ranked miRNAs in biological pathways, molecular functions, and cellular components using GO annotations. We found that these miRNAs are significantly involved in biological pathways including the mitotic cell cycle, blood coagulation, cellular protein metabolic process, membrane organization, epidermal growth factor receptor signaling pathway, and cell death, with p values < 1.11E-16. They are also involved in molecular functions including protein binding transcription factor activity, nucleic acid binding transcription factor activity, ion binding, and RNA binding. Finally, they are involved in cellular components including cytosol, protein complex, neoplasms, and organelles. Details of these associations are given in Supplementary Table S5, and the enrichment of the 23-miRNA signature in GO annotations is shown in Supplementary Fig. S5.

the identification of robust miRNAs that are essential for cancer. Moreover, correlated miRNAs may represent similar biological processes. In its optimization process, SVM-HCC selects a minimal set of biomarkers; hence it selected 23 biomarker miRNAs as a signature associated with HCC stage. However, it might not select some important biomarker miRNAs that are also associated with cancer stage in patients with HCC; also, the priority of miRNA selection may change with the size and number of miRNA profiles used. Hence, to select robust miRNAs outside the 23-miRNA signature, we sought to identify miRNAs that were co-expressed with those 23 miRNAs.
We computed the correlations among each miRNA in the signature using the Pearson correlation coefficient. The miRNAs with the highest correlation coefficient in the signature (0.92) were hsa-miR-512 and hsa-miR-518. These two miRNAs were also significantly associated with overall survival in HCC patients, with p values of 0.0022 and 0.0021. Because these miRNAs had high correlation coefficients, we considered them for further analysis. Thus, we sought to analyze the top-ranked individual miRNAs in the signature. The correlation heatmap of the miRNA signature is shown in Fig. 5.
Additionally, we sought to identify the miRNAs that were highly correlated with the miRNA signature in the 540 expression profiles constituting our dataset. To this end, we measured the correlations of the 23-miRNA signature with the 540 miRNA expression profiles. We considered miRNAs with higher correlations to be co-expressed with the signature. These co-expressed miRNAs and their correlation coefficients are listed in Supplementary Table S6. We considered correlations of R ≥ 0.5 to be significant. Five of the top-ranked miRNAs in the signature had co-expressed miRNAs with correlations ≥ 0.5. Hsa-miR-518 and hsa-miR-512 had nine co-expressed miRNAs in common. Three miRNAs, hsa-miR-424, hsa-let-7i, and hsa-miR-320a, had 15, 16, and 1 co-expressed miRNAs, respectively. Furthermore, we examined the biological significance of the top-ranked miRNAs and their co-expressed miRNAs to determine whether they were involved in any common pathways. KEGG pathways enriched for hsa-miR-518, hsa-miR-512, and their co-expressed miRNAs included glycosphingolipid biosynthesis-lacto and neolacto series (hsa00601), folate biosynthesis (hsa00790), one-carbon pool by folate (hsa00670), mucin-type O-glycan biosynthesis (hsa00512), and central carbon metabolism in cancer (hsa05230). Details of the involvement of hsa-miR-518, hsa-miR-512, and their co-expressed miRNAs in KEGG pathways are provided in Supplementary Table S7. Hsa-miR-424 and its co-expressed miRNAs are significantly involved in several cancer pathways, including proteoglycans in cancer (hsa05205), the Hippo signaling pathway (hsa04390), viral carcinogenesis (hsa05203), pathways in cancer (hsa05200), and glioma (hsa05214). Details of the involvement of hsa-miR-424 and its co-expressed miRNAs in KEGG pathways are provided in Supplementary Table S8.
Hsa-let-7i and its co-expressed miRNAs are involved in several cancer pathways, including proteoglycans in cancer, viral carcinogenesis, pathways in cancer, chronic myeloid leukemia, thyroid cancer, bladder cancer, colorectal cancer, glioma, and prostate cancer. Interestingly, we found that hsa-let-7i and its co-expressed miRNAs, hsa-miR-145-5p, hsa-miR-10a-3p, hsa-let-7b-5p, hsa-miR-155-5p, hsa-miR-142-5p, hsa-miR-125a-3p, hsa-miR-199a-5p, hsa-miR-214-3p, hsa-miR-424-3p, hsa-miR-708-3p, hsa-miR-542-5p, hsa-miR-342-5p, and hsa-miR-450a-5p, are significantly (p value of 1.18E-10) involved in the hepatitis B pathway (hsa05161), and target 77 genes. Chronic hepatitis B infection has been linked to HCC [53]. Details of the involvement of hsa-let-7i and its co-expressed miRNAs in KEGG pathways are provided in Supplementary Table S9. Experimentally validated gene interactions for hsa-let-7i and its co-expressed miRNAs in the hepatitis B pathway are shown in Supplementary Table S10. Hsa-miR-320a had a co-expressed miRNA, hsa-miR-1301. These two miRNAs are significantly involved in cancer pathways including transcriptional misregulation in cancer, glioma, viral carcinogenesis, pathways in cancer, colorectal cancer, and pancreatic cancer. Their involvement in biological pathways is shown in detail in Supplementary Table S11. Together, these analyses revealed that not only the 23-miRNA signature, but also its co-expressed miRNAs, are involved in important pathways, and are therefore worthy of further exploration in the context of HCC. These findings could facilitate the development of miRNA-based therapeutic strategies for HCC.
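The co-expression screening described in this section can be illustrated with a minimal sketch: each signature miRNA is correlated with every miRNA outside the signature using the Pearson coefficient, and those with R ≥ 0.5 are retained. The function names and the toy expression values and miRNA labels below are hypothetical and serve only to show the procedure.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # undefined for constant profiles

def coexpressed(profiles, signature, threshold=0.5):
    """For each signature miRNA, return non-signature miRNAs with R >= threshold."""
    result = {}
    for sig in signature:
        hits = {}
        for name, values in profiles.items():
            if name in signature:
                continue
            r = pearson(profiles[sig], values)
            if r >= threshold:
                hits[name] = r
        result[sig] = hits
    return result

# Hypothetical expression values across five patients.
toy_profiles = {
    "sig_miRNA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "cand_up": [2.0, 4.0, 6.0, 8.0, 11.0],
    "cand_down": [5.0, 4.0, 3.0, 2.0, 1.0],
    "cand_noisy": [3.0, 1.0, 4.0, 1.0, 5.0],
}
toy_hits = coexpressed(toy_profiles, signature={"sig_miRNA"})
print(toy_hits)
```

The sketch keeps only positive correlations at or above the threshold, which matches the R ≥ 0.5 criterion as written; whether strongly negative correlations should also be considered is a separate design choice.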

MiRNAs correlated with hepatitis infection.
We measured the correlation between the identified miRNA signature and clinicopathological features of HCC using the Spearman correlation coefficient. Three miRNAs of the signature, hsa-let-7i, hsa-miR-320a, and hsa-miR-2355, were significantly correlated with hepatitis infection, as shown in Supplementary Table S12. Additionally, correlation was measured for hsa-let-7i and its 13 co-expressed miRNAs, which were involved in the hepatitis B pathway. Two of these miRNAs, hsa-miR-145 and hsa-miR-125a, were significantly correlated with hepatitis infection, as shown in Supplementary Table S13.

Conclusions
Detecting liver cancer at an early stage is difficult because its symptoms often appear only at the later stages. Currently, the diversity of miRNAs, and their differential expression in multiple types of cancer, make them worthy of investigation in the context of cancer research. Recently, miRNAs have been explored as biomarkers of various cancers. Identifying miRNA signatures associated with early-stage HCC could provide useful insight into miRNA-mediated diagnosis of this disease. Developing computational methods for early-stage detection based on miRNA expression could elucidate the variants involved in cancer progression. Besides, potential feature selection methods can easily deal with high-dimensional samples such as gene expression profiles.
In this study, we introduced an SVM-based prediction method, SVM-HCC, which incorporates an optimal feature selection algorithm (IBCGA) to identify miRNA signatures capable of distinguishing early-stage and advanced-stage patients with HCC. SVM-HCC identified a 23-miRNA signature associated with early-stage and advanced-stage HCC, and achieved a 10-CV mean accuracy, sensitivity, specificity, and MCC of 92.44 ± 0.99%, 0.96 ± 0.01, 0.78 ± 0.03, and 0.79 ± 0.02, respectively. We prioritized the 23-miRNA signature based on MED scores; miRNAs with higher MED scores contributed more to prediction accuracy. The highest-ranked miRNAs were subjected to further analysis.
We validated the prognostic power of the top-ranked miRNAs in HCC using KM survival curves. The results revealed that 7 of the 10 top-ranked miRNAs, hsa-miR-550a, hsa-miR-574, hsa-miR-424, hsa-let-7i, hsa-miR-549, hsa-miR-518, and hsa-miR-512, were significantly associated with overall survival in patients with HCC. In addition, the top-ranked miRNAs are all significantly involved in HCC, with the exception of hsa-miR-549. This miRNA plays an important role in other cancers, but its role in HCC had not been reported previously. However, our results suggest that hsa-miR-549 is significantly associated with overall survival in patients with HCC, and is therefore worthy of further investigation. KEGG pathway and GO enrichment analyses revealed the functional mechanisms of top-ranked miRNAs in several cancer and non-cancer pathways. Although the identified miRNA signature has the potential to predict the stage of HCC, we also provided information on miRNAs co-expressed with the signature to explore miRNAs beyond this 23-miRNA signature that might provide specific knowledge of its overall impact on HCC. Interestingly, we found that hsa-let-7i and its 13 co-expressed miRNAs were significantly involved in the hepatitis B pathway.
Together, our findings help to explore the role of miRNAs in HCC, and could facilitate early-stage detection and prevention.

Materials and methods
Dataset.
From the TCGA database, we retrieved a dataset containing miRNA expression profiles from 348 patients with liver HCC; these profiles were obtained using the Illumina HiSeq 2000 platform. After filtering, the final dataset contained expression profiles of 540 miRNAs from 348 patients. The HCC stage system in the TCGA dataset was based on the size of the primary tumor (T), the spread of cancer to lymph nodes (N), and distant metastasis (M), according to the American Joint Committee on Cancer. For classification purposes, the dataset was divided into early-stage (stages 1 and 2) and advanced-stage (stages 3 and 4) groups. There were 258 patients in the early-stage group and 90 patients in the advanced-stage group. Clinical characteristics of the patients with HCC used in the current study are displayed in Supplementary Fig. S6.

Establishing the SVM-HCC.
We proposed a method, SVM-HCC, to identify a miRNA signature capable of distinguishing early-stage and advanced-stage HCC based on miRNA expression profiles. SVM-HCC is based on an SVM [29] incorporating the feature selection algorithm IBCGA. SVMs are powerful statistical learning algorithms that use a non-linear transformation to map data from the input space to a higher-dimensional space to identify better predictive models. SVMs have become popular in the biomedical sciences, especially in cancer research, due to their predictive performance [54].
We used miRNA expression profiles of patients with HCC as inputs. SVMs work implicitly by only computing the corresponding kernels in the feature space between two data points, $x_i$ and $x_j$. The SVM kernel function is defined as $K(x_i, x_j) = \phi(x_i)^{T} \phi(x_j)$ [55], where the radial basis function (RBF) is used as the kernel function for the implementation of the SVM. RBF is defined as follows:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right), \quad \gamma > 0.$$

In this study, the SVM parameters C and γ were optimized based on 10-CV. While establishing the SVM-HCC, an optimal feature selection algorithm, IBCGA, was incorporated into the SVM. IBCGA is an intelligent evolutionary algorithm [56] that uses an orthogonal array crossover to solve large parameter optimization problems. In the optimization process, IBCGA selects a minimum number of features, in this case miRNAs, while improving its predictive performance. We have successfully applied IBCGA to various types of cancer predictions [31,33,57,58]. To distinguish early-stage from advanced-stage HCC, the parameter settings of SVM and IBCGA were encoded into binary "genes." In this study, genetic algorithm (GA) terms were used to represent the genes and "chromosomes." We used expression profiles of 540 miRNAs (m = 540) from 348 HCC patients (n = 348) as input. IBCGA parameters were $r_{start} = 10$, $r_{end} = 50$, $N_{pop} = 50$, and $G_{max} = 60$, as used in [31]. The steps involved in IBCGA are as follows.
Step 1: (Evaluation) Evaluate the fitness value of all individuals using the fitness function, which is the prediction accuracy in terms of 10-CV.
Step 2: (Selection) Use a tournament selection method that selects the winner from two randomly selected individuals to generate a mating pool.
Step 3: (Crossover) Select two parents from the mating pool to perform an orthogonal array crossover operation.
Step 4: (Mutation) Apply a conventional mutation operator to randomly selected individuals in the new population. To prevent the highest fitness value from deteriorating, mutation is not applied to the best individuals.
Step 5: (Termination test) If the stopping condition for obtaining the solution is satisfied, then output the best individual as the solution. Otherwise, go to Step 2.
Step 6: (Inheritance) If $r < r_{end}$, randomly change one bit in the binary GA genes for each individual from 0 to 1; increase the number r by one, and go to Step 2. Otherwise, stop the algorithm.
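The steps above can be sketched as a simplified genetic algorithm for feature subset selection. This is not the IBCGA implementation: the orthogonal array crossover is replaced here by a plain bit-swap crossover that keeps the number of selected features fixed, the fitness is a hypothetical toy function rather than SVM 10-CV accuracy, and the mutation rate of 0.1 is an arbitrary assumption. Only the parameter names (r_start, r_end, N_pop, G_max) follow the text.

```python
import random

def toy_fitness(mask, informative):
    """Hypothetical stand-in for 10-CV accuracy: fraction of selected features that are informative."""
    selected = {i for i, bit in enumerate(mask) if bit}
    return len(selected & informative) / len(selected)

def tournament(pop, fit, rng):
    """Pick the fitter of two randomly chosen individuals."""
    a, b = rng.sample(range(len(pop)), 2)
    return pop[a] if fit[a] >= fit[b] else pop[b]

def swap_crossover(p1, p2, rng):
    """Simplified crossover (not orthogonal array): trade one feature, keeping the count of ones."""
    child = p1[:]
    only1 = [i for i in range(len(p1)) if p1[i] and not p2[i]]
    only2 = [i for i in range(len(p1)) if p2[i] and not p1[i]]
    if only1 and only2:
        child[rng.choice(only1)] = 0
        child[rng.choice(only2)] = 1
    return child

def mutate(ind, rng):
    """Swap one selected feature for one unselected feature."""
    ones = [i for i, b in enumerate(ind) if b]
    zeros = [i for i, b in enumerate(ind) if not b]
    ind[rng.choice(ones)] = 0
    ind[rng.choice(zeros)] = 1

def ga_select(m, r_start, r_end, n_pop, g_max, fitness, seed=0):
    """Return {r: (best fitness, best mask)} for every feature count r in [r_start, r_end]."""
    rng = random.Random(seed)
    pop = []
    for _ in range(n_pop):
        ind = [0] * m
        for i in rng.sample(range(m), r_start):
            ind[i] = 1
        pop.append(ind)
    best_per_r, r = {}, r_start
    while True:
        for _ in range(g_max):
            fit = [fitness(ind) for ind in pop]                     # evaluation
            elite = pop[max(range(n_pop), key=fit.__getitem__)][:]
            children = [elite]                                      # best individual is never mutated
            while len(children) < n_pop:
                child = swap_crossover(tournament(pop, fit, rng),   # selection + crossover
                                       tournament(pop, fit, rng), rng)
                if rng.random() < 0.1:                              # assumed mutation rate
                    mutate(child, rng)
                children.append(child)
            pop = children
        fit = [fitness(ind) for ind in pop]
        k = max(range(n_pop), key=fit.__getitem__)
        best_per_r[r] = (fit[k], pop[k][:])
        if r >= r_end:                                              # stop once r reaches r_end
            return best_per_r
        for ind in pop:                                             # inheritance: flip one 0 to 1
            zeros = [i for i, b in enumerate(ind) if not b]
            ind[rng.choice(zeros)] = 1
        r += 1

informative = set(range(5))  # hypothetical "true" features
result = ga_select(m=30, r_start=3, r_end=6, n_pop=20, g_max=15,
                   fitness=lambda mask: toy_fitness(mask, informative))
for r, (score, mask) in sorted(result.items()):
    print(r, round(score, 2), [i for i, b in enumerate(mask) if b])
```

Because each population for r + 1 features is seeded from the final population for r features, good subsets found at smaller sizes are carried forward, which is the role the inheritance step plays in the description above.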
Weka classifier.
We used Weka [59], a powerful data mining tool that provides well-known machine learning algorithms. We compared the predictive performance of SVM-HCC with those of some machine learning methods such as SMO, MLP, naïve Bayes, LibSVM, and random forest. We performed 10-CV to evaluate the performance of the machine learning models.

Evaluation metrics.
We evaluated the predictive performance of the classifier using the following evaluation metrics: sensitivity (SN), specificity (SP), Matthews correlation coefficient (MCC), accuracy (ACC), and area under the ROC curve (AUC).
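$$\mathrm{SN} = \frac{TP}{TP + FN}, \quad \mathrm{SP} = \frac{TN}{TN + FP}, \quad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.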
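As a worked illustration of these metrics, the sketch below computes SN, SP, ACC, and MCC from confusion-matrix counts using their standard definitions. The function name and the counts in the usage example are hypothetical and are not taken from the study.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # MCC set to 0 when undefined
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}

# Hypothetical counts for illustration only.
example = classification_metrics(tp=90, tn=40, fp=10, fn=10)
print(example)
```

With unbalanced classes such as 258 early-stage versus 90 advanced-stage patients, accuracy alone can look favourable for a classifier that mostly predicts the majority class, which is why MCC and specificity are reported alongside it.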

Data availability
All data analyzed during this study are publicly available at the TCGA data portal (https://portal.gdc.cancer.gov/).