Introduction

Hepatocellular carcinoma (HCC) is the third leading cause of cancer deaths globally1. HCC accounts for more than 80% of liver cancers2, and its prevalence is higher in males than in females3. It usually occurs in people aged 30–50 years3. Several factors, including hepatitis B or hepatitis C4,5, alcohol abuse, smoking, obesity, and type 2 diabetes (T2D), are significantly associated with HCC6. Among them, hepatitis B is one of the most prominent risk factors for the development of HCC, responsible for 50% of cases7. Various treatment approaches, namely radiotherapy, chemotherapy, and targeted therapy, have been commonly used to improve the prognosis and reduce the recurrence of HCC. Nevertheless, the survival rate of HCC patients is still low8. The risk of cancer death therefore remains high, owing to the lack of genes for early detection and diagnosis and to limited treatment facilities. It is thus essential to develop a system for identifying the key or core genes for early detection and better prognosis of HCC.

Recently, bioinformatics analysis has been widely utilized to determine the key prognostic genes or biomarkers as well as their associated molecular pathways for multiple cancers, including HCC8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58. Zhou et al.35 identified 15 prognostic biomarkers as well as their associated gene ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways using bioinformatics analysis. Chen et al.39,59,60 also identified 11 potential biomarkers that can play crucial roles in the development and progression of HCC. Qiang et al.40 proposed five core genes that were significantly associated with early diagnosis and poor prognosis of HBV-HCC. Wang et al.41 identified 36 hub DEGs and showed that 10 candidate genes out of the 36 have a significant effect on the tumorigenesis and progression of HCC. Among them, eight candidate genes were inversely related to the survival rate of HCC patients. Dai et al.61 proposed a prognostic model for predicting the prognosis of HCC patients. They identified 17 genes that were potentially associated with the prognosis of HCC patients. These 17 genes were used to build a prognostic model with Cox hazard regression, and its performance was validated using the TCGA and GSE14520 datasets. They showed that six genes were involved in the prognosis of HCC patients. Most researchers simply used hub genes derived from the PPI network to identify the key or core genes. One of the major challenges in studying genetic data is the identification of relevant biomarkers or genes. Recently, machine learning (ML)-based techniques have attracted increasing attention as a way to address this problem59,60,62,63,64,65,66.
Although several studies have been carried out to identify and develop potential candidate genes for HCC8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,67, this remains a challenging issue, and there is still scope for further research on identifying potential genes and on understanding the molecular mechanisms underlying the development, pathogenesis, and progression of HCC.

In this work, we used three microarray gene expression (MGE) datasets as training sets to determine the key or core candidate genes for HCC. First, we selected DEGs individually for each of the three datasets. Secondly, a support vector machine (SVM) with a radial basis function (RBF) kernel was applied to the identified DEGs from each of the three datasets, and the classification accuracy of each DEG was calculated. We selected the DEGs from each of the three datasets that provided a classification accuracy of more than 80.0%. The overlapping or shared DEGs were then identified across the three datasets. These overlapping or shared DEGs were called differentially expressed discriminative genes (DEDGs). Thirdly, DAVID was used to perform enrichment analysis on the common DEDGs. Fourthly, PPI networks were constructed using STRING and visualized using Cytoscape. The hub genes were then identified using degree, maximum neighborhood component (MNC), maximal clique centrality (MCC), closeness, and betweenness, as implemented in cytoHubba. After that, the central hub genes were determined as the hub genes shared among the degree, MNC, MCC, closeness, and betweenness rankings. Molecular Complex Detection (MCODE) was used for cluster or module analysis to determine the significant modules and their associated genes. Moreover, the significant meta-hub genes were determined from meta-hub genes, which were extracted from existing studies. The key or core candidate genes, which can readily discriminate HCC patients from healthy controls, were determined as those shared among the central hub genes, potential module hub genes, and significant meta-hub genes. Furthermore, we used two additional independent test datasets to validate the key candidate genes and to show their discriminative power. We also performed a survival analysis of the identified key candidate genes for HCC patients.
The overall flowchart of our proposed system for determining key candidate genes for HCC is presented in Fig. 1.

Figure 1 Flowchart of the proposed system for the identification of key candidate genes for HCC.

Results

Identification of DEGs from each dataset

We used limma to identify DEGs in each of the three GEO datasets (GSE36376, GSE39791, and GSE57957). Using the thresholds \(|log_2 FC|>1\) and adj. p-value < 0.01, we identified 699 DEGs (up-regulated: 431; down-regulated: 268), 428 DEGs (up-regulated: 88; down-regulated: 340), and 413 DEGs (up-regulated: 107; down-regulated: 306) between HCC and healthy controls from the GSE36376, GSE39791, and GSE57957 datasets, respectively. Their volcano plots and heatmaps are presented in Fig. 2.
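The thresholding rule above can be sketched in a few lines of Python. This is only an illustration, not the authors' limma workflow in R: the function name `classify_deg` and the toy records are hypothetical, and only the two cut-offs (\(|log_2 FC|>1\) and adj. p-value < 0.01) are taken from the text.

```python
def classify_deg(log2_fc, adj_p, fc_cutoff=1.0, p_cutoff=0.01):
    """Label one gene as 'up', 'down', or 'ns' (not significant)."""
    if adj_p < p_cutoff and abs(log2_fc) > fc_cutoff:
        return "up" if log2_fc > 0 else "down"
    return "ns"

# Hypothetical (gene, log2 FC, adjusted p-value) records, for illustration only.
toy_results = [
    ("GENE_A", 2.3, 0.0004),
    ("GENE_B", -1.7, 0.002),
    ("GENE_C", 0.4, 0.0001),  # significant p-value, but fold change too small
    ("GENE_D", 1.9, 0.03),    # large fold change, but p-value too high
]

labels = {gene: classify_deg(fc, p) for gene, fc, p in toy_results}
n_up = sum(1 for v in labels.values() if v == "up")
n_down = sum(1 for v in labels.values() if v == "down")
print(labels, n_up, n_down)
```

Counting the "up" and "down" labels per dataset is how totals of the kind reported above (for example 431 up-regulated and 268 down-regulated DEGs) would be obtained.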

Figure 2 Volcano plots and heatmaps of DEGs for each GEO dataset, generated using the “ggplot2” version 3.3.6 package110 (https://cran.r-project.org/package=ggplot2) and the “NMF” version 0.24.0 package111 (https://cran.r-project.org/package=NMF) in R. (a) Volcano plot and (b) heatmap of the GSE36376 dataset; (c) volcano plot and (d) heatmap of the GSE39791 dataset; (e) volcano plot and (f) heatmap of the GSE57957 dataset. Dodger blue represents down-regulated DEGs, gray represents non-significant genes, and fire brick represents up-regulated DEGs.

Identification of common DEDGs using SVM

SVM with an RBF kernel was applied to the identified DEGs (699 DEGs for GSE36376; 428 DEGs for GSE39791; and 413 DEGs for GSE57957) of each dataset in order to identify the DEDGs of HCC patients. The classification accuracy was then computed per gene for the DEGs from each dataset. The calculation procedure is described in the methodology section. The classification accuracies of all DEGs for the individual datasets were sorted in descending order and are presented in Fig. 3. As shown in Fig. 3, a total of 502 DEGs from GSE36376, 169 from GSE39791, and 242 from GSE57957 were selected as DEDGs because their classification accuracy was more than or equal to 80.0%. Furthermore, 75 common DEDGs were determined among the DEDGs identified from the GSE36376, GSE39791, and GSE57957 datasets, as shown in Fig. 4.

Figure 3 Classification accuracy of individual genes using SVM for three GEO datasets: (a) GSE36376; (b) GSE39791; and (c) GSE57957.

Figure 4 Identification of common or overlapping DEDGs among DEDGs from GSE36376, GSE39791, and GSE57957 datasets.

Enrichment analysis of common DEDGs

Enrichment analysis was conducted on the 75 shared or overlapping DEDGs to better understand the mechanism and development of HCC. The functional characteristics of the DEDGs were explored using GO and KEGG pathway analysis. The GO analysis was partitioned into three groups: biological process (BP), cellular component (CC), and molecular function (MF). Using p-values \((< 0.05)\), we identified the significant GO terms and KEGG pathways, and chose the top five GO terms and KEGG pathways. The top five GO terms for BP, CC, and MF are presented in Table 1.

Table 1 GO analysis of common DEDGs in terms of BP, CC, and MF. Top 5 items were selected.

For BP-based GO terms, the common DEDGs were strongly enriched in retinol metabolic process, cellular response to cadmium ion, retinoid metabolic process, cellular response to copper ion, and steroid catabolic process. Moreover, the extracellular region, extracellular exosome, extracellular space, high-density lipoprotein particle, and apical plasma membrane were the top CC terms significantly enriched with common DEDGs. As shown in Table 1, the MF group GO terms, including retinol dehydrogenase activity; oxidoreductase activity; androsterone dehydrogenase activity; androstan-3-alpha,17-beta-diol dehydrogenase activity; and steroid dehydrogenase activity, were mainly enriched with common DEDGs.

The study of the KEGG pathway for common DEDGs is displayed in Table 2. As shown in Table 2, the common DEDGs were significantly associated with multiple pathways such as retinol metabolism, metabolic pathways, tryptophan metabolism, steroid hormone biosynthesis, and drug metabolism-cytochrome P450.

Table 2 KEGG pathway analysis of common DEDGs. Top five items were selected.

PPI network construction and central hub genes identification

STRING was utilized to build a PPI network to show the significant connections between proteins encoded by common DEDGs. Cytoscape was used to show the PPI network, which had 51 nodes and 144 edges (see Fig. 5a). Five hub gene-based identification algorithms, including the degree of connectivity, MNC, MCC, closeness, and betweenness in the Cytoscape plug-in cytoHubba, were implemented to determine the hub genes from PPI networks. Then we chose the top 30 hub genes from each algorithm. We made a Venn diagram among the five algorithms, which is shown in Fig. 5b. As shown in Fig. 5b, eight overlapping central hub genes were identified among these algorithms. These eight central hub genes were NUSAP1, TOP2A, CDC20, PRC1, UBE2C, ASPM, PNPLA7, and MT1E, which were utilized to determine the key or core genes for HCC.

Figure 5 PPI network and Venn diagram for common DEDGs and central hub genes. (a) PPI network of common DEDGs with 51 nodes and 144 edges, generated by Cytoscape 3.9.1118 (www.cytoscape.org); (b) identification of central hub genes among five methods (Degree, MNC, MCC, Closeness, and Betweenness based HGs). Here, HGs represent the hub genes.

Hub modules and their associated genes identification

Module or cluster analysis was performed using MCODE to determine the prominent modules. MCODE generated three clusters or modules, with MCODE scores ranging from 3 to 6. We chose the prominent modules as those with MCODE scores of \(\ge 5\) and a number of nodes of \(\ge 5\). Finally, we chose module 1 as the prominent hub module; it contained 6 nodes and 30 edges and had the highest MCODE score of 6. Its PPI network is displayed in Fig. 6. The corresponding six genes were treated as hub module genes.

Figure 6 PPI network of module 1 with 6 nodes and 30 edges, generated by Cytoscape 3.9.1118 (www.cytoscape.org).

Identification of significant meta-hub genes from metadata

We reviewed 52 existing studies related to gene identification of HCC patients8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58. We listed their hub genes in order to build the metadata, which are presented in Table 3. To build the metadata, we extracted 10 hub genes from Maddah et al.9, 5 hub genes from Yan et al.10, 20 from Zhao et al.11, 7 from Zhao et al.12, 10 from Liu et al.13, 11 from Meng et al.14, 42 from Rosli et al.15, 5 from Zhang et al.8, 5 from Li et al.16, 8 from Li et al.17, 5 from Tian et al.18, 12 from Wan et al.19, 10 from Zhu et al.20, 10 from Wang et al.21, 9 from Zhou et al.22, 10 from Zhang et al.23, 18 from Mou et al.24, 8 from Wu et al.25, 9 from Gui et al.26, 10 from Wang et al.27, 28 from Lu and Zhu28, 6 from Bhatt et al.29, 10 from Zhang et al.30, 13 from Jiang et al.31, 20 from Zhang et al.32, 12 from Wu et al.33, 5 from Nguyen et al.34, 15 from Zhou et al.35, 6 from Yu et al.36, 10 from Kakar et al.37, 10 from Ji et al.38, 11 from Chen et al.39, 10 from Qiang et al.40, 10 from Wang et al.41, 10 from Zhang et al.42, 14 from Kim et al.43, 10 from Zhang et al.44, 14 from Sha et al.45, 10 from Chen et al.46, 4 from He et al.47, 10 from Zhang et al.48, 4 from Hu et al.49, 9 from Zhang et al.50, 15 from Li et al.51, 5 from Cao et al.52, 7 from Yang et al.53, 5 from Wang et al.54, 9 from Jiang et al.55, 16 from Li et al.56, 15 from Xing et al.57, 10 from Zhu W et al.58, and 20 from Dai et al.61. We then took the union of the extracted hub genes and obtained 214 hub genes as meta-hub genes. We also computed the frequency of each meta-hub gene, defined as the number of studies that reported that gene as a hub gene, and selected 52 significant meta-hub genes because their frequency was more than 3. These 52 significant meta-hub genes were utilized for the determination of key genes.

Table 3 Formation of metadata by listing hub genes from existing studies.

Key candidate genes identification

Eight central hub genes were identified from five methods (degree of connectivity, MNC, MCC, closeness, and betweenness), 6 hub module genes from potential hub modules, and 52 significant meta-hub genes from meta-hub genes. Six overlapping genes were identified from these three gene sets using a Venn diagram, which is presented in Fig. 7. These six genes (TOP2A, CDC20, ASPM, PRC1, UBE2C, and NUSAP1) were considered key genes, which can readily classify subjects as HCC or healthy.

Figure 7 Identification of key candidate genes of HCC from central hub genes, hub module genes, and significant meta-hub genes.

Validation of key candidate genes

Discriminative power analysis using ROC curve

Six key or core genes (TOP2A, CDC20, ASPM, PRC1, UBE2C, and NUSAP1) were validated using AUC, computed from ROC curves. We compared the performance of two independent test datasets (GSE76427 and TCGA-LIHC) with one of our train datasets (GSE57957) in order to show the precision of the selected key candidate genes. The ROC curves of six key genes as well as their heatmap for both training and independent test datasets were illustrated in Fig. 8.

Figure 8 Validation of the six key candidate genes using AUC and heatmap: (a), (b) GSE57957-based training dataset; (c), (d) GSE76427-based independent test dataset; and (e), (f) TCGA-LIHC-based independent test dataset. ROC curves were generated using the pROC version 1.18.0 package121 and heatmaps were generated using the “NMF” version 0.24.0 package in R111.

The ROC curves of the six key candidate genes with their AUC values for the training dataset (GSE57957) are displayed in Fig. 8a: TOP2A (AUC: 0.936, 95% CI 0.871–1.000), CDC20 (AUC: 0.917, 95% CI 0.838–0.996), ASPM (AUC: 0.919, 95% CI 0.851–0.987), PRC1 (AUC: 0.938, 95% CI 0.871–1.000), UBE2C (AUC: 0.803, 95% CI 0.703–0.904), and NUSAP1 (AUC: 0.930, 95% CI 0.895–1.000). As displayed in Fig. 8c, the AUC values of the six key or core genes were all above 0.780. The AUC values of the six key or core genes for the GSE76427 dataset were: TOP2A (AUC: 0.900, 95% CI 0.851–0.949), CDC20 (AUC: 0.887, 95% CI 0.883–0.941), ASPM (AUC: 0.893, 95% CI 0.844–0.942), PRC1 (AUC: 0.931, 95% CI 0.889–0.975), UBE2C (AUC: 0.792, 95% CI 0.723–0.863), and NUSAP1 (AUC: 0.881, 95% CI 0.831–0.933).

Similarly, the ROC curves of the six key candidate genes with their AUC values for the TCGA-LIHC independent test dataset are presented in Fig. 8e. As presented in Fig. 8e, all six key candidate genes achieved AUC values of more than 0.900, and their individual AUC values were as follows: TOP2A (AUC: 0.961, 95% CI 0.939–0.984), CDC20 (AUC: 0.968, 95% CI 0.949–0.986), ASPM (AUC: 0.960, 95% CI 0.938–0.983), PRC1 (AUC: 0.967, 95% CI 0.948–0.987), UBE2C (AUC: 0.965, 95% CI 0.946–0.985), and NUSAP1 (AUC: 0.919, 95% CI 0.889–0.949). Therefore, these six key genes (TOP2A, CDC20, ASPM, PRC1, UBE2C, and NUSAP1) showed strong discriminative power to distinguish HCC patients from healthy controls. These validations support our findings and make them more robust.

Survival analysis

In this work, we performed a survival analysis of the six key candidate genes (TOP2A, CDC20, ASPM, PRC1, NUSAP1, and UBE2C) using univariate Cox regression in R, and the results are presented in Fig. 9. As shown in Fig. 9, the six key candidate genes identified for HCC patients (TOP2A, CDC20, ASPM, PRC1, NUSAP1, and UBE2C) were strongly associated with the survival status of HCC patients (\(\hbox {p}<0.05\)). Patients with over-expression of TOP2A, CDC20, ASPM, PRC1, NUSAP1, and UBE2C had shorter survival than patients with lower expression levels of these key candidate genes.

Figure 9 Survival analysis of six key candidate genes for HCC: (a) TOP2A; (b) CDC20; (c) ASPM; (d) PRC1; (e) NUSAP1; and (f) UBE2C. The horizontal axis (x-axis) represents the time to event (in days) and the vertical axis (y-axis) represents survival probability. The HCC patients were divided into two groups, high-risk and low-risk, and each group was assigned a color. The red line designates the samples with high risk, and the green line represents the samples with low risk. \(\hbox {p} < 0.05\) indicates a statistically significant difference in mortality between groups. The survival plots were generated using the “Survfit” package in R122.

Discussion

In this work, we assessed three datasets, namely GSE36376, GSE39791, and GSE57957, to detect the DEGs for HCC patients. We determined 699, 428, and 413 DEGs using “limma” from the GSE36376, GSE39791, and GSE57957 datasets, respectively, as illustrated in Fig. 2. Moreover, we implemented SVM to determine the DEDGs from the individual datasets (see Fig. 3) and selected 75 overlapping or shared DEDGs among the DEDGs identified from the GSE36376, GSE39791, and GSE57957 datasets, as shown in Fig. 4. Enrichment analysis was then performed on the overlapping or shared DEDGs to better understand their molecular mechanisms (see Table 1). We found that the top BP functional categories were strongly related to the development and progression of HCC. Retinol and retinoid metabolic processes have been linked to a variety of liver diseases, including fatty liver disease, which leads to HCC68,69. The remaining BP categories enriched with common DEDGs also agreed with existing studies, namely cellular response to cadmium ion42,57,70, cellular response to copper ion36,70, and steroid catabolic process42.

The top five CC terms were significantly enriched with common DEDGs, which was also consistent with previous results: extracellular region35,37,38,57, extracellular exosome37,38, extracellular space37,38,57, high-density lipoprotein particle57, and apical plasma membrane53. In the case of MFs, the common DEDGs were also enriched in the top five GO terms. Existing studies supported these enriched functional categories, including retinol dehydrogenase activity14 and oxidoreductase activity37,38,42. We also analyzed KEGG pathways and chose five pathways that were closely related to our overlapping DEDGs (see Table 2). Several existing studies supported our findings, namely retinol metabolism35,37,38,40,43,70, metabolic pathways37,38, tryptophan metabolism38,42,70, steroid hormone biosynthesis42,70, and drug metabolism-cytochrome P45035,42,70.

A PPI network was built with the shared DEDGs using Cytoscape (see Fig. 5a), and then eight central hub genes (NUSAP1, TOP2A, CDC20, PRC1, UBE2C, ASPM, PNPLA7, and MT1E) were identified from five hub gene selection methods, as presented in Fig. 5b. The potential modules were identified using MCODE scores, and module 1 was selected because it had the highest MCODE score. We selected six hub module genes from module 1 and constructed their PPI network (see Fig. 6). In addition, we examined 52 papers and took the hub genes from these earlier studies8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58 in order to build the metadata. We listed 214 meta-hub genes by taking the union of the extracted hub genes, which are presented in Table 3. We selected 52 significant meta-hub genes from the list of meta-hub genes whose frequency was greater than 3. Finally, we identified six shared genes (TOP2A, CDC20, ASPM, PRC1, UBE2C, and NUSAP1), referred to as key relevant or candidate genes, by intersecting the central hub genes, the hub module genes, and the significant meta-hub genes extracted from the earlier studies, as depicted in Fig. 7. We validated these key relevant or candidate genes using AUC for one training and two independent test datasets (see Fig. 8). We observed that these six key relevant or candidate genes had high discriminative power for the differentiation of HCC patients.

TOP2A is a cell cycle-related gene that encodes a DNA topoisomerase, which controls and alters the topologic states of DNA during transcription. TOP2A overexpression has been identified as a core or potential biomarker for ovarian cancers71, glioma72, and lung cancers73. A study showed that TOP2A overexpression in HCC patients was significantly correlated with progression and poor prognosis74,75. In our study, TOP2A was also considered a key or core gene for the progression and development of HCC. This finding coincided with previous studies12,14,15,18,20,21,22,23,24,25,27,32,33,34,35,36,39,41,42,43,45,46,48,50,51,54,55,56,57,58,61.

CDC20 is a vital regulator of cell division in humans76,77. Overexpression or high expression of CDC20 has also been linked to lung cancer78, colorectal cancer79, breast cancer80,81, and other cancers. Moreover, CDC20 was strongly correlated with poor prognosis in gastric cancer82, bladder cancer83, and breast cancer84. A study revealed that CDC20 over-expression was significantly associated with HCC85. Another recent study demonstrated a strong relationship between CDC20 overexpression and the prognosis of HCC86. Our findings also showed that CDC20 was a potential key biomarker that played a crucial role in the development and progression of HCC. Several existing studies also supported our findings10,13,14,17,18,20,24,30,32,37,40,41,44,48,50,55,56,57.

ASPM is a protein that has a major influence on the development of HCC. The ASPM gene is located on chromosome 1, band 1q31, and consists of 28 exons encoding a protein of 3477 amino acids87. Many studies have identified ASPM as a hub gene or key biomarker for multiple cancers88,89,90. Zhang et al.90 reported that ASPM can be a promising therapeutic target for liver cancer. Moreover, ASPM overexpression was strongly correlated with bladder cancer and considered a promising predictor91. Our findings also illustrated that ASPM was a novel key biomarker for HCC, which was supported by existing studies9,22,35,38,39,41,42,43,45,46,47,48,58.

PRC1 is an essential protein that is the regulator of cytokinesis92. The higher expression level of PRC1 was found among HCC patients than healthy controls. The overexpression of PRC1 was associated with a poor prognosis for HCC patients93. Our work also indicated that PRC1 was a promising or key biomarker for the development of HCC, which coincided with previous studies15,22,25,33,35,39,42,43,45,46,56,57,61.

Similarly, we proposed UBE2C as a key or core predictor for the development of HCC, which was supported by various existing studies10,18,33,36,41,44,58. Xiong et al.94 suggested UBE2C as a potential biomarker or gene for HCC. Higher expression of UBE2C was also found in HCC patients than in healthy subjects95. UBE2C plays a crucial role not only in HCC but also in a variety of other cancers, such as lung cancer and gastric cancer96,97.

NUSAP1 is a nucleolar-spindle-associated protein that has a vital role in spindle microtubule organization98. Overexpression of NUSAP1 was found in a variety of malignancies, including HCC58,99, colon cancer100,101, prostate cancer102,103, and cervical carcinoma104. Moreover, overexpression of NUSAP1 was strongly linked with poor prognosis of prostate cancer103 and colon cancer101. Another study revealed that NUSAP1 is related to HCC105. Roy et al.105 illustrated that NUSAP1 expression might rise in HCC samples with low expression levels of miRNA 193a-5p, and that this overexpression was strongly associated with a shorter patient survival time. Our findings also illustrated that NUSAP1 was one of the key candidate genes, with higher expression levels in HCC subjects than in healthy subjects. These findings were consistent with existing studies15,22,46,56,58,61.

Moreover, two independent test datasets were also used to validate these six key candidate genes using AUC. A survival analysis of these six candidate genes was also performed for HCC patients. In both cases, our six identified key candidate genes (TOP2A, CDC20, ASPM, PRC1, UBE2C, and NUSAP1) showed a significant association with the development and progression of HCC. These findings provide evidence and new insight for physicians and readers regarding the diagnosis of HCC and its associated pathways.

Materials and methods

Data acquisition and preprocessing

In this work, three publicly available microarray gene expression datasets with GEO accessions GSE3637666, GSE39791106, and GSE57957107, all on the GPL10558 platform [Illumina HumanHT-12 V4.0 expression bead chip], were used to determine the key candidate genes. Two further independent test datasets were used to validate the key candidate genes. One independent dataset was taken from the GEO database with accession number GSE76427 with GPL10558 platform102, and the other was taken from the Cancer Genome Atlas (TCGA) database. The microarray gene expression datasets were downloaded from the GEO database (www.ncbi.nlm.nih.gov/geo/) and the TCGA-liver hepatocellular carcinoma (TCGA-LIHC) dataset was downloaded from the TCGA database (https://portal.gdc.cancer.gov/). The datasets underwent a log2 transformation and quantile normalization. Although these datasets were taken from publicly available repositories, they are human data, and all methods were performed in accordance with the relevant guidelines and regulations. Table 4 presents a summary of the utilized datasets.
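As a rough illustration of the two preprocessing steps named above, the following Python sketch applies a log2 transformation followed by quantile normalization to a small genes-by-samples matrix. It is a simplified stand-in, not the authors' R pipeline: the function names and the toy matrix are hypothetical, and ties are broken by row order rather than averaged.

```python
import math

def log2_transform(matrix):
    """Apply log2 to every expression value (all values must be positive)."""
    return [[math.log2(v) for v in row] for row in matrix]

def quantile_normalize(matrix):
    """Quantile-normalize the columns (samples) of a genes x samples matrix."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    rank_means = [sum(col[k] for col in sorted_cols) / n_samples
                  for k in range(n_genes)]
    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            normalized[g][s] = rank_means[rank]
    return normalized

# Hypothetical raw intensities: 3 genes (rows) x 3 samples (columns).
toy = [[2.0, 4.0, 4.0],
       [8.0, 16.0, 32.0],
       [4.0, 2.0, 8.0]]
norm = quantile_normalize(log2_transform(toy))
for row in norm:
    print(row)
```

After normalization every sample (column) shares the same set of values, which is the defining property of quantile normalization.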

Table 4 Summary of utilized HCC datasets.

Identification of DEGs from each dataset

To identify the DEGs between HCC and healthy controls, each of the selected datasets was analyzed using the “limma” package108 in R version 4.1.2. We computed the \(|log_2 FC|\) and adj. p-value of each gene from the selected dataset. The “Bioconductor annotation”109 package was used to convert microarray probes into gene symbols. If multiple probes matched the same gene symbol, we retained the probe, with its associated expression values, that provided the lowest adjusted p-value. The DEGs between HCC and healthy controls were identified with the cutoff points \(|log_2FC|>1\) and \(adj. p-value<0.01\) (false discovery rate). The volcano plot of DEGs was generated using the “ggplot2 version 3.3.6” package in R110. Moreover, a heatmap of the expression of DEGs was generated with the “NMF” version 0.24.0 package in R111.
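The rule for handling several probes that map to one gene symbol can be sketched as follows. This is a minimal Python illustration under stated assumptions: `collapse_probes`, the probe identifiers, and the gene names are hypothetical, and the only element taken from the text is the criterion of keeping the probe with the lowest adjusted p-value.

```python
def collapse_probes(probe_rows):
    """Keep, for each gene symbol, the probe with the smallest adjusted p-value.

    probe_rows: iterable of (probe_id, gene_symbol, adj_p) tuples.
    Returns a dict mapping gene_symbol -> (probe_id, adj_p).
    """
    best = {}
    for probe_id, symbol, adj_p in probe_rows:
        if symbol not in best or adj_p < best[symbol][1]:
            best[symbol] = (probe_id, adj_p)
    return best

# Hypothetical probe-level results, for illustration only.
rows = [
    ("probe_1", "GENE_A", 0.02),
    ("probe_2", "GENE_A", 0.001),  # lower adjusted p-value, so this one is kept
    ("probe_3", "GENE_B", 0.2),
]
selected = collapse_probes(rows)
print(selected)
```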

SVM-based identification of DEDGs from DEGs for each dataset

The main purpose of SVM is to identify a hyperplane in a high-dimensional space112,113 that can discriminate HCC patients from healthy controls using the following discriminant function:

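$$\begin{aligned} f(x)=\ \sum _{i=1}^{n}{\alpha _iK(x_i,\ x)}+b \end{aligned}$$
(1)

where \(x_i\) are the training samples, x is the sample to be classified, \(\alpha _i\) are the learned coefficients, n is the number of training samples, and b is the bias term.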

In this study, we have used radial basis kernel, which is defined as follows:

$$\begin{aligned} K(x_i,\ x_j)=\text {exp}(-\gamma \Vert x_i-x_j \Vert ^2) \end{aligned}$$
(2)
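As a small worked example of Eq. (2), the sketch below evaluates the radial basis kernel for two vectors in plain Python. The function name `rbf_kernel` and the input values are illustrative assumptions; the formula itself is the one given above.

```python
import math

def rbf_kernel(x_i, x_j, gamma):
    """Radial basis kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)

same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)  # identical points
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)  # squared distance 2
print(same, apart)
```

Identical points give a kernel value of 1, and the value decays towards 0 as the points move apart; a larger \(\gamma\) makes that decay faster, which is why \(\gamma\) is tuned together with C.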

We tuned the cost (C) and gamma \((\gamma )\) parameters over a range of values using a grid search and selected the optimal values of C and \(\gamma \) to improve classification accuracy. In the current study, we adopted SVM as a gene selection method, and its identification procedure is described as follows:

  1. Step 1

    Select one gene from a list of identified DEGs.

  2. Step 2

    Train an SVM-based model on this gene with a five-fold cross-validation (CV) protocol.

  3. Step 3

    Calculate the classification accuracy for this selected gene.

  4. Step 4

    Repeat Step 1 to Step 3 for all identified DEGs.

  5. Step 5

    Sort the classification accuracy of all DEGs in descending order of magnitude.

  6. Step 6

    Choose the genes that produce a classification accuracy of more than 80.0%.
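The six steps above can be sketched as a per-gene screening loop. The Python below is a minimal illustration only: to stay dependency-free it uses a simple nearest-class-mean rule as a stand-in classifier, not the RBF-kernel SVM with grid-searched C and \(\gamma\) that the study used, and the fold assignment, function names, and toy data are all hypothetical. It also uses a strict "more than 80.0%" cut-off, following Step 6.

```python
def five_fold_indices(n, k=5):
    """Split indices 0..n-1 into k interleaved folds (deterministic)."""
    return [list(range(f, n, k)) for f in range(k)]

def nearest_mean_predict(train_x, train_y, x):
    """Stand-in classifier (NOT an SVM): pick the class with the closer mean."""
    means = {}
    for label in sorted(set(train_y)):
        vals = [v for v, y in zip(train_x, train_y) if y == label]
        means[label] = sum(vals) / len(vals)
    return min(means, key=lambda label: abs(x - means[label]))

def cv_accuracy(values, labels, k=5):
    """Steps 2-3: k-fold cross-validated accuracy (%) for a single gene."""
    correct = 0
    for test_idx in five_fold_indices(len(values), k):
        held_out = set(test_idx)
        train_x = [v for i, v in enumerate(values) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        correct += sum(
            nearest_mean_predict(train_x, train_y, values[i]) == labels[i]
            for i in test_idx
        )
    return 100.0 * correct / len(values)

def select_dedgs(expression, labels, threshold=80.0):
    """Steps 1-6: score each gene, sort descending, keep those above threshold."""
    scores = {gene: cv_accuracy(vals, labels) for gene, vals in expression.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(gene, acc) for gene, acc in ranked if acc > threshold]

# Hypothetical data: 10 samples, label 1 = HCC, label 0 = healthy control.
toy_labels = [1, 0] * 5
toy_expression = {
    "GENE_SEP": [9.0 + 0.1 * i if y == 1 else 5.0 + 0.1 * i
                 for i, y in enumerate(toy_labels)],  # well separated
    "GENE_FLAT": [7.0] * 10,                          # carries no signal
}
selected = select_dedgs(toy_expression, toy_labels)
print(selected)
```

Swapping `nearest_mean_predict` for an RBF-kernel SVM would bring the sketch closer to the described method while leaving the screening loop unchanged.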

Identification of common DEDGs

After selecting differentially expressed discriminative genes (DEDGs) using SVM, we identified the common (shared or overlapping) DEDGs among the three datasets using the following formula:

$$\begin{aligned} \text {Common DEDGs =}\bigcap _{i=1}^{r}{\text {Identified DEDGs from GEO Datasets}}_i \end{aligned}$$
(3)

where r is the number of GEO datasets used (here, r = 3).
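Eq. (3) is a plain set intersection, which can be sketched in Python as follows. The function name and the placeholder gene identifiers are hypothetical; they do not correspond to the actual DEDG lists.

```python
from functools import reduce

def common_genes(gene_sets):
    """Intersect any number of gene sets (Eq. 3 with r = len(gene_sets))."""
    return reduce(set.intersection, (set(s) for s in gene_sets))

# Hypothetical DEDG lists from r = 3 datasets.
dedgs_per_dataset = [
    {"G1", "G2", "G3", "G4"},
    {"G2", "G3", "G5"},
    {"G2", "G6", "G3", "G7"},
]
common = common_genes(dedgs_per_dataset)
print(sorted(common))
```

The same intersection pattern applies to Eqs. (4) and (8), with the hub gene lists or the three gene-identification results in place of the dataset-level DEDG lists.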

Enrichment analysis of common DEDGs

To better understand the mechanism and progression of HCC, we performed enrichment analysis, including GO and KEGG analysis114,115, on DEDGs using DAVID version 6.8 tools116 (david.ncifcrf.gov). A p-value < 0.05 was considered significant.

PPI network analysis and central hub gene identification

The STRING version 11.5 software (www.string-db.org) was utilized to obtain the potential interactions among common DEDGs117. Protein-protein interactions (PPIs) with a confidence score of \(> 0.70\) and a maximum number of interactors of 0 were retained and loaded into Cytoscape version 3.9.1118 to build a PPI network. The degree of connectivity, maximum neighborhood component (MNC), maximal clique centrality (MCC), closeness centrality, and betweenness centrality were computed using cytoHubba119. Then, for each of these five measures, we sorted the values in descending order and chose the top 30 DEDGs, known as hub genes. The central hub genes were selected as the hub genes shared among the five measures. Mathematically, this is defined as follows:

$$\begin{aligned} \text {Central Hub Genes=}\bigcap _{i=1}^{hg}{\text {Hub Genes from Identification Methods}}_i \end{aligned}$$
(4)

where, hg is the number of hub gene identification methods (Here, hg=5).
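To make the ranking-then-intersecting idea concrete, the Python sketch below computes only the simplest of the five measures, the degree of connectivity, from an edge list and keeps the top-k nodes. It is an illustration under assumptions: cytoHubba was used in the study, the edge list and the second ranking are hypothetical, and ties are broken alphabetically for determinism.

```python
from collections import Counter

def top_k_by_degree(edges, k):
    """Rank nodes by number of incident edges and return the top k as a set."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Sort by degree (descending), then by name so that ties are deterministic.
    ranked = sorted(degree, key=lambda node: (-degree[node], node))
    return set(ranked[:k])

# Hypothetical PPI edges among placeholder proteins.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
hub_by_degree = top_k_by_degree(edges, k=2)

# Hypothetical top-2 set from another centrality measure.
hub_by_other = {"A", "C"}
central_hubs = hub_by_degree & hub_by_other  # Eq. (4) with hg = 2
print(hub_by_degree, central_hubs)
```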

Hub modules and their associated genes identification

MCODE was used to determine the most closely connected modules from the PPI network120. We analyzed the modules with the following settings: degree cutoff = 2, cluster finding = haircut, node score cutoff = 0.2, K-core = 2, and max depth = 100. We selected as potential modules those with MCODE scores of \(\ge 6\) and a number of nodes of \(\ge 6\). Then, the hub module genes were identified using the following formula:

$$\begin{aligned} \text {Hub Module Genes =}\bigcup _{i=1}^{h_m}{\text {Genes from Module}}_i \end{aligned}$$
(5)

where, \(h_m\) is the number of significant modules.

Significant meta-hub genes identification from metadata

We reviewed some existing studies related to HCC-based gene identification. To make metadata, we listed their identified hub genes for HCC, called “meta-hub genes,” which can be written as follows:

$$\begin{aligned} \text {Meta-Hub Genes} =\bigcup _{i=1}^{m}{\text {Hub Genes from Previous Study}}_i \end{aligned}$$
(6)

where m is the number of studies from which hub genes were obtained (here, m = 52).

We also counted the frequency of each meta-hub gene depending on how many studies identified that gene as a hub gene. Finally, we identified significant meta-hub genes from meta-hub genes whose frequency was greater than or equal to 3, which can be written as follows:

$$\begin{aligned} \text {Significant Meta-Hub Genes} =\{g_i\}; i=1,2,...,n \end{aligned}$$
(7)

where, \(g_i \in \text {meta-hub gene}\) and n is the number of meta-hub genes whose frequency is \(\ge 3\)
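Eqs. (6) and (7) amount to a union followed by a frequency filter, which the following Python sketch illustrates. The study lists and gene placeholders are hypothetical, and the threshold follows the "greater than or equal to 3" rule stated in this section.

```python
from collections import Counter

def significant_meta_hub_genes(study_hub_lists, min_frequency=3):
    """Return (meta-hub genes, significant meta-hub genes, frequency table)."""
    # Count each gene at most once per study.
    freq = Counter(g for hubs in study_hub_lists for g in set(hubs))
    meta_hub = set(freq)  # Eq. (6): union over all studies
    significant = {g for g, n in freq.items() if n >= min_frequency}  # Eq. (7)
    return meta_hub, significant, freq

# Hypothetical hub gene lists from m = 4 previous studies.
studies = [
    ["G1", "G2", "G3"],
    ["G1", "G2"],
    ["G1", "G4"],
    ["G2", "G1", "G5"],
]
meta_hub, significant, freq = significant_meta_hub_genes(studies)
print(sorted(meta_hub), sorted(significant))
```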

Key candidate genes identification

To identify the key candidate genes, we selected the central hub genes from the PPI network, hub module genes from significant modules, and significant meta-hub genes from existing studies. Therefore, we identified the key candidate genes for HCC using the following formula:

$$\begin{aligned} \text {Key Candidate Genes =}\bigcap _{i=1}^{k}{\text {Important Genes from Identification Methods}}_i \end{aligned}$$
(8)

where k is the number of significant gene identification methods (here, k = 3). In this work, the identification of central hub genes, hub module genes, and significant meta-hub genes are considered the “Important Gene Identification Methods”.

Validation of key candidate genes

Discriminative power analysis using ROC curve

In this work, we used two independent test datasets in order to validate the key candidate genes. One independent test dataset (GSE76427) was taken from the GEO database, and the other was taken from the TCGA database. These independent test datasets are described in Table 4. We validated the selected key candidate genes using the area under the curve (AUC), computed from the receiver operating characteristic (ROC) curve. In the ROC analysis, we first selected one gene and the class label, and then fitted a logistic regression with the leave-one-out CV protocol. We computed AUC values using the “pROC” R-package121. Moreover, we also compared the performance on the independent test datasets with that on one of our training datasets (GSE57957) in order to show the precision of the selected key candidate genes.
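For readers unfamiliar with how an AUC is obtained from per-sample scores, the sketch below computes it as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, counting ties as one half. This is a generic illustration in Python, not the pROC computation used in the study, and it omits the logistic regression and leave-one-out steps; the scores and labels are hypothetical.

```python
def auc_from_scores(scores, labels):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical predicted scores for 3 HCC (1) and 3 control (0) samples.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
auc = auc_from_scores(toy_scores, toy_labels)
print(round(auc, 3))
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.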

Survival analysis

In this work, we used the TCGA-LIHC dataset for survival analysis in order to show the prognostic status of the key candidate genes. We classified HCC patients into high-risk and low-risk groups on the basis of the median expression level of each key candidate gene. We performed survival analysis of our identified key candidate genes using the “Survfit” package in R language122. A p-value < 0.05 was considered statistically significant (“Supplementary information”).
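The median-based grouping and the survival curves it feeds can be sketched as follows. This Python example is a simplified illustration and not the R analysis used in the study: it implements a basic Kaplan-Meier estimator, assumes that patients strictly above the median form the high-expression group (the text does not state how values equal to the median are handled), and uses hypothetical follow-up data.

```python
from statistics import median

def median_split(expression):
    """Assign 'high' to samples above the median expression, 'low' otherwise."""
    cutoff = median(expression)
    return ["high" if v > cutoff else "low" for v in expression]

def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the death was observed and 0 if the patient was censored.
    """
    survival, curve = 1.0, []
    for t in sorted(set(ti for ti, e in zip(times, events) if e == 1)):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical expression values and follow-up (days) for four patients.
groups = median_split([1.0, 4.0, 2.0, 8.0])
curve = kaplan_meier(times=[5, 8, 12, 20], events=[1, 0, 1, 1])
print(groups, curve)
```

In practice one curve would be estimated per group and the two compared with a significance test, which is where the p-value < 0.05 criterion applies.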