Differentially expressed discriminative genes and significant meta-hub genes based key genes identification for hepatocellular carcinoma using statistical machine learning

Hasan, Md. Al Mehedi; Maniruzzaman, Md.; Shin, Jungpil

doi:10.1038/s41598-023-30851-1

Article
Open access
Published: 07 March 2023

Scientific Reports volume 13, Article number: 3771 (2023)

Abstract

Hepatocellular carcinoma (HCC) is the most common lethal malignancy of the liver worldwide. It is therefore important to identify the key genes in order to uncover the molecular mechanisms and to improve diagnostic and therapeutic options for HCC. This study aimed to apply a set of statistical and machine learning computational approaches to identify the key candidate genes for HCC. Three microarray datasets, downloaded from the Gene Expression Omnibus database, were used in this work. First, normalization and identification of differentially expressed genes (DEGs) were performed using limma for each dataset. Then, a support vector machine (SVM) was implemented to determine the differentially expressed discriminative genes (DEDGs) from the DEGs of each dataset, and the overlapping DEDGs among the three identified sets of DEDGs were selected. Enrichment analysis was performed on the common DEDGs using DAVID. A protein-protein interaction (PPI) network was constructed using STRING, and the central hub genes were identified based on the degree, maximum neighborhood component (MNC), maximal clique centrality (MCC), closeness centrality, and betweenness centrality criteria using cytoHubba. Simultaneously, significant modules were selected using MCODE scores, and their associated genes were identified from the PPI networks. Moreover, metadata were created by listing all hub genes from previous studies, and significant meta-hub genes, whose occurrence frequency among previous studies was greater than 3, were identified. Finally, six key candidate genes (TOP2A, CDC20, ASPM, PRC1, NUSAP1, and UBE2C) were determined by intersecting the central hub genes, hub module genes, and significant meta-hub genes. Two independent test datasets (GSE76427 and TCGA-LIHC) were utilized to validate these key candidate genes using the area under the curve. Moreover, the prognostic potential of these six key candidate genes was also evaluated on the TCGA-LIHC cohort using survival analysis.

Introduction

Hepatocellular carcinoma (HCC) is the 3rd leading cause of cancer deaths globally¹. HCC accounts for more than 80% of liver cancers worldwide², and its prevalence is higher in males than in females³. It usually occurs in people aged 30–50 years³. Different factors such as hepatitis B or hepatitis C^4,5, alcohol abuse, smoking, obesity, and type 2 diabetes (T2D) are significantly associated with HCC⁶. Among them, hepatitis B is one of the prominent risk factors for the development of HCC, responsible for 50% of cases⁷. Various treatment approaches, namely radiotherapy, chemotherapy, and targeted therapy, have been commonly used to improve the prognosis and reduce the recurrence of HCC. Nevertheless, the survival rate of HCC patients is still low⁸. The risk of cancer death therefore remains high, owing to the lack of genes for early detection and diagnosis as well as limited treatment facilities. It is thus essential to develop a system for identifying the key or core genes for early detection and better prognosis of HCC.

Recently, bioinformatics analysis has been widely utilized to determine the key prognostic genes or biomarkers as well as their associated molecular pathways for multiple cancers, including HCC^{8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58}. Zhou et al.³⁵ identified 15 prognostic biomarkers as well as their associated gene ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways using bioinformatics analysis. Chen et al.^39,59,60 also identified 11 potential biomarkers that can play crucial roles in the development and progression of HCC. Qiang et al.⁴⁰ proposed five core genes which were significantly associated with early diagnosis and poor prognosis of HBV-HCC. Wang et al.⁴¹ identified 36 hub DEGs and illustrated that 10 candidate genes out of the 36 have a significant effect on the tumorigenesis and progression of HCC. Among them, eight candidate genes were inversely related to the survival rate of HCC patients. Dai et al.⁶¹ proposed a prognostic model for predicting the prognosis of HCC patients. They identified 17 genes that were potentially associated with the prognosis of HCC patients. These 17 genes were used to build a prognostic model using the Cox hazard regression model, and its performance was validated using the TCGA and GSE14520 datasets. They showed that six genes were involved in the prognosis of HCC patients. Most researchers simply used hub genes derived from the PPI network to identify the key or core genes. One of the major challenges in studying genetic data is the identification of relevant biomarkers or genes. Recently, machine learning (ML)-based techniques have gained more attention as a way to address this problem^{59,60,62,63,64,65,66}. Although several studies have been carried out for the identification and development of potential candidate genes for HCC^{8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,67}, this remains a challenging issue, and there is still scope for more research on the identification of potential genes as well as on understanding the molecular mechanisms underlying the development, pathogenesis, and progression of HCC.

In this work, we used three microarray gene expression (MGE) datasets as training sets to determine the key or core candidate genes for HCC. First, we identified DEGs individually for each of the three datasets. Secondly, a support vector machine (SVM) with a radial basis function (RBF) kernel was applied to the identified DEGs from each of the three datasets, and the classification accuracy of each DEG was calculated. We selected the DEGs from each of the three datasets that provided a classification accuracy of more than 80.0%; these were called differentially expressed discriminative genes (DEDGs). The DEDGs shared among the three datasets were then identified as common DEDGs. Thirdly, DAVID was used to perform enrichment analysis on the common DEDGs. Fourthly, PPI networks were constructed using STRING and visualized using Cytoscape. Then the hub genes were identified using degree, maximum neighborhood component (MNC), maximal clique centrality (MCC), closeness, and betweenness in cytoHubba. After that, the central hub genes were determined as the hub genes shared among the degree, MNC, MCC, closeness, and betweenness rankings. Molecular Complex Detection (MCODE) was performed for cluster or module analysis to determine the significant modules as well as their associated genes. Moreover, the significant meta-hub genes were determined from meta-hub genes, which were extracted from existing studies. The key or core candidate genes, which can readily discriminate HCC patients from healthy controls, were determined as the genes shared among the central hub genes, potential module hub genes, and significant meta-hub genes. Furthermore, we used another two independent test datasets for validation as well as to show the discriminative power of the key candidate genes. We also performed a survival analysis of the identified key candidate genes for HCC patients. The overall flowchart of our proposed system to determine key candidate genes for HCC is presented in Fig. 1.

Results

Identification of DEGs from each dataset

We used limma to identify DEGs from each of the three GEO datasets (GSE36376, GSE39791, and GSE57957). Using the thresholds $|\log_2 \mathrm{FC}| > 1$ and adj.p-value < 0.01, we identified 699 DEGs (up-regulated: 431; down-regulated: 268), 428 DEGs (up-regulated: 88; down-regulated: 340), and 413 DEGs (up-regulated: 107; down-regulated: 306) between HCC and healthy controls from the GSE36376, GSE39791, and GSE57957 datasets, respectively. Their volcano plots and heatmaps are presented in Fig. 2.
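The thresholding step can be illustrated with a minimal Python sketch. It assumes that the per-gene log2 fold changes and adjusted p-values have already been computed (in the study this is done by limma, which is not reproduced here); the `GeneStat` record, the `select_degs` helper, and the toy rows are hypothetical and are not values from the study.

```python
from typing import NamedTuple


class GeneStat(NamedTuple):
    gene: str
    log2_fc: float  # log2 fold change, HCC vs. healthy control
    adj_p: float    # multiple-testing-adjusted p-value


def select_degs(stats, fc_cutoff=1.0, p_cutoff=0.01):
    """Split genes passing both cutoffs into up- and down-regulated lists."""
    up, down = [], []
    for s in stats:
        if s.adj_p < p_cutoff and abs(s.log2_fc) > fc_cutoff:
            (up if s.log2_fc > 0 else down).append(s.gene)
    return up, down


# Hypothetical toy rows, used only to exercise the filter.
toy = [
    GeneStat("GENE_A", 2.3, 0.0004),
    GeneStat("GENE_B", -1.7, 0.002),
    GeneStat("GENE_C", 0.6, 0.0001),  # fails the fold-change cutoff
    GeneStat("GENE_D", -2.1, 0.03),   # fails the adjusted p-value cutoff
]
up, down = select_degs(toy)
print("up-regulated:", up)
print("down-regulated:", down)
```

The sign of the fold change decides whether a passing gene is counted as up- or down-regulated, which mirrors the up/down split reported for each dataset.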

Identification of common DEDGs using SVM

SVM with an RBF kernel was applied to the identified DEGs (699 DEGs for GSE36376; 428 DEGs for GSE39791; and 413 DEGs for GSE57957) of each dataset in order to identify the DEDGs of HCC patients. The classification accuracy was then computed per gene for the DEGs from each dataset. The calculation procedure is discussed in the methodology section. The classification accuracies of all DEGs for the individual datasets were sorted in descending order and are presented in Fig. 3. As shown in Fig. 3, a total of 502 DEGs from GSE36376, 169 from GSE39791, and 242 from GSE57957 were selected as DEDGs because their classification accuracy was more than or equal to 80.0%. Furthermore, 75 common DEDGs were determined among the identified DEDGs from the GSE36376, GSE39791, and GSE57957 datasets, as shown in Fig. 4.
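The selection logic described above (keep genes whose single-gene accuracy reaches 80.0%, then intersect across datasets) can be sketched as follows. The SVM training itself is not shown; the sketch assumes the per-gene accuracies are already available as dictionaries, and the function names and toy accuracies are hypothetical.

```python
def select_dedgs(accuracy_by_gene, threshold=0.80):
    """Keep genes whose single-gene classification accuracy meets the threshold."""
    return {gene for gene, acc in accuracy_by_gene.items() if acc >= threshold}


def common_dedgs(*per_dataset_accuracy, threshold=0.80):
    """Intersect the DEDG sets obtained from each dataset."""
    dedg_sets = [select_dedgs(acc, threshold) for acc in per_dataset_accuracy]
    return set.intersection(*dedg_sets)


# Hypothetical per-gene accuracies for three datasets (not study values).
ds1 = {"G1": 0.93, "G2": 0.85, "G3": 0.71, "G4": 0.80}
ds2 = {"G1": 0.88, "G2": 0.79, "G4": 0.91}
ds3 = {"G1": 0.90, "G2": 0.86, "G4": 0.84, "G5": 0.95}

shared = common_dedgs(ds1, ds2, ds3)
print(sorted(shared))
```

A gene such as `G2`, which passes in two datasets but not the third, is dropped, which is why the common set (75 genes in the study) is much smaller than any per-dataset DEDG set.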

Enrichment analysis of common DEDGs

Enrichment analysis was conducted on the 75 shared or overlapping DEDGs to clearly grasp the mechanism and development of HCC. The functional characteristics of the DEDGs were explored using GO and KEGG pathway analysis. The GO analysis was partitioned into three groups: biological process (BP), cellular component (CC), and molecular function (MF). Using a p-value threshold of $p < 0.05$, we identified the significant GO terms and KEGG pathways, and chose the top five GO terms in each group and the top five KEGG pathways. The top five GO terms for BP, CC, and MF are presented in Table 1.

Table 1 GO analysis of common DEDGs in terms of BP, CC, and MF. Top 5 items were selected.

For BP-based GO terms, the common DEDGs were strongly enriched with retinol metabolic process, cellular response to cadmium ion, retinoid metabolic process, cellular response to copper ion, and steroid catabolic process. Moreover, the extracellular region, extracellular exosome, extracellular space, high-density lipoprotein particle, and apical plasma membrane were found to be the top CC terms, which were significantly enriched with common DEDGs. As shown in Table 1, MF group GO terms, including retinol dehydrogenase activity; oxidoreductase activity; androsterone dehydrogenase activity; androstan-3-alpha,17-beta-diol dehydrogenase activity; and steroid dehydrogenase activity, were mainly enriched with common DEDGs.

The KEGG pathway analysis of the common DEDGs is displayed in Table 2. As shown in Table 2, the common DEDGs were significantly associated with multiple pathways such as retinol metabolism, metabolic pathways, tryptophan metabolism, steroid hormone biosynthesis, and drug metabolism-cytochrome P450.

Table 2 KEGG pathway analysis of common DEDGs. Top five items were selected.

PPI network construction and central hub genes identification

STRING was utilized to build a PPI network to show the significant connections between proteins encoded by common DEDGs. Cytoscape was used to show the PPI network, which had 51 nodes and 144 edges (see Fig. 5a). Five hub gene-based identification algorithms, including the degree of connectivity, MNC, MCC, closeness, and betweenness in the Cytoscape plug-in cytoHubba, were implemented to determine the hub genes from PPI networks. Then we chose the top 30 hub genes from each algorithm. We made a Venn diagram among the five algorithms, which is shown in Fig. 5b. As shown in Fig. 5b, eight overlapping central hub genes were identified among these algorithms. These eight central hub genes were NUSAP1, TOP2A, CDC20, PRC1, UBE2C, ASPM, PNPLA7, and MT1E, which were utilized to determine the key or core genes for HCC.
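The idea of ranking nodes by several centrality measures and keeping only the genes that appear in every top-k list can be sketched in plain Python. This is a simplified illustration, not cytoHubba: only two of the five measures are implemented (degree and a harmonic-style closeness, whose exact definition here is an assumption), and the edge list is a hypothetical toy network rather than the 51-node PPI network.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Undirected adjacency sets from a list of (node, node) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def degree_scores(adj):
    return {node: len(neighbours) for node, neighbours in adj.items()}


def closeness_scores(adj):
    """Sum of reciprocal shortest-path distances to every reachable node."""
    scores = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        scores[src] = sum(1.0 / d for d in dist.values() if d > 0)
    return scores


def top_k(scores, k):
    """Top-k nodes by score; ties are broken alphabetically."""
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    return set(ranked[:k])


# Hypothetical toy network, not the STRING network from the study.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E"), ("E", "F")]
adj = build_graph(edges)
central = top_k(degree_scores(adj), 3) & top_k(closeness_scores(adj), 3)
print(sorted(central))
```

With all five rankings available, the same set intersection would be taken over five top-30 lists instead of two top-3 lists.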

Hub modules and their associated genes identification

Module or cluster analysis was performed using MCODE to determine the prominent modules. Three clusters or modules were generated using MCODE, with MCODE scores ranging from 3 to 6. We chose the prominent modules that had an MCODE score of $\ge 5$ and a number of nodes $\ge 5$. Finally, we chose module 1 as the prominent hub module; it contained 6 nodes and 30 edges, with the highest MCODE score of 6, and its PPI network is displayed in Fig. 6. The corresponding six genes were treated as hub module genes.

Identification of significant meta-hub genes from metadata

We reviewed 52 existing studies related to gene identification of HCC patients^{8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58}. We listed their hub genes in order to build the metadata, which are presented in Table 3. To build the metadata, we extracted 10 hub genes from Maddah et al.⁹, 5 hub genes from Yan et al.¹⁰, 20 from Zhao et al.¹¹, 7 from Zhao et al.¹², 10 from Liu et al.¹³, 11 from Meng et al.¹⁴, 42 from Rosli et al.¹⁵, 5 from Zhang et al.⁸, 5 from Li et al.¹⁶, 8 from Li et al.¹⁷, 5 from Tian et al.¹⁸, 12 from Wan et al.¹⁹, 10 from Zhu et al.²⁰, 10 from Wang et al.²¹, 9 from Zhou et al.²², 10 from Zhang et al.²³, 18 from Mou et al.²⁴, 8 from Wu et al.²⁵, 9 from Gui et al.²⁶, 10 from Wang et al.²⁷, 28 from Lu and Zhu²⁸, 6 from Bhatt et al.²⁹, 10 from Zhang et al.³⁰, 13 from Jiang et al.³¹, 20 from Zhang et al.³², 12 from Wu et al.³³, 5 from Nguyen et al.³⁴, 15 from Zhou et al.³⁵, 6 from Yu et al.³⁶, 10 from Kakar et al.³⁷, 10 from Ji et al.³⁸, 11 from Chen et al.³⁹, 10 from Qiang et al.⁴⁰, 10 from Wang et al.⁴¹, 10 from Zhang et al.⁴², 14 from Kim et al.⁴³, 10 from Zhang et al.⁴⁴, 14 from Sha et al.⁴⁵, 10 from Chen et al.⁴⁶, 4 from He et al.⁴⁷, 10 from Zhang et al.⁴⁸, 4 from Hu et al.⁴⁹, 9 from Zhang et al.⁵⁰, 15 from Li et al.⁵¹, 5 from Cao et al.⁵², 7 from Yang et al.⁵³, 5 from Wang et al.⁵⁴, 9 from Jiang et al.⁵⁵, 16 from Li et al.⁵⁶, 15 from Xing et al.⁵⁷, 10 from Zhu W et al.⁵⁸, and 20 from Dai et al.⁶¹. We then took the union of the extracted hub genes and obtained 214 hub genes as meta-hub genes. At the same time, we computed the frequency of each meta-hub gene, defined as the number of studies that reported that gene as a hub gene, and selected the 52 significant meta-hub genes whose frequency was more than 3. These 52 significant meta-hub genes were utilized for the determination of key genes.

Table 3 Formation of metadata by listing hub genes from existing studies.
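The union-and-frequency step used to build the metadata can be sketched with a counter over per-study hub gene lists. The function name, the placeholder gene names, and the five toy studies below are hypothetical; the real input would be the hub gene lists summarized in Table 3.

```python
from collections import Counter


def significant_meta_hubs(hub_lists, min_exclusive=3):
    """Return (all meta-hub genes, genes reported by more than
    `min_exclusive` studies, per-gene study counts)."""
    counts = Counter()
    for genes in hub_lists:
        counts.update(set(genes))  # each study counts a gene at most once
    meta = set(counts)
    significant = {g for g, c in counts.items() if c > min_exclusive}
    return meta, significant, counts


# Hypothetical hub gene lists from five studies (placeholder names).
studies = [
    ["G1", "G2", "G3"],
    ["G1", "G2"],
    ["G1", "G2", "G4"],
    ["G1", "G3"],
    ["G2", "G3"],
]
meta, significant, counts = significant_meta_hubs(studies)
print(len(meta), sorted(significant))
```

Here `G3` is reported by exactly three studies and is therefore excluded, because the criterion is a frequency strictly greater than 3.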

Key candidate genes identification

Eight central hub genes were identified from five methods (degree of connectivity, MNC, MCC, closeness, and betweenness), 6 hub module genes from potential hub modules, and 52 significant meta-hub genes from meta-hub genes. Six overlapping genes were identified among these three gene sets using a Venn diagram, which is presented in Fig. 7. These six genes (TOP2A, CDC20, ASPM, PRC1, UBE2C, and NUSAP1) were considered key genes, which can readily classify subjects as HCC or healthy.
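The three-way intersection can be written directly with Python sets. The eight central hub genes are those named in the text. The six hub module genes are not listed individually in this section; because the reported intersection has six members and the module contains six genes, the sketch assumes they coincide with the six key genes. Only the six meta-hub genes implied by the reported result are included; the remaining 46 are omitted rather than guessed.

```python
# Central hub genes as listed in the text.
central_hub = {"NUSAP1", "TOP2A", "CDC20", "PRC1", "UBE2C", "ASPM", "PNPLA7", "MT1E"}

# Assumption: inferred from the six-gene intersection, not listed explicitly.
hub_module = {"TOP2A", "CDC20", "ASPM", "PRC1", "UBE2C", "NUSAP1"}

# Partial set: only the members implied by the reported result (46 omitted).
significant_meta_hub = {"TOP2A", "CDC20", "ASPM", "PRC1", "UBE2C", "NUSAP1"}

key_genes = central_hub & hub_module & significant_meta_hub
dropped = central_hub - key_genes  # central hub genes that did not survive

print(sorted(key_genes))
print(sorted(dropped))
```

The two central hub genes that drop out, PNPLA7 and MT1E, are exactly those absent from the hub module.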

Validation of key candidate genes

Discriminative power analysis using ROC curve

Six key or core genes (TOP2A, CDC20, ASPM, PRC1, UBE2C, and NUSAP1) were validated using the AUC, computed from ROC curves. We compared the performance on two independent test datasets (GSE76427 and TCGA-LIHC) with that on one of our training datasets (GSE57957) in order to show the precision of the selected key candidate genes. The ROC curves of the six key genes as well as their heatmaps for both the training and independent test datasets are illustrated in Fig. 8.

The ROC curves of the six key candidate genes with their AUC values for the training dataset (GSE57957) are displayed in Fig. 8a: TOP2A (AUC: 0.936, 95% CI 0.871–1.000), CDC20 (AUC: 0.917, 95% CI 0.838–0.996), ASPM (AUC: 0.919, 95% CI 0.851–0.987), PRC1 (AUC: 0.938, 95% CI 0.871–1.000), UBE2C (AUC: 0.803, 95% CI 0.703–0.904), and NUSAP1 (AUC: 0.930, 95% CI 0.895–1.000). As displayed in Fig. 8c, the AUC values of the six key or core genes for the GSE76427 dataset were all above approximately 0.780: TOP2A (AUC: 0.900, 95% CI 0.851–0.949), CDC20 (AUC: 0.887, 95% CI 0.883–0.941), ASPM (AUC: 0.893, 95% CI 0.844–0.942), PRC1 (AUC: 0.931, 95% CI 0.889–0.975), UBE2C (AUC: 0.792, 95% CI 0.723–0.863), and NUSAP1 (AUC: 0.881, 95% CI 0.831–0.933).

Similarly, the ROC curves of the six key candidate genes with their AUC values for the TCGA-LIHC independent test dataset are presented in Fig. 8e. As presented in Fig. 8e, all six key candidate genes provided AUC values of more than 0.900, and their individual AUC values were as follows: TOP2A (AUC: 0.961, 95% CI 0.939–0.984), CDC20 (AUC: 0.968, 95% CI 0.949–0.986), ASPM (AUC: 0.960, 95% CI 0.938–0.983), PRC1 (AUC: 0.967, 95% CI 0.948–0.987), UBE2C (AUC: 0.965, 95% CI 0.946–0.985), and NUSAP1 (AUC: 0.919, 95% CI 0.889–0.949). Therefore, these six key genes (TOP2A, CDC20, ASPM, PRC1, UBE2C, and NUSAP1) showed strong discriminative power to distinguish HCC patients from healthy controls. These validations support our findings and make them more robust.
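For a single gene, the AUC can be read as the probability that a randomly chosen HCC sample has a higher expression value than a randomly chosen control sample. The sketch below computes this empirical AUC by pairwise comparison; it is an illustration under that interpretation and is not necessarily the routine used in the study. The confidence intervals are not reproduced, and the expression values are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """Empirical AUC: fraction of (positive, negative) pairs in which the
    positive sample scores higher; ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical expression values of one gene (not study data).
tumor = [8.1, 7.4, 9.0, 6.2]
normal = [5.0, 6.5, 4.8, 6.2]
print(auc(tumor, normal))
```

For a gene that is down-regulated in HCC, the same function applied with the two groups swapped gives the corresponding AUC.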

Survival analysis

In this work, we performed survival analysis of the six key candidate genes (TOP2A, CDC20, ASPM, PRC1, NUSAP1, and UBE2C) using univariate Cox regression in R, and the results are presented in Fig. 9. As shown in Fig. 9, the six key candidate genes identified for HCC patients (TOP2A, CDC20, ASPM, PRC1, NUSAP1, and UBE2C) were strongly associated with the survival status of HCC patients ($p < 0.05$). Patients with over-expression of TOP2A, CDC20, ASPM, PRC1, NUSAP1, and UBE2C had shorter survival than patients with lower expression levels of these key candidate genes.
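The study reports univariate Cox regression in R, which is not reproduced here. As a simpler, different technique, the sketch below uses the Kaplan-Meier estimator to illustrate how survival can be compared between a high-expression and a low-expression group. The follow-up times, event indicators, and grouping are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate.

    times  : follow-up time of each patient
    events : 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed  # deaths and censored patients leave the risk set
    return curve


# Hypothetical follow-up data (months) for two expression groups.
high_curve = kaplan_meier([5, 8, 12, 20], [1, 1, 0, 1])
low_curve = kaplan_meier([10, 15, 25, 30], [1, 0, 0, 1])
print(high_curve)
print(low_curve)
```

A survival curve that falls faster in the high-expression group corresponds to the pattern described above, in which over-expression is associated with shorter survival.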

Discussion

In this work, we assessed three datasets, namely GSE36376, GSE39791, and GSE57957, to detect the DEGs for HCC patients. We determined 699, 428, and 413 DEGs using “limma” from the GSE36376, GSE39791, and GSE57957 datasets, respectively, as illustrated in Fig. 2. Moreover, we implemented SVM to determine the DEDGs from the individual datasets (see Fig. 3) and selected the 75 overlapping or shared DEDGs among the identified DEDGs from the GSE36376, GSE39791, and GSE57957 datasets, as shown in Fig. 4. At the same time, enrichment analysis was performed on the overlapping or shared DEDGs to better understand their molecular mechanisms (see Table 1). We found that the potential BP functional categories were strongly related to the development and progression of HCC. Retinol and retinoid metabolic processes have been linked to a variety of liver diseases, including fatty liver disease, which leads to HCC^68,69. The rest of the BP categories enriched with common DEDGs also coincided with existing studies, such as cellular response to cadmium ion^42,57,70, cellular response to copper ion^36,70, and steroid catabolic process⁴².

The top five CC GO terms were significantly enriched with common DEDGs, which was also consistent with previous results, such as extracellular region^35,37,38,57, extracellular exosome^37,38, extracellular space^37,38,57, high-density lipoprotein particle⁵⁷, and apical plasma membrane⁵³. In the case of MFs, common DEDGs were also enriched with the top five GO terms. Existing studies supported these enriched functional categories, including retinol dehydrogenase activity¹⁴ and oxidoreductase activity^37,38,42. We also analyzed KEGG pathways and chose five pathways that were closely related to our overlapping DEDGs (see Table 2). Different existing studies supported our findings, such as retinol metabolism^{35,37,38,40,43,70}, metabolic pathways^37,38, tryptophan metabolism^38,42,70, steroid hormone biosynthesis^42,70, and drug metabolism-cytochrome P450^35,42,70.

A PPI network was built with the shared DEDGs using Cytoscape (see Fig. 5a), and then eight central hub genes (NUSAP1, TOP2A, CDC20, PRC1, UBE2C, ASPM, PNPLA7, and MT1E) were identified from five hub gene selection methods, as presented in Fig. 5b. The potential modules were identified using MCODE scores, and module 1 was selected because it had the highest MCODE score. We selected six hub module genes from module 1 and constructed their PPI network (see Fig. 6). In addition, we examined 52 papers and took the hub genes from earlier studies^{8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58} in order to build the metadata, which are presented in Table 3. We listed 214 meta-hub genes by taking the union of the extracted hub genes. We selected 52 significant meta-hub genes from the list of meta-hub genes whose frequency was greater than 3. Finally, we identified the six shared genes (TOP2A, CDC20, ASPM, PRC1, UBE2C, and NUSAP1) by intersecting the central hub genes, hub module genes, and significant meta-hub genes extracted from the earlier studies; these are referred to as key relevant or candidate genes and are depicted in Fig. 7. We validated these key relevant or candidate genes using the AUC for one training and two independent test datasets (see Fig. 8). We observed that these six key relevant or candidate genes had high discriminative power for the differentiation of HCC patients.

TOP2A is a cell cycle-related gene that encodes a DNA topoisomerase, which controls and alters the topologic states of DNA during transcription. TOP2A overexpression has been identified as a core or potential biomarker for ovarian cancers⁷¹, glioma⁷², and lung cancers⁷³. Previous studies showed that TOP2A overexpression in HCC patients was significantly correlated with progression and poor prognosis^74,75. In our study, TOP2A was also considered a key or core gene for the progression and development of HCC. This finding coincided with previous studies^{12,14,15,18,20,21,22,23,24,25,27,32,33,34,35,36,39,41,42,43,45,46,48,50,51,54,55,56,57,58,61}.

CDC20 is a vital regulator of cell division in humans^76,77. Overexpression or high expression of CDC20 has also been linked to lung cancer⁷⁸, colorectal cancer⁷⁹, breast cancer^80,81, and other cancers. Moreover, CDC20 was strongly correlated with poor prognosis in gastric cancer⁸², bladder cancer⁸³, and breast cancer⁸⁴. A study revealed that CDC20 over-expression was significantly associated with HCC⁸⁵. Another recent study demonstrated a strong relationship between CDC20 overexpression and the prognosis of HCC⁸⁶. Our findings also showed that CDC20 was a potential key biomarker that played a crucial role in the development and progression of HCC. Different existing studies also supported our findings^{10,13,14,17,18,20,24,30,32,37,40,41,44,48,50,55,56,57}.

ASPM is a protein that has a major influence on the development of HCC. The ASPM gene is located on chromosome 1, band 1q31, and consists of 28 exons encoding a 3477-amino-acid protein⁸⁷. Many studies have identified ASPM as a hub gene or key biomarker for multiple cancers^88,89,90. Zhang et al.⁹⁰ reported that ASPM can be a promising therapeutic target for liver. Moreover, ASPM overexpression was strongly correlated with bladder cancer and considered a promising predictor⁹¹. Our findings also illustrated that ASPM was a novel key biomarker for HCC, which was supported by the existing studies^{9,22,35,38,39,41,42,43,45,46,47,48,58}.

PRC1 is an essential protein that regulates cytokinesis⁹². A higher expression level of PRC1 was found in HCC patients than in healthy controls. The overexpression of PRC1 was associated with a poor prognosis for HCC patients⁹³. Our work also indicated that PRC1 was a promising or key biomarker for the development of HCC, which coincided with previous studies^{15,22,25,33,35,39,42,43,45,46,56,57,61}.

Similarly, we proposed UBE2C as a key or core predictor for the development of HCC, which was supported by various existing studies^{10,18,33,36,41,44,58}. Xiong et al.⁹⁴ suggested UBE2C as a potential biomarker or gene for HCC. Higher expression of UBE2C was also found in HCC patients than in healthy subjects⁹⁵. UBE2C plays a crucial role not only in HCC but also in a variety of other cancers, such as lung cancer and gastric cancer^96,97.

NUSAP1 is a nucleolar-spindle-associated protein that has a vital role in spindle microtubule organization⁹⁸. Overexpression of NUSAP1 was found in a variety of malignancies, including HCC^58,99, colon cancer^100,101, prostate cancer^102,103, and cervical carcinoma¹⁰⁴. Moreover, overexpression of NUSAP1 was strongly linked with poor prognosis of prostate cancer¹⁰³ and colon cancer¹⁰¹. Another study revealed that NUSAP1 is related to HCC¹⁰⁵. Roy et al.¹⁰⁵ illustrated that NUSAP1 expression might rise in HCC samples with low expression levels of miRNA 193a-5p, and that this overexpression was strongly associated with a shorter patient survival time. Our findings also illustrated that NUSAP1 was one of the key candidate genes, with higher expression levels found in HCC subjects than in healthy subjects. These findings were consistent with existing studies^{15,22,46,56,58,61}.

Moreover, two independent test datasets were also used to validate these six key candidate genes using the AUC. A survival analysis of these six candidate genes was also performed for HCC patients. In both cases, our six identified key candidate genes (TOP2A, CDC20, ASPM, PRC1, UBE2C, and NUSAP1) showed a significant association with the development and progression of HCC. This finding will provide evidence and new insight to physicians and readers for the diagnosis of HCC as well as for the pathways correlated with HCC.

Materials and methods

Data acquisition and preprocessing

In this work, three publicly available microarray gene expression datasets with GEO accession numbers GSE36376⁶⁶, GSE39791¹⁰⁶, and GSE57957¹⁰⁷, all on the GPL10558 platform [Illumina HumanHT-12 V4.0 expression bead chip], were used to determine the key candidate genes. Another two independent test datasets were used to validate the key candidate genes. One independent dataset was taken from the GEO database with accession number GSE76427 on the GPL10558 platform¹⁰², and the other independent test dataset was taken from The Cancer Genome Atlas (TCGA) database. The microarray gene expression datasets were downloaded from the GEO database (www.ncbi.nlm.nih.gov/geo/), and the TCGA-liver hepatocellular carcinoma (TCGA-LIHC) dataset was downloaded from the TCGA database (https://portal.gdc.cancer.gov/). The datasets underwent a log2 transformation and quantile normalization. Although these datasets were taken from publicly available repositories, being human data, all methods were performed in accordance with the relevant guidelines and regulations. Table 4 presents a summary of the utilized datasets.

Table 4 Summary of utilized HCC datasets.
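The two preprocessing steps named above, log2 transformation and quantile normalization, can be sketched in plain Python for a small genes-by-samples matrix. This is an illustrative sketch, not the study's pipeline: the pseudocount of 1 is an assumption, ties within a sample are ranked in their original order rather than averaged, and the toy matrix is hypothetical.

```python
import math


def log2_transform(matrix, pseudocount=1.0):
    """Element-wise log2(value + pseudocount); the pseudocount is an assumption."""
    return [[math.log2(v + pseudocount) for v in row] for row in matrix]


def quantile_normalize(matrix):
    """Make every sample (column) share the same value distribution.

    matrix[g][s] is the expression of gene g in sample s.
    """
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the r-th smallest value across all samples.
    rank_means = [
        sum(col[r] for col in sorted_cols) / n_samples for r in range(n_genes)
    ]
    out = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            out[g][s] = rank_means[rank]  # replace value by its rank mean
    return out


# Hypothetical 3-gene x 2-sample matrix.
raw = [[2.0, 4.0], [6.0, 8.0], [4.0, 12.0]]
normalized = quantile_normalize(raw)
print(normalized)
print(log2_transform([[1.0, 3.0]]))
```

After normalization, each column contains the same set of values (here 3, 6, and 9), assigned according to each gene's rank within its sample.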
