A comprehensive bioinformatics analysis on multiple Gene Expression Omnibus datasets of nonalcoholic fatty liver disease and nonalcoholic steatohepatitis

Fatty liver disease is one of the leading causes of chronic liver damage in western countries. Approximately 25% of adults in the United States have fatty livers in the absence of excessive alcohol consumption, a condition termed nonalcoholic fatty liver disease (NAFLD). Little is known about the prevalence and genetic background of NAFLD or the factors that determine its development. In this study, we used the Gene-Cloud of Biotechnology Information bioinformatics platform to carry out a comprehensive bioinformatics analysis identifying differentially expressed genes (DEGs), key biological processes and intersecting pathways. We imported 3 Gene Expression Omnibus datasets (GSE66676, GSE49541, and GSE83452). Then, we assessed the expression of the DEGs in clinical samples. We found that CD24 was the only gene co-expressed in all 3 datasets. "Glycolysis/gluconeogenesis", "p53 signaling pathway" and "glycine, serine and threonine metabolism" were 3 common pathways related to the fatty liver process. In NAFLD tissues, CD24, COL1A1, LUM, THBS2 and EPHA3 were upregulated, and PZP was downregulated. CD24 is a core gene among these DEGs, and its impact on NAFLD has not yet been studied. The co-expressed genes, common biological processes and intersecting pathways identified in this study might play an important role in NAFLD progression. Further studies are needed to elucidate the mechanisms of these potential genes and pathways in NAFLD.

Gene-Cloud of Biotechnology Information (GCBI). GCBI (Shanghai, China, https://www.gcbi.com.cn) is an online comprehensive bioinformatics analysis platform that combines a variety of research findings, genetic information, sample information, data algorithms and bioinformatics to create a "gene knowledge base," which incorporates GEO datasets. The GCBI platform can systematically analyze GEO dataset-derived gene expression information 13, including more than 120 million copies of genomic samples. In the present study, GCBI was used to identify DEGs between NAFLD/NASH liver tissues and normal liver tissues. In the Differential Gene Expression Analysis module on the GCBI platform, we identified DEGs with a fold expression change >5 at cutoff values of Q < 0.05 and P < 0.05. Venn diagrams comparing the top 100 DEGs from the 3 cohorts were drawn with Venny (http://bioinfogp.cnb.csic.es/tools/venny/index.html). Based on the DEGs, we further performed gene ontology (GO) analysis of biological functions and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis of pathways. The top 20 biological functions and pathways are presented. Furthermore, a pathway relation network module was used to identify the core networks and pathway connections. Then, we applied the Gene Co-expression Network module on the GCBI platform to build gene co-expression networks for the DEGs and determine the core genes in the networks.
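The thresholding step described above can be illustrated with a short sketch. This is not the GCBI implementation, which is not described in the text; the `GeneResult` record, the `select_degs` function, the symmetric treatment of up- and downregulation, and all numeric values below are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneResult:
    """One row of a differential-expression table (hypothetical layout)."""
    symbol: str
    fold_change: float  # disease / control ratio; values below 1 mean downregulation
    p_value: float
    q_value: float


def magnitude(result):
    """Size of the change regardless of direction (assumption: 0.1-fold counts as 10-fold)."""
    if result.fold_change >= 1:
        return result.fold_change
    return 1.0 / result.fold_change


def select_degs(results, min_fold=5.0, max_p=0.05, max_q=0.05, top_n=100):
    """Keep genes with fold change > min_fold, P < max_p and Q < max_q, largest change first."""
    passing = [
        r for r in results
        if r.fold_change > 0
        and magnitude(r) > min_fold
        and r.p_value < max_p
        and r.q_value < max_q
    ]
    passing.sort(key=magnitude, reverse=True)
    return passing[:top_n]


# Toy values only; these are not results from any of the GEO datasets.
toy_results = [
    GeneResult("GENE_A", 8.0, 0.001, 0.01),   # passes (8-fold up)
    GeneResult("GENE_B", 0.1, 0.001, 0.01),   # passes (10-fold down)
    GeneResult("GENE_C", 3.0, 0.001, 0.01),   # fails the fold-change cutoff
    GeneResult("GENE_D", 6.0, 0.20, 0.30),    # fails the P and Q cutoffs
]
degs = select_degs(toy_results)
deg_symbols = [r.symbol for r in degs]
print(deg_symbols)
```

The ranked list returned for each cohort is the kind of input that a Venn comparison of the top 100 DEGs would take.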
Tissue specimens, RNA extraction and qRT-PCR analysis. Fifteen healthy liver tissues and 10 fatty liver tissues from liver donors were included in our study to validate the expression levels of co-expressed DEGs. Prior patient consent and ethical approval from the ethics committee of the First Affiliated Hospital, Sun Yat-sen University were obtained. All methods were performed in accordance with the ethics guidelines and regulations. We selected 8 co-expressed DEGs (CD24, PZP, COL1A1, COL1A2, LUM, VCAN, THBS2 and EPHA3) for validation. All tissues were histologically diagnosed. Total RNA from the tissue specimens was isolated using TRIzol reagent (Invitrogen, Carlsbad, California, USA), and qRT-PCR was performed with SYBR® Green dye (TaKaRa, Shiga, Japan), following the manufacturer's instructions. The primer sequences are provided in Supplementary Table S1. β-tubulin was used as a reference gene.
Ethics statement. The research protocol was reviewed and approved by the Research Ethics Committee of the First Affiliated Hospital, Sun Yat-sen University. All experiments were conducted in accordance with approved guidelines of the First Affiliated Hospital, Sun Yat-sen University.
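The text does not state how relative expression was calculated from the qRT-PCR data. One common choice with a single reference gene is the 2^-ΔΔCt method, which assumes roughly 100% amplification efficiency for both target and reference. The sketch below illustrates that method only; the function names and all Ct values are hypothetical.

```python
from statistics import mean


def delta_ct(ct_target, ct_reference):
    """Normalize the target-gene Ct to the reference gene (e.g. beta-tubulin)."""
    return ct_target - ct_reference


def relative_expression(ct_target, ct_reference, control_mean_delta_ct):
    """Fold change versus the control group by 2^-ddCt (assumed method, not stated in the text)."""
    dd_ct = delta_ct(ct_target, ct_reference) - control_mean_delta_ct
    return 2.0 ** (-dd_ct)


# Hypothetical (target Ct, reference Ct) pairs; not measured data.
control_samples = [(28.0, 20.0), (27.0, 19.0)]
fatty_sample = (26.0, 20.0)

control_mean = mean(delta_ct(t, r) for t, r in control_samples)
fold = relative_expression(fatty_sample[0], fatty_sample[1], control_mean)
print(fold)
```

A value above 1 indicates higher expression than the control mean, and a value below 1 indicates lower expression.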

Ethical approval. All procedures performed in studies involving human participants were in accordance with the ethical standards of the ethics committee of the First Affiliated Hospital, Sun Yat-sen University, and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. No tissues were procured from prisoners. All livers were procured at the Organ Transplant Center, The First Affiliated Hospital, Sun Yat-sen University.
Ethical standards. Informed consent was obtained from all individual participants included in the study.
Statistical analysis. Data are presented as the mean ± SD for continuous variables. Student's t-test and analysis of variance were used to evaluate significant differences in demographic data. All P values were two-sided, and P < 0.05 was defined as statistically significant. Analyses were carried out with the Statistical Package for the Social Sciences (SPSS) 22.0 (IBM, USA).
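For readers unfamiliar with the test, the following sketch shows how the statistic of an unpaired Student's t-test with pooled variance is computed. The analyses in this study were run in SPSS; this standard-library illustration uses made-up numbers and a hypothetical function name, and it stops at the t statistic and degrees of freedom rather than the P value.

```python
from math import sqrt
from statistics import mean, variance


def student_t(sample_a, sample_b):
    """Return (t, degrees of freedom) for an unpaired t-test assuming equal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    standard_error = sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df


# Made-up relative expression values; not data from this study.
group_a = [2.0, 4.0, 6.0]
group_b = [1.0, 2.0, 3.0]
t_stat, dof = student_t(group_a, group_b)
print(round(t_stat, 3), dof)
```

The two-sided P value is then read from the t distribution with the returned degrees of freedom; `scipy.stats.ttest_ind` with its default `equal_var=True` performs the same test and reports the P value directly.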

Results
Study design. The flow diagram of our study design is shown in Fig. 1. Our initial aim was to identify core genes in the development of fatty liver. We used 3 GEO datasets (GSE66676, GSE49541, and GSE83452) in the GCBI bioinformatics analysis platform. We extracted gene expression data of NAFLD/NASH and normal liver tissues to identify DEGs between the two histological diagnosis types in these 3 cohorts. Co-expressed DEGs were identified based on these DEGs. Biological function and KEGG pathway analyses were then performed. Finally, we verified expression of the core co-expressed genes in clinical samples to confirm the results.
Major characteristics of samples in 3 datasets. GEO datasets GSE66676 (Cohort 1), GSE49541 (Cohort 2), and GSE83452 (Cohort 3) were included in our study. All 3 datasets were available in the GCBI bioinformatics analysis platform. GSE66676 contained 33 NAFLD or NASH tissues and 34 normal liver tissues. GSE49541 contained 32 advanced NAFLD tissues and 40 mild NAFLD tissues. GSE83452 included 126 NASH tissues and 98 normal liver tissues.
DEGs between NAFLD/NASH and normal liver tissues. We identified 8503, 1538, and 94 potential DEGs in GSE66676, GSE49541, and GSE83452, respectively (Fig. 2a-c). The top 10 DEGs from the 3 cohorts are shown in Tables 1-3. After removing duplicate genes and expression values lacking specific gene symbols, we used the top 100 DEGs from GSE66676 and GSE49541 and 93 DEGs from GSE83452 to create a Venn diagram. The intersection of these 3 datasets in Fig. 3 shows that CD24 was the only co-expressed DEG found in all 3 cohorts. Twelve genes were co-expressed in Cohort 1 and  investigation showed that the regulation of MT1M, SLC1A2, CYP3A43 and VIL1 was not consistent. Therefore, 13 genes were co-expressed in Cohort 1 and Cohort 3.
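The Venn comparison and the check on direction of regulation can be sketched as a set intersection followed by a consistency filter. The study used Venny for the Venn diagram; the function name, data layout, and placeholder gene symbols below are illustrative assumptions, not the authors' procedure or results.

```python
def shared_degs(cohorts):
    """Split genes present in every cohort by whether their regulation direction agrees.

    cohorts maps a cohort name to a dict of {gene symbol: "up" or "down"}.
    Returns (consistent, inconsistent) as two sets of gene symbols.
    """
    symbol_sets = [set(degs) for degs in cohorts.values()]
    common = set.intersection(*symbol_sets)
    consistent, inconsistent = set(), set()
    for symbol in common:
        directions = {degs[symbol] for degs in cohorts.values()}
        if len(directions) == 1:
            consistent.add(symbol)
        else:
            inconsistent.add(symbol)
    return consistent, inconsistent


# Placeholder symbols and directions; not taken from the GEO datasets.
toy_cohorts = {
    "cohort_1": {"GENE_A": "up", "GENE_B": "down", "GENE_C": "up"},
    "cohort_2": {"GENE_A": "up", "GENE_B": "up", "GENE_D": "down"},
    "cohort_3": {"GENE_A": "up", "GENE_B": "down", "GENE_C": "up"},
}
consistent, inconsistent = shared_degs(toy_cohorts)
print(sorted(consistent), sorted(inconsistent))
```

Passing only two cohorts to the same function gives the pairwise overlaps.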

Biological process analysis. In the present study, GO analysis was applied to investigate the biological function of the DEGs. The biological process analysis (Fig. 4a-c) revealed that "small molecule metabolic process" was a common biological function in all 3 cohorts. "Cellular lipid metabolic process" and "cell adhesion" were the common biological processes in Cohort 2 and Cohort 3. The four common biological processes in Cohort 1 and Cohort 2 were "blood coagulation", "transmembrane transport", "positive regulation of transcription from RNA polymerase II promoter" and "signal transduction". Metabolic processes, such as "alpha-linolenic acid metabolic process", "unsaturated fatty acid metabolic process" and "mRNA/RNA metabolic process", and other biological processes were also vital in each cohort.
Pathway analysis. KEGG pathway analysis was used to investigate the pathways associated with the identified DEGs. Figure 5a-c shows the top 20 pathways involved in each cohort. Among them, glycine, serine and threonine metabolism; cytokine-cytokine receptor interaction; PI3K-Akt signaling pathway; p53 signaling pathway; and metabolic pathways were 5 common pathways related to the fatty liver process. Among the top ten pathways in the pathway relation networks of the 3 cohorts, "glycolysis/gluconeogenesis", "p53 signaling pathway" and "glycine, serine and threonine metabolism" were the 3 intersecting pathways (Table 4). MAPK signaling pathway, apoptosis, pathways in cancer, cell cycle, Wnt signaling pathway and pyruvate metabolism were the other intersecting pathways in pairwise comparisons.

Validation of core gene expression in clinical samples. To further determine which genes might play a significant role in the progression of fatty liver, we used real-time qPCR to measure the expression of 8 DEGs (CD24, PZP, COL1A1, COL1A2, LUM, VCAN, THBS2 and EPHA3) in clinical samples. Fifteen healthy liver tissues and 10 fatty liver tissues from liver donors were included. We noted that CD24, COL1A1, LUM, THBS2 and EPHA3 were usually upregulated in fatty liver tissues compared with normal liver tissues, which is consistent with the results of the bioinformatics analysis above (Fig. 6a-f). PZP was significantly downregulated in fatty livers compared with normal livers, which is also consistent with the co-expression results (Fig. 6h).

Discussion
Non-alcoholic steatohepatitis is strongly associated with both genetic and dietary components 1. A number of genes could be associated with the susceptibility to and development of NAFLD and NASH 1,2. In the present study, we imported three GEO datasets into the GCBI comprehensive analysis platform to extract gene expression data of NAFLD/NASH tissue compared with normal liver tissue, or of NASH tissue compared with NAFLD liver tissue. We identified co-expressed DEGs, common biological processes and pathways between NAFLD/NASH and normal liver tissues through differential expression analysis in GCBI. We found that CD24, PZP, COL1A1, COL1A2, LUM, VCAN, THBS2 and EPHA3 were potential biomarkers for distinguishing NAFLD/NASH. On further validation in clinical samples, the expression of CD24, PZP, COL1A1, LUM, THBS2 and EPHA3 was significantly altered. Moreover, CD24 is a core gene among these DEGs, and its impact on hepatocyte steatosis has not yet been studied. CD24 encodes a sialoglycoprotein that is expressed on mature granulocytes and B cells and modulates growth and differentiation signals to these cells 13. Several studies have reported that CD24 is widely distributed, including on hematopoietic cells 14 and non-hematopoietic cells 15,16. CD24 has been associated with various pathophysiological processes, including tracking divergent states of cells 17, regulating CD8 (+) T cell activation 18 and participating in mutant-IDH1-dependent chromatin state reprogramming 19. Accumulating evidence shows that this protein is overexpressed in many types of cancers, resulting in cancer cell growth, proliferation and metastasis 20.
Figure 6. (a-h) Real-time qPCR validation of 8 co-expressed DEGs in 10 NAFLD and 15 normal liver tissues. *P < 0.05, **P < 0.01; NS, no significant difference (unpaired t-tests).
Previous studies have reported that CD24 is overexpressed in nearly 70% of human cancers, and that intracellular CD24 disrupts the ARF-NPM interaction and enables mutational and viral oncogene-mediated p53 inactivation 21. CD24 is significantly correlated with tumorigenesis-related genes, such as non-coding RNAs 22. It was previously demonstrated that a subpopulation of adipocyte progenitor cells expressing the cell surface molecule CD24 is critically important for reconstitution of white adipose tissue (WAT) function in vivo 23. CD24 was also shown to be a specific regulator of adipogenesis in vitro 23,24. Furthermore, it was demonstrated that loss of CD24 in male mice leads to a generalized reduction of WAT and metabolic disturbances 24. Fairbridge et al. 25 reported that the global absence of CD24 affects adipocyte cell size in vivo in a sex- and diet-dependent manner and causes metabolic disturbances in glucose homeostasis and free fatty acid levels. However, the precise function of CD24 and the underlying mechanisms of its activity in NAFLD/NASH progression remain unclear. To our knowledge, this is the first study to identify a prominent correlation between CD24 and NAFLD/NASH.
Insulin resistance (IR) is central in the pathogenesis of NAFLD and is a major factor in hepatic fat accumulation 28. NAFLD often significantly impacts glucose and lipid metabolism by exacerbating hepatic IR 26. CD24 can regulate lipid raft occupancy and may affect glucose uptake by regulating lipid raft protein localization 27. Additional studies will be required to identify whether there is a mechanistic link among CD24, glucose uptake, and insulin resistance. Several studies focused on only a single protein, ignoring the various potential pathways in this complex disease 29.
In our study, we found that "glycolysis/gluconeogenesis", "p53 signaling pathway" and "glycine, serine and threonine metabolism" were 3 intersecting pathways in the relation network. Several studies have reported that glycolysis/gluconeogenesis 30, the p53 signaling pathway 31, and glycine metabolism 32 are each associated with IR. The biological process "small molecule metabolic process" was reported to be vital in cancer 33 and in hepatic metabolic disorders such as IR 34. We therefore suppose it to be a key biological process in fatty liver. Further studies will be required to determine whether the co-expressed DEGs, including CD24, could activate one of these pathways or other mechanisms to induce NAFLD/NASH in humans.

In summary, we first used the GCBI bioinformatics analysis platform to identify DEGs between NAFLD/NASH tissues and normal liver tissues, which showed that CD24 is the hub gene and identified 3 intersecting pathways in the relation network. Then, using qRT-PCR analysis, we found that the mRNA expression of CD24 is upregulated in fatty liver. Further studies are required to elucidate the function and underlying mechanisms of this potential biomarker in the progression of hepatocyte steatosis.