Bioinformatics analysis and machine learning approach applied to the identification of novel key genes involved in non-alcoholic fatty liver disease

Non-alcoholic fatty liver disease (NAFLD) comprises a range of chronic liver diseases that result from the accumulation of excess triglycerides in the liver, and which, in its early phases, is categorized NAFLD, or hepato-steatosis with pure fatty liver. The mortality rate of non-alcoholic steatohepatitis (NASH) is more than NAFLD; therefore, diagnosing the disease in its early stages may decrease liver damage and increase the survival rate. In the current study, we screened the gene expression data of NAFLD patients and control samples from the public dataset GEO to detect DEGs. Then, the correlation betweenbetween the top selected DEGs and clinical data was evaluated. In the present study, two GEO datasets (GSE48452, GSE126848) were downloaded. The dysregulated expressed genes (DEGs) were identified by machine learning methods (Penalize regression models). Then, the shared DEGs between the two training datasets were validated using validation datasets. ROC-curve analysis was used to identify diagnostic markers. R software analyzed the interactions between DEGs, clinical data, and fatty liver. Ten novel genes, including ABCF1, SART3, APC5, NONO, KAT7, ZPR1, RABGAP1, SLC7A8, SPAG9, and KAT6A were found to have a differential expression between NAFLD and healthy individuals. Based on validation results and ROC analysis, NR4A2 and IGFBP1b were identified as diagnostic markers. These key genes may be predictive markers for the development of fatty liver. It is recommended that these key genes are assessed further as possible predictive markers during the development of fatty liver.


Methods and materials Workflow
The RNASeq data of fatty liver patients and clinical features were downloaded from the GEO dataset (GSE126848 and GSE48452).Filtering and normalization were performed as preprocessing, and the data quality was controlled using Principal Component Analysis (PCA).Before classification, feature selection was implemented using Relief-based algorithms to calculate the higher score for each feature.Then Penalize machine learning technique was used to detect the most important biomarkers.Eventually, the candidate genes were validated by other datasets.

Differential expression analysis (preprocessing)
Gene expression data were screened by filtering, and the zero expressions were eliminated; then, data were normalized with limma in R 4.1 software.The adjusted p < 0.05 and − 1.5 <|Log2FC (fold change) |< 1.5 were identified for subsequent analysis as significant genes.After that, Principal Component Analysis (PCA) which is a statistical procedure for visualizing whether the sample groups (control and patients) were separable and correlated was applied..

Identifying Important genes and correlation between clinical/demographic factors with fatty liver
The effect coefficient of all factors on the fatty liver was calculated using Regularization regressions (LASSO, () Ridge, and Elastic Net) models.These models will be described as follows.Before the modeling, Relief-based feature selection was implemented.Weight by Relief is applied to calculate the weights of the attributes in the polynomial dataset.Chi-square and One way-ANOVA also were used to evaluate the relationship between clinical variables and disease, and Kolmogorov-Smirnov was used for normality test distribution.The binary correlation of some variables was examined using a correlation matrix.R4.1.andEVIEWS12 software was utilized for analysis.

Regularization regression
In statistics and machine learning, Regularization regression is a type of regression analysis for variable selection and is used when train and test data are varying.To better manage many parameters or Multicollinearity between variables and reduce complexity, a "penalty" is added to cost function (Regularization) for the best fitting of training data.This reduce the variance of the test data, prevent over-fitting and enhance the prediction accuracy.Here are briefly introduced three Regularization regressions methods.

Least Absolute Shrinkage and Selection Operator (LASSO) regression
The term Lasso stands for "least absolute shrinkage and selection operator".Lasso uses shrinkage by shrinking data values to a central point such as mean.In this model, the regularization method is based on the absolute value of loss function.As a result, the target function in "Lasso Regression" is written as follows

Protein-protein interaction network
The online string tool (https:// www.string-db.org/) was performed to analyze DEGs' protein-protein interaction with a score of 0.4.Moreover, all the networks were depicted using R software.

GO pathway analysis
The enrichment GO analyses were performed to detect the molecular function of DEGs in NAFLD using Go package, nrichGO, and gseGO package.

Validation of biomarkers gene expression
The expression levels of candidate genes in patients were verified by using Gene Expression Omnibus (GEO) dataset (GSE89632 and GSE63067).The validation datasets consisted of data from patients with fatty liver, which were downloaded from this web tool, and the pre-processing was performed.

Combine ROC curve
The receiver operating characteristic (ROC) curve was performed to evaluate the efficacy of the diagnostic model.Specificity, sensitivity, area under the ROC curve, positive predictive value, negative predictive value, and cut-off value were assessed for each gene and their combination.All the procedures were analyzed by package combioROC in R.

Data description
Figure 1A shows the overall workflow.Tables 1 and 2 show the mean and standard deviation of the quantitative variables.The frequency and percentage of attributes in the study are also mentioned.The result of PCA indicated the discrimination between patients and healthy samples (Fig. 1B and C).

Weight by Relief
The weight of the variables of the two datasets can be seen in Fig. 2. The data show a significant correlation between DEGs and fatty liver.

Comparison of three methods for identifying important coefficients (GSE126848)
Three methods of Regularization regression, including LASSO, Ridge, and Elastic Net, were candidate to identify the effect coefficient of variables on fatty liver.Each of the color lines belongs to the coefficient of one variable, which with increasing Lambda parameter, the number of non-zero coefficients decreases, and the size of the coefficients becomes smaller and approaches zero.After fitting the model, with five k-fold cross-validation, the optimal value of the Lambda parameter was determined, and the results of the final model were reported.The model's cross-validation results were plotted in a graph containing different values of Lambda versus Train/ Test error, which shows the Train/Test Error related fitted models in different Lambda sizes (Fig. S1).Among the three implemented methods with five k-fold cross-validation for evaluation, the Elastic Net method had the highest performance (Lambda at minimum error: 11.87, R 2 = 0.999 and alpha = 0.5, l1 Norm = 1.31).The area under the curve was approximately 0.99 with a confidence interval (0.95,1).The Elastic Net is an extension of the lasso robust to extreme correlations among the predictors.The results of Elastic Net method for identifying important factors can be seen in Table 3.

Comparison of three methods for identifying important coefficients (GSE48452)
The three methods of Regularization regression were used to identify candidate genes that may be used to identify the effect coefficient of variables on fatty liver.Each of the colored lines represents the coefficient of one variable, which with increasing Lambda parameter, the number of non-zero coefficients decreases, and the size of the coefficients becomes smaller and approaches zero.After fitting the model, with five k-fold cross-validation, the optimal value of Lambda parameter was gained, and the results of the final model were reported.The results of cross-validation of the model were plotted in a graph containing different values of Lambda versus Train/ Test error, which shows the Train/Test Error related fitted models in different Lambda sizes (Fig. S2).Among the three implemented methods with five k-fold cross-validation for evaluation, the Elastic Net method had the highest performance (Lambda at minimum error: 0.00, R 2 = 0.999 and alpha = 0.5, l1 Norm = 213.66).The area under the curve was approximately 0.99 with a confidence interval (0.95, 1).The Elastic Net is an extension of the lasso robust to extreme correlations among the predictors.The results of Elastic Net method for identifying important factors can be seen in Table 3.

Comparison of three methods for identifying common genes between two datasets
After normalization with significant p-value and log fold change, the common genes between GSE126848 and GSE48452 were 155, which were used to identify the most important candidate genes using Lasso Machine Learning technique.For GSE126848 dataset with 57 samples, among the three implemented methods with five k-fold cross validation for evaluation, the Lasso method had the highest performance (Lambda at minimum error: 1.451, R 2 = 0.999 and alpha = 1, L1 Norm = 15.96)(Fig.S3).For GSE48452 dataset with 73 samples, among the three implemented methods with five k-fold cross-validation for evaluation, the Lasso method had the highest performance (Lambda at minimum error: 0.01388, R 2 = 0.999 and alpha = 1, L1 Norm = 15.96) (Fig. S4).

Identification of dysregulate expression genes (DEGs)
The GSE48452 chip contained 14 NAFLD, 18 NASH, and 27 obese samples, among which 15,000 genes and 1400 DEGs were identified.Moreover, the GSE126848 chip had 15 NAFLD, 16 NASH, and 12 obese 9540 genes, and 843 DEGs were found in this dataset based on specific criteria (Table 2).Furthermore, the commonality of novel genes between two datasets was assessed after normalization.Then Penalize machine learning technique was used to detect the most important common genes between two data sets.The results indicated that eighty-eight genes were common between two datasets (Table 3).

PPI network construction
As seen in Fig. 3, the PPI interaction network of DEGs was analyzed and depicted by String, and the interaction score was set at 0.4.As we can see in the network analysis, the KATA6A and KAT7 genes were strongly correlated, as well as, a significant correlation was detected between the SART3 and RNPS1 genes.

Validation using validation datasets
The five common genes between two datasets, GEO126848 and GEO48452, were validated by two other datasets, consisting of GSE89632 and GSE63067.The results indicated the five most important novel genes in fatty liver, including NR4A2, ZEB2, IGFBP1b, AKR1B10, DHRS2, and UGT2B17 (Table 4).

GO pathway analyses
Enrichment analysis results showed that the molecular function of shared DEGs was mainly enriched in structural molecule activity.The biological processes were peptide biosynthetic process and translation.Moreover, the main involved cell components were ribonucleoprotein complex and ribosome.Reactom pathway analysis revealed that metabolism of RNA and cellular responses to stress and stimuli were the most significant dysregulated pathways in fatty liver (Fig. 3).

ROC curve for identification of diagnostic markers
Our finding showed that NR4A2 alone (AUC of 0.92, 95% CI with a sensitivity of 1.00and specificity of 0.71), and also, its combination with ZEB2 (AUC of 0.92, 95% CI with a sensitivity of 0.90 and specificity of 0.85) had the highest rank of ROC analysis and can be considered as diagnostic markers (Fig. S5 and Table S1).Moreover, our data revealed that IGFBP1b alone (AUC of 0.90, 95% CI with a sensitivity of 0.89 and specificity of 0.87), and its combination with AKR1B10, DHRS2, IGFBP1, and UGT2B17 with AUC of 0.96, 95% CI with a sensitivity of 0.94 and specificity of 0.95, also had the highest rank (Fig. S6 and Table S2).

Association between Clinical/Demographic factors and fatty liver
A significant relationship was obtained between fat, fibrosis, BMI, inflammation, and fatty liver.

Investigation of the binary correlations of Clinical/Demographic influence variables on fatty liver
Using the correlation matrix, we examined the correlation between pairs of variables.The results are shown in Fig. 4. Note that a correlation coefficient of less than 0.3 is considered weak, the coefficient between 0.3 and 0.6 is moderate, and a coefficient greater than 0.6 is considered strong.Coefficients with a P-value less than 0.05 are also significant.As we concluded from Fig. 4, BMI, Lar, Leptin, Fat, and Nas have correlated significantly with the disease in positive direct and Adiponectin correlated with fatty liver negatively.

Discussion
For the first time in the present study, we have used machine learning approaches to compare the gene expression profile of individuals with NAFLD, NASH, and obesity with healthy individuals.Firstly, we analyzed GSE126848 and GSE48452 datasets separately, and the results detected 9540 and 1400 DEGs genes in the two datasets, respectively.We reported genes with higher coefficients in each dataset.Six genes, including ABCF1, SART3, APC5, NONO, KAT7, and ZPR1 were identified in GSE48452 datasets, as well as four genes, including RABGAP1, SLC7A8, SPAG9, and KAT6A were detected in GSE126848 dataset with a different expression between NAFLD and healthy samples.Subsequently, we identified six common genes between the two datasets and validated them in other datasets.Further analysis demonstrated that two genes, including NR4A2 and IGFBP1b with higher AUC, sensitivity, and specificity, were diagnostic biomarkers in fatty liver.
ABCF1, also named ABC50, is a member of the ABC transporter superfamily protein localized on the cytosol and endoplasmic reticulum (ER), which transport different molecules, including carbohydrates, amino acids, and ions.Furthermore, ABCF1 is critical in regulating innate immune and inflammatory responses 27,28 .This protein  is considered an oncofetal protein significantly expressed in the fetal liver, not healthy adult cells.Fung et al.
showed that the expression of ABCF1 was increased in hepatocellular carcinoma (HCC), and was associated with chemoresistance 29 .Cheung et al. demonstrated that upregulated ABCF1 gene is associated with poor recurrencefree survival (RFS) in liver cancer 30 .A significant association between other members of the ABC family and NAFLD has been proven in previous studies.ABCB1 plays a crucial role in transporting phospholipids and cholesterol into the liver cells.An animal study exhibited that the level of transporter proteins such as ABCB1, ABCC1-6, and ABCG2 increased during the progression of NASH 31 .The ABCB1 is overexpressed in liver diseases such as cholestatic, biliary cirrhosis, and obstructive jaundice [32][33][34] .The SART3 and RNPS1 are the genes with the highest score in the advanced stage of NAFLD; moreover, the result of PPI revealed that there is a strong correlation between SART3 and RNPS1, both of them are members of the post-splicing complex.SART3 is known as tumor-associated antigens detected in HCC and makes hepatocytes sensitive to immunotherapy 35 .A previous study used two datasets of GEO (GSE33814 and GSE89632) and showed that RNPS1 is one of the top genes overexpressed in NAFLD cells compared to the control group.RNPS1 is a member of the post-splicing complex role in RNA processing and apoptosis 36 .One of the other key genes detected in our investigation was APC5, a subunit of the anaphase-promoting complex (APC).Zhang et al. showed APC5 plays a critical role in activating the cell cycle during adipose tissue proliferation 37 .A study showed that after feeding, the expression of NONO gene significantly increased to uptake glucose.Furthermore, the results revealed that the deficient-NONO gene in mice reduces triglyceride storage and increases hepatocyte lipid catabolism 38 .In a current study, Wu et al. indicated that the expression of NONO gene was highly elevated in NAFLD mice 39 .our result indicates that CNTNAP1 is upregulated in NAFLD, which agrees with the previous study.CNTNAP1 has a positive role in triglyceride metabolism 40 .KAT7 gene, also known as HBO1, belongs to the lysine acetyltransferase family, which is a key factor in forming a replication complex, regulating the immune system and developing embryonic development.Information confirmed that the expression of KAT7 in mRNA and protein levels elevated in HCC cells leads to the proliferation and invasion of tumor cells.Zhong et al. reported that silencing the KAT7 gene using short hairpin RNA (shRNA) and CRISPR/Cas9 in the xenograft HCC model inhibited tumorigenesis 41 .ZPR1 is a zinc finger family member, and Wo et al. showed patients with severe NAFLD had ZPR1 rs964184 polymorphism.we hypothesized that this polymorphism could be associated with high expression of ZPR1 in patients 42 .The analysis of the GSE126848 dataset revealed that the expression of RABGAP1 gene is associated with NAFLD.The previous studies showed Rabgap1 expression raised in perirenal fat and brown fat in Gpr21 knockout mice when fed with a high-fat diet 43 .Rabgap1 GTPase Activating protein which transited the cells from metaphase to anaphase.SLC7A8 and SPAG9 are two novel DEGs identified in our study.SLC7A8, the light-chain subunit solute carrier family 7, member 8, is a vital gene in inducing hypertrophy in adipose tissue and inflammation.Pitere et al. reported that the SLC7A8 deficiency in mice with diet-induced obesity decreases lipid accumulation in the liver 44 .SPAG9 is expressed explicitly in the testis and has a vital role in fertility.A study on chicken illustrated that the samples that overexpressed the SPAG9 gene have more fat content on the abdominal and liver tissues 45 .Furthermore, SPAG9 increases the proliferation of HCC cells through the interaction with MAPK/Jun pathway 45 .KAT6A is another member of the lysine acetyltransferase family, which epigenetically regulates the transcription of different genes involved in DNA repairing systems, cell cycle, metabolism, and autophagy.Many studies confirmed the overexpression of KAT6A related to HCC progression and chemoresistance 46,47 .
Our result revealed a significant relationship between clinical and demographic data, including fat, fibrosis, body mass index (BMI), inflammation, and fatty liver.In many studies, BMI is announced as a critical index for increasing the risk of fatty liver.The BMI score of patients is a 4 to 14-fold change higher than healthy individuals.Fan et al. reported that 73% of patients with NAFLD were obese and overweight 48 .BMI measurement is a helpful and non-invasive marker for predicting fatty liver.They suggested triple approaches comprising examining the lipid panel, BMI measurement, and radiological techniques 49,50 .Inflammation and fibrosis are the major pathological consequences of NAFLD.Fibrogenesis is stimulated by the activation of hepatic stellate cells and Kupffer cells, resulting from high plasma levels of glucose and lipids 51 .The activated hepatic stellate cells express different myogenic and pro-inflammatory markers such as myocyte enhancer factor-2 (Mef2), c-myb, and TGF-β.www.nature.com/scientificreports/Moreover, inflammation results from increasing the level of reactive oxygen species (ROS) and cytokines in liver tissue 52,53 .The result of a meta-analysis revealed that the fibrosis stage significantly correlates with the risk of mortality in NAFLD 54 .
We reported NR4A2 and IGFBP1b as novel diagnostic biomarkers in fatty liver.Insulin-like growth factor binding protein (IGFBP) binds to insulin-like growth factors (IGFs) and regulates cellular metabolism.Hepatocytes largely produce IGFBP and secrete it into the serum.Previous studies are in line with our results, Pan NR4A2 is a transcription factor that plays a pivotal role in regulating fatty acid beta-oxidation.Therefore, the dysregulation of NR4A2 causes fat accumulation in the liver 56 .Chen et al. showed that NR4A2 overexpression prevents Hepatic stellate cell (HSCs) proliferation which plays a key role in liver fibrogenesis 57 .Previous evidence confirmed that novel approaches, including machine learning, are promising strategies for diagnosing, preventing, and managing diseases.Wu et al. compared four machine learning algorithms in predicting fatty liver disease, and they showed that the random forest model has a higher performance in the early diagnosis of fatty liver 58 .The result of a cross-sectional investigation showed that machine learning is a predictive model of NAFLD.They revealed that this method enhances clinical decisions and reduces end-stage disease 59 .Furthermore, previous studies used machine learning methods for identifying novel biomarkers in various conditions, such as cancer [60][61][62] , cardiovascular diseases 63,64 , pulmonary diseases 65,66 , and neurological disorders 67,68 .
In conclusion, using a bioinformatic approach; twelve key genes were detected that are significantly related to the fatty liver.It is recommended that these key genes are assessed further as possible predictive markers during the development of the fatty liver.

Figure 1 .
Figure 1.(A) The flow of work; The result of Principal Component Analysis (PCA) indicated the discrimination between patients and healthy samples in (B) GSE126848, and (C) GSE48452 datasets.

Figure 3 .
Figure 3. Gene ontology (GO) functional annotation of top DEGs enrichment terms in fatty liver disease; molecular function (MF) of DEGs was mainly enriched in histone acetyltransferase activity, peptide − lysine − N − acetyltransferase activity, histone binding, and peptide N − acetyltransferase activity.The biological process (BP) consisted of RNA splicing, hematopoietic stem cell proliferation, and histone H3 acetylation.The cell component (CC) was detected in nuclear speck and H3 histone acetyltransferase complexes.

Figure 4 .
Figure 4. Correlation matrix for showing significant relationship between clinical/demographic influence variables in fatty liver disease.

Table 1 .
The clinical characteristics of datasets.The ensemble ID was converted to gene name by Biotool.fr and the ID ref of GEO was converted by g: profiler.All the none genes were deleted from the study.

Table 2 .
Association between Clinical/Demographic factors and fatty liver.

Table 3 .
The most important genes coefficients on the fatty liver) GSE126848 and GSE48452).

Table 4 .
Common genes between GSE126848 And GSE48452 validated in other datasets.