Identification of useful genes from multiple microarrays for ulcerative colitis diagnosis based on machine learning methods

Ulcerative colitis (UC) is a chronic relapsing inflammatory bowel disease with an increasing incidence and prevalence worldwide. The diagnosis for UC mainly relies on clinical symptoms and laboratory examinations. As some previous studies have revealed that there is an association between gene expression signature and disease severity, we thereby aim to assess whether genes can help to diagnose UC and predict its correlation with immune regulation. A total of ten eligible microarrays (including 387 UC patients and 139 healthy subjects) were included in this study, specifically with six microarrays (GSE48634, GSE6731, GSE114527, GSE13367, GSE36807, and GSE3629) in the training group and four microarrays (GSE53306, GSE87473, GSE74265, and GSE96665) in the testing group. After the data processing, we found 87 differently expressed genes. Furthermore, a total of six machine learning methods, including support vector machine, least absolute shrinkage and selection operator, random forest, gradient boosting machine, principal component analysis, and neural network were adopted to identify potentially useful genes. The synthetic minority oversampling (SMOTE) was used to adjust the imbalanced sample size for two groups (if any). Consequently, six genes were selected for model establishment. According to the receiver operating characteristic, two genes of OLFM4 and C4BPB were finally identified. The average values of area under curve for these two genes are higher than 0.8, either in the original datasets or SMOTE-adjusted datasets. Besides, these two genes also significantly correlated to six immune cells, namely Macrophages M1, Macrophages M2, Mast cells activated, Mast cells resting, Monocytes, and NK cells activated (P  <  0.05). OLFM4 and C4BPB may be conducive to identifying patients with UC. Further verification studies could be conducted.

www.nature.com/scientificreports/ Data processing. Based on the training and testing groups, we further analyzed the data. Firstly, we used R package preprocessCore (version 1.56.0) for quantile normalization. This process was composed of (1) transferred the primary dataset into a fixed data type of "matrix"; and (2) the function of "normalize. quantiles" was used for quantile normalization. Secondly, we screened the differently expressed genes (DGEs) in both UC patients and healthy subjects of the training set. Then, the functional analysis of DGEs was conducted through Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis, Disease Ontology (DO) enrichment analysis, and Gene Set Enrichment Analysis (GSEA). Moreover, six machine-learning algorithms, namely LASSO, SVM, NN, GBM, RF, and PCA, were used to establish the models. Regarding the testing group, we divided two subgroups for verification, including one individualized set and all datasets of the testing group. Finally, a correlation analysis between identified genes and immune cells was performed.
General statistical consideration. Statistical analysis was conducted with R software (version 4.1.0; https:// www.r-proje ct. org/) and the basement of RStudio (version 1.4.1717). For continuous variables, the independent Student's t-test was adopted if the variables met Gaussian distribution, if not, the Wilcoxon test was used. For categorical variables, the chi-square test was used, and the Wilcoxon test was used for signed-rank variables. The Pearson or Spearman coefficients were adopted in the correlation analysis. A two-sided p value < 0.05 was considered as significant criteria. MLs for the development of predictive models. We adopted six machine-learning algorithms (LASSO, SVM, PCA, RF, GBM, and NN) for the identification of significant predictive genes in DEGs. LASSO was performed to screen the candidate genes with binomial deviance and 10-fold cross-validation for the discrimination of UC patients and healthy subjects with glmnet (version 4.1) package. To avoid overfitting, the SVM algorithm was also used to adjust the premium genes with e1071 R packages (version 1.7-6) and the core of "svmRadial". Specifically, the R pakage e1071 was applied to the SVM with the function of "rfe" with "sizes" from 2 to 40 with step size of 3, "rfeControl" with "functions" of "caretFuncs" and "cv", and the "methods" was "svmRadial". The SVM code in RStudion was setting as the followed: " rfe (x = data, y = as.numeric (as.factor (group)), sizes = c (seq (2, 40, by = 2)), rfeControl = rfeControl (functions = caretFuncs, method = "cv"), methods = "svmRadial")". The R package randomForest (version 4.6-14) was adopted for the RF algorithm with 100 trees. And the PCA algorithm was performed by psych package (version 2.2.3). The variable importance in projection (VIP) values of PCA was used to estimate the importance of genes. The neuralnet (version 1.44.2) was used for the NN algorithm with 3 hidden. The h2o (version 3.36.0.3) was adopted for the GBM algorithm with 100 trees. The overlapping genes that existed in these two algorithm groups were included and the expression levels of candidate genes were further validated in the testing group. Furthermore, the SMOTE with R package DMwR (version 0.4.1) was used to expand the sample size when an imbalanced sample size appeared between the two groups. To estimate the prediction value for UC diagnosis, we used the pROC package (Version 1.17.0.1) in R. AUC of ROC was calculated to judge the accuracy of the predictive model. The greater value of AUC presents the higher accuracy of the predictive model. Additionally, the error rate was added for the assessment of accuracy among various MLs, and the lowest value of error rate could be indicated a better classification capacity.

Correlation analysis.
To quantify the relative proportions of immune cells from the gene expression profiles, CIBERSORT (https:// ciber sortx. stanf ord. edu/), a bioinformatics algorithm, was conducted for correlation analysis. The putative abundance of immune cells was estimated using a reference set with 22 types of immune cell subtypes with 1,000 permutations. Correlation analysis and visualization of these 22 types of immune cells were performed using the corrplot R package (version 0.84). Using the vioplot R package (version 0.3.5), violin plots visualized the differences of immune cells between the UC group and the healthy control cohort. The Spearman's rank correlation analysis in R software was performed. The ggplot2 R package (version 3.3.5) was adopted to visualize infiltrating associations between various immune cells.

Results
Search. According to the inclusion and exclusion criteria, a total of ten microarrays (526 individuals), including GSE48634, GSE6731, GSE114527, GSE13367, GSE36807, GSE3629, GSE53306, GSE87473, GSE74265, and GSE96665 were included for subsequent analysis. Based on the random ratio of 6:4, we developed the training set with 6 microarrays (201 UC patients and 106 healthy subjects), including GSE48634, GSE6731, GSE114527, GSE13367, GSE36807, and GSE3629, and further established the predictive model. Meanwhile, we developed the testing set with four microarrays (186 UC patients and 33 healthy subjects), including GSE53306, GSE87473, www.nature.com/scientificreports/ GSE74265, and GSE96665, and further tested the results of the model. In addition, as the microarray GSE87473 included the largest sample size in the testing set, we also selected it as a separate group for the model testing.
With the datasets, we further processed the following analysis just shown in the workflow (Fig. 1).

Identify DEGs.
Among the training group, we identified a total of 87 DEGs with biological significance.
Details are provided in Appendix 1. Compared to the healthy control, there are 81 genes presented as up-regulated and 6 genes shown as down-regulated in the UC patients. Generally, the bigger absolute value of LogFC and adjusted P value of Log10 indicates a greater difference between the two groups. Therefore, OLFM4, C4BPB, and CLDN8 distributed in the margin of the heatmap indicated an obvious difference between the two groups ( Fig. 2). Six machine-learning algorithms for candidate genes. In this study, six predictive models, including LASSO, SVM, PCA, RF, NN, and GBM were successfully established, respectively (Fig. 4). We identified 27 candidate genes through the 10-fold cross-validations of binomial deviance (Fig. 4A) and minimum lambda 0.01162648 of the LASSO algorithm (Appendix 2). In comparison, 16 candidate genes were identified based on the SVM algorithm ( Fig. 4B) with svmRadial function. Furthermore, the PCA analysis ( Fig. 4C-D) indicated that the two groups of UC patients and healthy control were distributed in different quadrants with obvious discrimination among 2 and 3 dimensions (Fig. 4C,D). With the increasing trees of RF analysis, the error rate presented decreased (Fig. 4E). In GBM (Fig. 4F), the various important genes indicated that OlFM4, HLA-DMA, and C4BPB showed a dominant weight proportion. As four MLs (SVM, RF, NN, and GBM) were used for classification, we adopted the calculated error rate for model evaluation. The SVM presented the lowest value among the four MLs (Table 1). Moreover, we calculated the top 20 weighted genes in a total of six MLs (Table 2). According to the criteria of counts frequency > 4, there     www.nature.com/scientificreports/ were six genes finally selected, including C4BPB, OLFM4, NMI, CLDN8, S100P, and RARRES3. All of them presented significance in UC group and healthy group (Fig. 5).

Functional enrichment analysis.
Evaluation of the models. We adopted the ROC curve and AUC values to assess the diagnosis value of the model. When we set the 6 genes (as mentioned above) into the ROC curve (Appendix 3), the results showed that the AUC values of OLFM4 and C4BPB were higher than 0.8 in the training group, testing group, and individual GSE87473 testing group (Fig. 6A1-F1). In addition, aim to reduce the potential bias from imbalanced sample size in different groups, we selected the SMOTE technique for our analysis. For the training group, we expanded the 1:1 ratio for the UC patients (n=201) and healthy controls (n=201), while in the testing group, we have 197 UC patients vs 198 healthy controls. Regarding the GSE87473, we expanded to 107 patients and 105 healthy controls. As indicated in Fig. 6A2-F2, both the primary datasets and SMOTE datasets showed a good AUC >0.8.

Correlation analysis.
To analyze the relationship between ten microarrays and 22 immune cells, we demonstrated the relative percentage of immune cells among 526 samples of ten microarrays (Fig. 7A). Then, the significant immune proportions of the UC and healthy groups indicated that there were 7 types of immune cells, including B cells naive, T cells CD4 memory resting, T cells follicular helper, Macrophages M0, Macrophages M1, and Macrophages M2, Neutrophils (Fig. 7B). Moreover, we analyzed the correlations between 22 immunize cells and the two important genes of OLFM4 and C4BPB through the Spearman analysis (Appendix 4). The significant results were presented in the following 6 types of immune cells, including Macrophages M1, Macrophages M2, Mast cells activated, Mast cells resting, Monocytes, and NK cells activated (Fig. 8). www.nature.com/scientificreports/

Discussion
In this study, we included ten microarrays with 526 samples for data analysis, and further selected a total of 87 DEGs. Based on the six MLs, we successfully established the predictive models and identified two useful genes in the diagnosis of UC, namely OLFM4 and C4BPB. The AUC values of C4BPB (average 0.873) were 0.812, 0.899, and 0.907 in the training group, testing group, and individual GSE87473 testing group, respectively. Compared to the previous studies, C4BPB presented the diagnosis value in Crohn's disease 34 . But this study extended its scope to UC. Additionally, the AUC values of OLFM4 (average 0.865) were 0.804, 0.897, and 0.895 in the training group, testing group, and individual GSE87473 testing group, respectively. In previous studies, OLFM4, as a cancer stemness gene induced by IL-22, was highly distributed in primary sclerosing cholangitis-associated ulcerative colitis 35 . It was also overexpressed in the active IBD 36 . In this study, we added a new result in terms of diagnosis values of OLFM4 for UC patients.
Regarding the correlations analysis, these two genes (e.g., OLFM4 and C4BPB) presented significant associations with 6 types of immune cells, including Macrophages M1, Macrophages M2, Mast cells activated, Mast cells resting, Monocytes, and NK cells activated. Some scholars have found that the UC patients presented an increasing percentage of CD23 B naive cells, compared to the normal individuals 5 . Regulatory T cells were also a key factor that exacerbated UC through immune imbalance 37 . Compared to Crohn's disease, the colonic mucosa samples in UC patients showed the expansion of IL17A+ CD161+ effector memory T cells and IL17A+ T-regulatory cells, expansion of HLA-DR+CD56+ granulocytes, and reductions in type 3 innate lymphoid cells 38 . The regulation of T cells for UC patients based on Bcl-6 and IL-21 could help to avoid the occurrence and development of IBD 39 . In a previous survey that included a global immune cell landscape of UC patients' tissue, the results identified the increasing number of neutrophils, T CD4 memory-activated cells, active dendritic cells, and M0 macrophages, and decreasing number of T CD8, Tregs, B memory, and M2 macrophages 40 .
Through the functional enrichment analysis, we found three pathways, including Chemokine signaling pathway, Epstein-Barr virus infection pathways, and Rheumatoid arthritis pathways, might be closely related to the progress of UC. The previous study had emphasized the potential role of the Chemokine signaling pathway in the up-regulation of UC patients and further proposed the CXCL8-CXCR137 (a type of chemokine) in the treatment of UC 41 . Although Epstein-Barr virus infection might trigger several immune dysfunctions, such as natural killer/T cell lymphoma arising, hemophagocytic lymphohistiocytosis, and malignancies, few studies focused on UC previously [41][42][43] . Furthermore, Epstein Barr Virus might be useful in the development of vaccines and immune cell therapy for EBV-Associated diseases, especially for several immune-related diseases 44 . The prognosis of Epstein-Barr virus infection in UC was less paid attention to 45 . In this study, the KEGG pathways results indicated that more studies of the Epstein-Barr virus infection pathway in UC could be conducted 46 . Particularly, IBD patients have a higher risk to develop autoimmune and inflammatory diseases, such as rheumatoid arthritis 47 .
Actually, there are many limitations to using machine learning in a clinical setting. As MLs include multiple factors, especially in statistics, clinical practice, and bioinformatics. To improve the study design and to facilitate the explanation of results from ML analysis, it is recommended to include a variety of experts of authors/researchers in a study. Individuals with rich clinical experience and MLs technique background are also conducive to  www.nature.com/scientificreports/ playing an important role in clinical MLs studies. In this study, there are some limitations. Firstly, insufficient verification is a common type of limitation in bioinformatics studies. Although we designed testing groups to assess the stability of the predictive model based on AUC values, and included ten microarrays to increase the sample size in this study, more research works, either in clinical trials or animal experiments, should be conducted to obtain solid verifications for these predictive results. Secondly, the machine-learning model itself includes some limitations, such as the black box phenomenon 48 , particularly in the NN method which includes many layers, such as an input layer, an output layer, and hidden layers (count fluctuating) 49,50 . Among them, the characteristics of the hidden layers are hard to identify 51 . Thirdly, we have limited information about the clinical features, such as the patient's age 30 , ethnicity and race 52,53 and stage of UC. Generally, some detailed information impacts the algorithm bias. Thus, further subgroup analysis could be included to identify more useful results in future research.

Conclusion
In this study, we found two useful genes of OLFM4 and C4BPB which may help to identify UC patients. Further verification studies could be conducted.