Identification of Novel Genes in Human Airway Epithelial Cells associated with Chronic Obstructive Pulmonary Disease (COPD) using Machine-Based Learning Algorithms

The aim of this project was to identify candidate novel therapeutic targets to facilitate the treatment of COPD using machine-based learning (ML) algorithms and penalized regression models. In this study, 59 healthy smokers, 53 healthy non-smokers and 21 COPD smokers (9 GOLD stage I and 12 GOLD stage II) were included (n = 133). 20,097 probes were generated from a small airway epithelium (SAE) microarray dataset obtained from these subjects previously. Subsequently, the association between gene expression levels and smoking and COPD, respectively, was assessed using: AdaBoost Classification Trees, Decision Tree, Gradient Boosting Machines, Naive Bayes, Neural Network, Random Forest, Support Vector Machine and adaptive LASSO, Elastic-Net, and Ridge logistic regression analyses. Using this methodology, we identified 44 candidate genes, 27 of these genes had been previously been reported as important factors in the pathogenesis of COPD or regulation of lung function. Here, we also identified 17 genes, which have not been previously identified to be associated with the pathogenesis of COPD or the regulation of lung function. The most significantly regulated of these genes included: PRKAR2B, GAD1, LINC00930 and SLITRK6. These novel genes may provide the basis for the future development of novel therapeutics in COPD and its associated morbidities.

Module identification. Normalized gene expression data was used for module identification in the SPD algorithm. In total, 576 modules were identified. Three modules were biologically more related to the progression and phenotype of COPD including, 119, 242 and 324. The minimal spanning trees obtained from the SPD algorithm are shown in Fig. 2. All the genes involved in COPD progression are presented in Table 2 and then included in machine-learning and statistical modeling approaches. From these three selected modules, gene expression   within two of the modules (Fig. 2a,b), associated with COPD-progression, was increased in SAE cells. In contrast, gene expression within the third module (Fig. 2c), associated with COPD-progression, was decreased in SAE cells. In Fig. 2d, classification of samples was shown based on the disease stage (dark blue = healthy non-smoker, light blue = healthy smoker, light brown = COPD stage I and dark brown = COPD stage II).
Gene selection and prediction. Based on the machine-learning and statistical penalized algorithms, and after adjustment of the effect of pack per year of smoking, elastic-net logistic regression had the highest AUC (82%), sensitivity (85%), specificity (51%) and lowest misclassification error rate (25%). In reverse, decision trees method has lowest AUC (57%), sensitivity (69%), specificity (43%) and highest misclassification error rate (39%) than other algorithms. Based on the elastic-net logistic regression, the most important selected genes included, THSD4, PPP4R4, JAKMIP3, LINC00930, DNHD1, TMCC3, CCDC37, PRDM11, GLI3, ABCC3, ADH7, SAMD5, RASSF10, USP27X, GAD1, CYP1A1, NR0B1, CYP1B1, PLAG1, PIEZO2, SCGB1A1, LOC100507560. Consequently, 44 candidate genes identified here are associated with either the occurrence or progression of COPD, or lung function (Table 3). According to the results of each computational method, 44 were selected and the computational methods were hierarchically clustered, simultaneously (Figs 3 and 4). Of these 44 genes, 27 have been previously reported in the literature to be associated with COPD, lung function (FVC, FEV 1 or the FEV 1 /FVC ratio) or other lung diseases. These 27 genes also include the genes of THSD4, PPP4R4, SCGB1A1, and NRG1, already detected in GWA studies to determine single nucleotide polymorphisms (SNPs) specifically for COPD (Table 4). Furthermore, in our study, SNPs within 4 additional genes have been detected in GWAS studies carried out previously in lung-related studies including: PRDM11 and AHRR FVC, smoking 15,16 , CYP1A1 childhood bronchitis 17 and CYP1B1 lung cancer 18 . In this study, we have identified 17 genes which have not previously been detected in COPD studies, these include: LINC00942, REEP1, C6orf164, LINC00589, JAKMIP3, LINC00930, DNHD1, TMCC3, ADH7, PRKAR2B, GAD1, LOC338667, CYB5A, PIEZO2, SLITRK6, KCNA1 and LOC100507560 (Table 4). These genes may represent novel biomarkers in the diagnosis and prognosis of COPD. Figure 5 depicts the functional protein-association networks for the 44 selected genes, as shown by STRING. Investigation of the differential expression of genes in healthy non-smokers (HNS; control subjects), healthy smokers, COPD patients, and COPD Stage I and II patients. In this study, we also investigated the differential expression of our 44 candidate genes in healthy non-smokers (HNS; control subjects; n = 53), healthy smokers (HS; n = 59), COPD patients (n = 21), and COPD stage I (COPD I; n = 9) and II (COPD II; n = 12) patients, respectively. We investigated the differential gene expression between HNS and HS and found significant differences in expression in 39/44 (88.6%) of all genes. In addition, 16/17 (94.1%) of the genes, not previously detected associating with COPD or lung function, were differentially expressed (Table 5; column HS v HNS). We then investigated the differential expression of these 44 genes in HS and COPD patients. Here, 24/44 (54.5%) of all genes studies were significantly regulated. Furthermore, 10/17 previously undetected genes in COPD/lung function were differentially regulated (Table 5; column COPD v HS). Finally, we investigated the regulation of these 44 genes in COPD Stage I and II patients compared with HS (Table 5; columns stage I v HS and stage II v HS). Here, we observed that 5/44 (11.4%; COPD stage I) and 16/44 (36.3%; COPD stage II) were differentially regulated. Among the previously undetected genes in COPD/lung function, 10/17 (58.8%) and 6/17 (35.3%) were significantly different in COPD stage I and II, respectively, compared with HS. A number of genes were significantly different in all four analyses (HS v HNS; HS v COPD; HS v COPD I; HS v COPD II), including: USP27X, AHRR, CYP1A1 and CYP1B1. Interestingly, of these genes, not previously identified to associate with COPD/lung function, PRKAR2B and GAD1 were significantly different in all four analyses. Therefore, this study reveals for the first time the potential role of PRKAR2B and GAD1 in COPD and smoking-related dysfunction in lung.

Investigation of the gender effect on differential gene expression in HNS, HS and COPD (Stage I and II) patients.
Here we examined the effects of gender on the expression of our 44 candidate genes. We demonstrated that the expression of 40/44 (90.9%) of these genes is significantly different in HS men compared to HNS men (Table 6; HS v HNS). In addition, 15/17 (88.2%) of the novel genes previously undetected in COPD/ lung function had significantly different expression levels ( Table 6; HS v HNS) in men. Investigation of the expression levels of the 44 candidate genes in men with COPD versus HS revealed that 21/44 (47.7%) of genes were significantly different (Table 6; COPD v HS) and 10/17 (58.8%) of previously undetected genes in COPD/lung function were also significantly different. When HS were compared to COPD Stage I and II patients, respectively, 4/44 (Stage I; 9.0%) and 7/44 (Stage II; 15.9%) of the total candidate genes were significantly different in male HS compared to HNS. Of the 17 novel genes detected in this study, 1/17 (Stage I; 5.9%) and 3/17 (Stage II; 17.6%) were significantly different in males compared to HS (Table 6; Stage I or Stage II v HS). A number of the 44 candidate genes were significantly different in males across all four analyses, these included USP27X, AHRR, and the novel gene, J AK MI P3.

1.
In COPD patients who consumed ≥50 packs/year, we observed a significant change in gene expression in 22/44 (50%) candidate genes and 9/17 (52.9%) novel genes compared with HS (Table 8; ≥50 packs/year; COPD v HS). We also investigated differential gene expression in Stage I and II COPD patients who consumed ≥50 packs/year and determined that 7/44 (15.9%; Stage I) and 10/44 (22.7%; Stage II) candidate genes were significantly different compared to gene expression in HSs. In our cohort of 17 novel genes, we determined that gene expression in 2/17% (11.8%; Stage I) and 4/17% (23.5%; Stage II) was significantly different in COPD patients who consumed ≥50

Discussion
Chronic obstructive pulmonary disease (COPD) is a progressive inflammatory disease characterized by airway obstruction and is predicted to be among the first three causes of death worldwide 1,2 . A significant degree of clinical heterogeneity has been observed in COPD patients. In functional terms, all COPD patients experience a loss in lung function, as measured using FEV1 and FVC. However, these clinical parameters are not optimal and FEV1 has been shown to correlate weakly with clinical outcome and health status 19,20 . Currently, there is an unmet clinical need to identify novel biomarkers that will facilitate improved diagnosis and prognosis in COPD.
To date, COPD has been shown to develop in 30% of smokers, with smoking being one of the main risk factors associated with the development of COPD 3 . The aim of this project was to identify candidate novel biomarkers, which may provide future novel therapeutic targets, in order to facilitate the treatment of COPD using machine-based learning algorithms and penalized regression models. In this study, 59 healthy smokers, 53 healthy non-smokers and 21 COPD smokers (9 GOLD stage I and 12 GOLD stage II) were included (n = 133). 20,097 probes were generated from SAE microarray data obtained from these subjects previously 14 . Consequently, 44 candidate genes were identified to be associated with the occurrence or progression of COPD, or lung function. Of these 44 genes, 27 have been previously reported in the literature to be associated with COPD or lung function (FVC, FEV 1 or the FEV 1 /FVC ratio). In this study, we also identified 17 genes not previously detected in COPD studies that may represent novel biomarkers in the diagnosis and prognosis of COPD. In our analyses among healthy non-smokers and healthy smokers and COPD patients (GOLD stage I and II), the most significantly regulated novel genes were: PRKAR2B, GAD1, LINC00930 and SLITRK6. PRKAR2B is a protein kinase type II-beta regulatory subunit dependent on cAMP and encoded by the PRKAR2B gene in human 21 . In our overall analyses, expression of PRKAR2B was significantly downregulated in healthy smokers and in COPD patients (and in COPD stage II) compared to healthy non-smokers. Furthermore, in males, PRKAR2B expression was also significantly downregulated in healthy smokers and in COPD patients compared to healthy non-smokers. In females, these differences were less pronounced. In subjects less than 50 years, PRKAR2B expression was significantly downregulated in healthy smokers and in COPD patients compared to healthy non-smokers. In patients over 50 years, these differences were less pronounced. With regards to smoking exposure, COPD patients who smoked more than 50 packs per year had significantly lower PRKAR2B gene expression than healthy non-smokers. This decrease was not evident in COPD patients who smoked less than 50 cigarette packs per year. Thus, we hypothesise that PRKAR2B may represent a previously unknown factor both in pathogenesis of COPD and smoking exposure. PRKAR2B is an important protein kinase in cAMP signaling, and other researchers have demonstrated that cAMP is a protective factor in the lung and COPD. Furthermore, cAMP has been shown to attenuate pro-inflammatory responses whilst concomitantly increasing anti-inflammatory responses in a number of innate immune cells 22 . The reduced PRKAR2B gene expression observed in this study may reveal PRKAR2B as a novel target in the treatment of COPD.
In contrast, in this study we also observed a significant upregulation of the novel gene, GAD1, in healthy smokers and COPD patients (and in stage II) compared to healthy non-smokers. In male only subjects, this pattern was replicated. The increase in expression of GAD1 in healthy smokers and COPD patients compared to healthy non-smokers was marginally less significant than in male subjects, as expected. In subjects younger than 50 years, there was a more significant increase in GAD1 expression in healthy smokers and COPD patients compared to subjects over 50 years and also the non-smokers. Smoking exposure only significantly increased GAD1 levels in healthy smokers and COPD patients who smoked more than 50 packs per year. There were no significant changes in GAD1 expression in subjects who smoked less than 50 packs per year. Other studies have shown that levels of γ-aminobutyric acid (GABA) and glutamic acid decarboxylase 1 (GAD1), the enzyme that synthesizes GABA, are significantly increased in neoplastic tissues 23 . Furthermore, other researchers have shown that the GAD1 promoter is hypermethylated in a number of cancer cells. This effect was shown to lead to the production of high levels of GAD1, as opposed to gene silencing which one would expect. The GAD1 promoter contains a number of CpG island motifs which facilitate this hypermethylation. In this study, we hypothesise that the increased levels of GAD1 detected following smoking exposure and COPD could mean that GAD1 is an important target in the treatment of this smoking-related disease. Previous studies have demonstrated that patients with COPD are at an increased risk for both the development of primary lung cancer, as well as poor outcome after lung cancer diagnosis and treatment 24 . Targeting the knockdown of GAD1 in COPD may attenuate the increased risk of lung cancer in COPD patients.
LINC00930 was an additional novel gene detected in this study using machine-based learning. This is a "long intergenic non-protein coding (linc) RNA 930" (https://www.genecards.org/cgi-bin/carddisp.pl?gene=LINC00930). Interestingly, some other novel genes including LINC00942 and LINC00589 were also detected in this study. However, they were not as significantly regulated by smoking exposure or in COPD patients. In general, lincRNAs are found in between coding genes rather than antisense to them or within introns. Although the specific function of lincRNAs is not well known, they are thought to contribute to RNA stability in cells and hence gene expression. In our study, LINC009030 expression was significantly increased in healthy smokers overall, and in males and females, compared to healthy non-smokers. LINC009030 expression was significantly increased in smokers less than 50 years when only compared to non-smokers. In the over-50 age group, COPD patients had significantly less LINC009030 expression compared to healthy non-smokers. Regarding the smoking exposure, COPD patients and Stage II COPD patients who smoked less than 50 packs per year had significantly less LINC009030 expression compared to healthy non-smokers. In this study, LINC009030 was the only novel gene whose expression was significantly upregulated in smokers while significantly downregulated in COPD patients. Additional experimentation is required for elucidating the mechanism underlying this result in order to evaluate the therapeutic potential of targeting LINC009030 in smoking-related morbidities and in COPD patients.
SLITRK6 was the last novel gene, which we determined to be significantly regulated in smoking and in COPD in this study. SLITRK6 is a member of the SLITRK family of neuronal transmembrane proteins that was discovered as a bladder tumor antigen using suppressive subtractive hybridization 25 . Using immunohistochemistry, SLITRK6 has been shown to be extensively expressed in multiple epithelial tumors, including lung, bladder and breast cancer, as well as in glioblastoma 25 . In our study, we demonstrated that SLITRK6 was significantly downregulated in smokers and COPD patients (including stage II) compared to healthy non-smokers. Male smokers and COPD patients also had significantly lower SLITRK6 expression compared to non-smokers. In females, SLITRK6 expression was significantly less in smokers, but not COPD patients, compared to non-smokers. Smokers and COPD patients (including stage II) aged less than 50 years had significantly lower SLITRK6 expression compared to non-smokers. These effects were not evident in subjects over 50 years. Concerning the smoking exposure, COPD patients (including stage II) who smoked more than 50 packs per year had significantly lower SLITRK6 expression compared to non-smokers. This effect was also evident in COPD patients (including stage II) who smoked less than 50 packs per year. The highlight of this study was that the expression of the oncogenesis-promoting enzyme, GAD1, is significantly increased in response to smoking exposure and in COPD. Although SLITRK6 is considered a tumourogenesis promoting factor, it was significantly decreased in this study in responses to smoking exposure and in COPD. Here, we hypothesise that GAD1 may represent a better target in order to attenuate the incidence of cancer following COPD. However, further investigations are needed to explain the exact function of SLITRK6 in smoking-associated morbidities and in COPD.  More recently, machine-based learning algorithms have gained increasing attention in bioinformatics and biology research 26,27 . In contrast, regularization-based regression models (e.g. LASSO logistic regression) have already been used widely in microarray analysis 28 . Microarray analysis has a number of limitations including overfitting and multi-collinearity. In order to address these issues, regularization of parameters is required 29 . In this study, the area under the receiver operator characteristic curve (AUC), the sensitivity and specificity, and the misclassification error rate were quantitated for machine-based learning algorithm and penalty-based statistical method used. In this study based on repeated 5-CV, the elastic-net, random forest, and LASSO regularized logistic regression models were found to perform better than the naive Bayes, ridge, gradient boosting machines, adaptive boosting classification trees, extension of LASSO, artificial neural network, support vector machines, and decision tree models, respectively. Elastic-net regularization produced a sparse model with good prediction accuracy and good grouping-capability. This result is in keeping with those from previous studies, which demonstrated that elastic-net frequently performs better than ridge and LASSO for model selection consistency and prediction accuracy in microarray datasets 28,30,31 . Therefore, the results of this study are in agreement with those from previous ones.
In summary, we employed machine-based learning algorithms and penalized regression models in order to identify 44 candidate genes, whose expression was significantly regulated by smoking exposure and/or COPD. We also identified 17 novel genes which were not previously determined to be associated with smoking exposure or COPD. We determined that four of these novel genes, namely PRKARB2, GAD1, LINC00930 and SLITKR6, were the most significantly regulated by smoking exposure or in COPD. We also determined that elastic-net logistic regression in our dataset had a higher accuracy rate compared to the other algorithms. Therefore, in microarray data, elastic-net logistic regression may provide a useful methodology for future studies in the discovery of  novel diagnostic-and prognostic-biomarkers, and novel therapeutic targets in the treatment of COPD and other smoking-related diseases. The strengths of this study include the use of modern and accepted computational methods, the address of potential sources of bias, validation of all of the results by literature review, the use of appropriate cross-validation method (repeated 5-CV), enrichment analysis and the use of STRING networks. The main limitations of the study are the small sample sizes and that this is a case-control study only. Specifically, this study does not establish the respective associations between the 44 candidate genes and any COPD outcomes (e.g. lung function changes, exacerbations or mortality). Therefore, future studies by us will investigate and validate the candidacy of our 44 novel genes as novel therapeutic targets in COPD using larger patient cohorts. Furthermore, these studies, will investigate the association between these 44 novel genes and the change in lung function over time (FEV 1 or FEV 1 /FVC ratio), incidence of exacerbations and mortality in COPD patients.

Methods
Study subjects and dataset. In this study, 59 healthy smokers, 53 healthy non-smokers and 21 COPD smokers (9 of stage I, GOLD I and 12 of stage II, GOLD II) were included (Total: n = 133). Subjects were predominantly male (n = 95; 71.4%) and Caucasian (n = 48; 36.1%; Table 1). Pulmonary function tests from COPD patients revealed that forced vital capacity (FVC), forced expiratory volume 1 (FEV1) and the FEV1/FVC ratio were significantly lower in these patients compared with healthy smokers and healthy non-smokers (Table 1). From these subjects, the raw data of gene expression architecture in the small airway epithelium (SAE) cells of COPD was used 14 . 20,097 probes from 133 subjects were generated. Genome-wide gene expression analysis (GWAS) was performed using HG-U133 Plus 2.0 array (Affymetrix, Santa Clara, CA) 14 . Overall microarray Table 8. Comparison of relative expression of 44 candidate genes between healthy controls (smokers and nonsmokers) and COPD smoker patients (stage I and stage II) in number of pack of cigarette per year, separately. The Adj. P is based on the marginally adjusted values by the Benjamini-Hochberg-FDR correction at α = 0.05; Median ± Interquartile range. (2) It is good for simultaneous estimation and eliminating trivial genes; (3) Coefficients being easy to implement is another of the merits.
(1) It is not good for grouped selection; (2) For highly correlated variables, conventional methods have predictive performance empirically observed to be better than LASSO; (3) This method has shown to not always provide consistent variable selection; (4) Its estimators are biased always; (5) Its efficiency depends greatly on the number of dimension of genes.

Adaptive Least Absolute Shrinkage and Selection Operator
Adapt. LASSO (1) This method has all of advantages of the LASSO.
(2) This method uses adaptive weights to penalize coefficients differently; (3) Adaptive LASSO provides a more consistent solution than LASSO.
(1) It is not good for grouped selection; (2) For highly correlated variables, conventional methods have predictive performance empirically observed to be better than adapt. LASSO; (3) Its estimators are biased always; (4) Its efficiency depends greatly on the number of dimension of genes.
Elastic net regularization Elastic net (1) This method selects groups of correlated variables together, shares nice properties of both the LASSO and ridge; (2) It can be considered for situations with p > n, it allows the number of selected features to exceed the sample size; (3) This method has predictive performance better than LASSO and ridge.
(1) It can only apply to two-class feature selection problems, it cannot resolve multi-class feature selection problems directly; (2) Its estimators aren't robust against outliers

Ridge Logistic Regression Ridge
(1) It handles the multi-collinearity problem (2) Ridge regression can reduce the variance (with an increasing bias); (3) Can improve predictive performance than ordinary least square approach.
(1) It is not able to shrink coefficients to exactly zero; (2) It cannot perform variable selection; it includes all of predictors (e.g. genes) in the final model; (3) It cannot handles the overfitting problem.

Support Vector Machines SVM
(1) It has a regularization parameter for avoiding overfitting; (2) It uses the kernel trick; (3) It is defined by a convex optimization problem (no local optimization); (4) It is a powerful classifier that works well on a wide range of classification problems, in other words, it is very good when we have no idea on the data; (5) It can apply for high dimensional and not linearly separable situations.
(1) Choosing a good kernel function is not easy; (2) It has several key parameters that need to be set correctly to achieve the best classification results for any given problem; (2) It works well in the situation with a lot of main and interaction parameters; (3) It can automatically select variables; (4) It is robust to outliers and missing data; (5) It can handle the numerous correlated and irrelevant variables problems; (6) It is an ensemble learning.
(1) Long training time for large datasets; (2) Difficult to understand and interpret the model; (3) Prone to overfitting.

Naive Bayes NB
(1) It is easy to implement as a single learning; (2) If the its conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models (e.g. logistic regression); (3) It needs less training data than other algorithms; (1) Class conditional independence assumption for all of variables (e.g. genes); (2) It is defined by a local optimization problem.

Random Forest RF
(1) It can apply for high dimensional situations; (2) It is robust to outliers and missing data; (3) It has less variance than a single decision tree; (4) Training each tree perform independently.
(1) It is complex; (2) It requires more computational resources and are also less intuitive; (3) Its prediction process using random forests is time-consuming than decision trees; (4) It assumes that model errors are uncorrelated and uniform.
Artificial Neural Network ANN (1) It is easy to implement; (2) It can approximate any function between the independent and dependent variables; (3) It handles all possible interactions between the dependent variables; (4) It does not require any assumptions, in other words, it is very good when we have no idea on the data.
(1) It solved for local optimization; (2) Parameters are hard to interpret; Gene expression analysis. Raw data (.CEL format) files were qualified, normalized, statistical comparison, removing batch effects and other unwanted variation were performed by "Affy, " "Limma" and "SVA" R packages, respectively. The cutoff of false discovery rate and fold-change for differentially expressed genes was considered at level of 0.10, and more than 2, respectively.
Module identification. Sample progression discovery (SPD) as a novel unsupervised computational approach to identify patterns of biological progression underlying microarray gene expression data. SPD assumes that individual samples of a microarray dataset are related by an unknown biological process, and that each sample represents one unknown point along the progression of that process. SPD aims to organize the samples in a manner that reveals the underlying progression and to simultaneously identify subsets of genes that are responsible for that progression. This method does not depend on prior knowledge and only uses gene expression information 32 . In this method, divisive/consensus k-means as a clustering gene algorithm was used (200 iterations in each consensus k-means partitioning, and 0.7 threshold for module coherence). Also, least number of genes in each modules was 10. SPD analysis was done by MATLAB 7 software.
Machine learning (ML) algorithms. Various ML algorithms including AdaBoost classification trees, decision tree, Gradient Boosting machines, Naive Bayes, neural network, random forest, support vector machine were performed in order to find genes associated with occurrence and progression of COPD, and the best ML method which has best accuracy and performance to predict COPD. All ML methods were applied using "adabag", "CART", "gbm", "naivebayes", "neuralnet", "'randomForest", and "e1071" R packages.
Adaptive LASSO, elastic-net, and ridge logistic regression. The ridge regression uses an L 2 penalty to regularize parameters, all of the estimated coefficients are nonzero, and hence no gene selection is performed. But, LASSO regression use the L 1 penalty instead, and hence provide automatic gene selection. In other hand, ridge penalty tends to shrink the coefficients of correlated variables toward each other, good for multi-collinearity, grouped selection. But, the lasso penalty is somewhat indifferent to the choice among a set of strong but correlated variables. Therefore, LASSO is good for simultaneous estimation and eliminating trivial genes but not good for grouped selection. Elastic-net is introduced as a compromise between these two techniques, and has a penalty which is a mix of L 1 and L 2 penalty, combine strength between ridge and lasso 33 . In adaptive LASSO regression where adaptive weights, inverse absolute value of LASSO coefficient was used for each variable as its weight in adaptive LASSO, are used for penalizing different coefficients in the L 1 penalty. Similar to the lasso, the adaptive lasso is shown to be near-minimax optimal. Unlike to the LASSO, the adaptive LASSO is consistent for gene selection 34 . The mentioned penalized logistics regression methods were done by "glmnet" R package (https:// cran.r-project.org/package=glmnet). In Table 9, the summary of each machine-learning and penalized statistical methods with some of advantages and limitations were mentioned.
Gene set enrichment analysis. Gene set enrichment analysis is a method to identify classes of genes that are over-represented in a large set of genes and may have an association with disease phenotypes (e.g. occurrence of COPD). The Comprehensive gene set enrichment analysis web server 2016 update called "Enrichr" was applied 35 .
Cross-validation, stability and accuracy. K-fold cross-validation scheme (k-cv) is a very commonly employed technique used to evaluate classifier performance. K-CV estimation of the error is the average value of the errors committed in each fold. Thus, the K-CV error estimator depends on two factors: the training set and the partition into folds. Sensitivity analysis was performed to changes in the training set and sensitivity to changes in the folds 36 . The bootstrap (or subsampling) is another way to bring down the high variability of cross-validation, to aims stability selection 37 . Repeated cross-validation is a good strategy for (a) optimizing the complexity of regression models and (b) for a realistic estimation of prediction errors when the model is applied to new Gene Selection Method Name

Gene Selection Method Acronym Main Advantages Main Limitations
Decision Trees RT (1) Easy to interpret and explain as a single learning; (2) It is very fast; (3) Its estimators are robust against outliers; (4) Can be combined with other decision techniques; (5) It handles missing values and filling them in with the most probable value.
AdaBoost Classification Trees (Adaptive Boosting) ABCT (1) It can be less susceptible to the overfitting problem than most learning algorithms; (2) It combines a set of weak learners in order to form a strong classifier and selection of weak classifier is easy; (3) It is a machine learning meta-algorithm.
(1) It can be sensitive to noisy data and outliers; (2) Requirement of a large amount of training data and long training time.  38,39 . In the present study, the algorithms split the data set by using repeated random 100 times sub-sampling in 5-fold cross-validation, permuting the sample labels every time. Cross-validated performance was summarized by observed sensitivity and specificity, and misclassification error rate. Furthermore, the area under the Receiver Operator Characteristic (ROC) curve (AUC), was used to calculate of classifiers performance 40,41 . Also, in order to assessing literature validation for any results, literature mining was used in the PubMed databank. Interactive cluster heatmaps was applied by "heatmaply" R package (https://cran.r-project.org/web/packages/heatmaply) 42 . A heatmap is a popular graphical method for visualizing high-dimensional data. A static heatmap, as an interactive heatmaps, was used to represent biological data, in which colors are used to represent the values (importance index) in a matrix where columns and rows are the machine learning and statistical methods (instances) and genes selected (attributes), respectively. Rows and columns are sorted using a hierarchical clustering technique 43 . The study's follow chart is shown in Fig. 1.