Novel gene signatures for stage classification of the squamous cell carcinoma of the lung

The squamous cell carcinoma of the lung (SCLC) is one of the most common types of lung cancer. As GLOBOCAN reported in 2018, lung cancer was the first cause of death and new cases by cancer worldwide. Typically, diagnosis is made in the later stages of the disease with few treatment options available. The goal of this work was to find some key components underlying each stage of the disease, to help in the classification of tumor samples, and to increase the available options for experimental assays and molecular targets that could be used in treatment development. We employed two approaches. The first was based in the classic method of differential gene expression analysis, network analysis, and a novel concept known as network gatekeepers. The second approach was using machine learning algorithms. From our combined approach, we identified two sets of genes that could function as a signature to identify each stage of the cancer pathology. We also arrived at a network of 55 nodes, which according to their biological functions, they can be regarded as drivers in this cancer. Although biological experiments are necessary for their validation, we proposed that all these genes could be used for cancer development treatments.

As GLOBOCAN reported in 2018, lung cancer was the first cause of deaths and new cases by cancer worldwide 1 . Squamous cell carcinoma of the lung (SCC) is one type of lung cancer which comprises approximately 30% of all lung cancer cases. The available molecular targets for use in the treatment of SCC of the lung are behind of other types of cancer [2][3][4] . Recent advances in the treatment have been achieved using immunotherapy as nivolumab and pembrolizumab and some clinical trials are being conducted to test molecular targets 3,4 . Some efforts to understand the basis of the disease have been made using gene expression profiles, DNA sequencing and SNP arrays 2 . However, there are few preclinical murine models, some SCC of the lung cell lines have errors in their classification and molecular targets usually found in other types of lung cancer as lung adenocarcinoma are rarely present in SCC of the lung 4 . Lung cancer is classified into two wide groups as follows: Small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC represents 85% of all lung cancer cases. From this group the most prevalent are the adenocarcinoma and the squamous cell carcinoma of the lung 2,5 . Lung cancer survival is less than 5% after 5 years and most of them metastasize. Most of the time lung cancer is detected in advanced stages in which treatment is less effective. The best treatment is surgery although the effectivity of the treatment is linked to early stages of the disease [6][7][8][9] . Smoking is considered as a risk factor associated to lung cancer development 2 . Network analysis is widely used in different areas including biological sciences with a wide variety of results. There are different metrics that can be obtained from networks as the hubs which are commonly referred as the most connected nodes which lead to network instability if they are perturbed [10][11][12][13] . Besides, other network measures as betweenness and multivariate entropy have been used to analyze cancer networks to find putative potential targets for cancer disease 14,15 . We previously identified a set of nodes which due to its biological and network properties we called them network gatekeepers 16 . The latter was done by visual inspection. Gatekeepers have few nearest-neighbor interactions with other proteins, but these proteins have plenty of interactions. Gatekeepers might not be detected by standard differential gene expression analyses.
In this work, we use clustering centrality as a metric for a better and quicker identification of gatekeepers 16 . Machine learning algorithms have been applied to a wide variety of phenomena 17 . Health sciences have a special interest in the applications of these techniques due to the vast data publicly available with the objective to achieve better diagnosis and treatments of diseases. Some of the analyzed data with these approaches include analysis of histopathological images [18][19][20] . In this article, we make an analysis of the carcinogenic process of the squamous cell carcinoma of the lung using cutting-edge techniques as network and machine learning analyses to obtain sets of genes which could function as a signature to aid in the classification of patient tumor samples into one of the

Results
Carcinoma-stage classification model derived from machine learning. Data collected in GSE33479 was used to train a supervised machine learning algorithm. A logistic regression model was trained to classify the eight stages of the small cell lung carcinoma. Logistic regression models have been shown to provide accurate non-linear classification models of complex data 21 . A parameter reduction procedure was applied to the trained model. For this, the parameters of the model were standardized so that the parameters follow a standard normal distribution. The parameters whose value was beyond 0.78 from the mean were selected as relevant parameters. The procedure of model training and parameter selection was applied two times. On first parameter reduction, a total of 800 genes out of 41,067 were selected; on the second round a total of 15 relevant genes were selected. When using the subset of 800 genes a logistic regression model was trained and tested with records of all 122 patients records with all correctly classified, when considering the set of 15 genes the trained model was able to correctly classify the healthy stage and the first stages of small cell carcinoma and presented 7 cases of misclassification on later stages Fig. 1a. A neutral control was designed by considering a total of 500 random sets of 15 genes, on each random set a logistic regression model was trained, and its accuracy was measured using the Jaccard index and compared with the Jaccard index of a model trained with the selected set of 15 genes. The Jaccard index measures the proportion of correctly categorized cases by the model Fig. 1b. The Jaccard index from the set of genes derived from the parameter reduction method was of 0.92 whereas for the random sets the maximum Jaccard index was 0.29. When considering the set of 15 selected genes coupled with the 26 genes identified from previous analysis on PPI networks resulted in a trained model with a Jaccard index of 1. APID PPI data was used for network analysis of the results from the implemented machine learning technique for the first glance results of approximately 800 genes. APID was used due to better coverage of most part of the 800 genes.
Differential gene expression analysis. The first step was to carry out an exploratory network analysis which is shown in Figs. 2a,b. These networks are obtained from joining the results from the Differential Gene Expression (DGE) to Mentha network database and then the application of Eq. 1 to highlight the network gatekeepers. Figure 2a shows in the inset the color scale, which was applied in Fig. 2a,b. The minimum degree (number of connections of a node) value is 1 which is yellow in color, the most connected nodes are in navy blue purple whose value is 75. Figure 2a shows in red the connections that every gatekeeper has, and they are marked by bigger yellow circles. It can be observed that all of them are connected to other nodes, but they, at first glance, do not appear to be of importance because of the few connection they have. In Fig. 2b can be observed in red, not only the gatekeeper's connections but also the connections of the first connected nodes and how they have much more connections which comprises most of the network. The nodes at which gatekeepers are connected are hubs due to the highly connections they have. www.nature.com/scientificreports/ In Fig. 3, a zoom of the graph of Fig. 2 is presented and each node is tagged with its HGNC name tag. It can be observed to which nodes some of the gatekeepers are connected. For example, a connection (red) to MYC protein (purple) can be observed. This protein is considered as an oncogene frequently associated with poor outcomes; a gatekeeper is linked to other proteins as MEOX2 whose possible function in some cancers is to be a suppressor gene 22,23 . Every one of the gatekeepers are linked to highly connected nodes which have relevant biological functions. Table 1 summarizes in a list the results of the network analysis and the application of the machine learning algorithm in the GEO data set. A list of 26 gatekeepers' proteins is displayed in the first column which were the proteins with a clustering centrality of 1 obtained by the network analysis. This set of genes were used as an input list for the machine learning algorithm in which the results showed that they can be used to identify each carcinogenic stage with great accuracy. Second and third columns are two lists that contain a reduced set obtained only from the machine learning algorithm to classify each sample into its corresponding stages. Second column contains the probe tag used by the chip. Using two different methods we obtained two list of potentially gene sets that could be used as an aid to help classify samples and whose biological functions denote their potential use as targets for therapy. Further experiments are needed to probe its potential use.
An enrichment test was performed using the gatekeepers list to discover the main pathways associated with them as shown Table 2. The first characteristic is that every category is overrepresented, which means that in each presented category there are more genes from the input list than it can be expected (using as reference the Homo sapiens REFLIST) and most of the processes are involved in cell cycle-related specially in mitosis.
The next step was to search in distinct databases the list of genes obtained by the machine learning algorithm. We selected two pathway databases: the reactome pathways and the KEGG pathways. In Table 3, it can be observed a list of 8 genes for which information was available. The first column corresponds to its name, the second column to the related pathways in Reactome and the third to KEGG pathways. Some of the related pathways are usually altered in some types of cancer as Beta-catenin independent WNT-signaling, SMAD2/ SMAD3, tight junction, ABC transporters, etc. [24][25][26][27] .
A network was made based in the results obtained from the machine learning algorithm first glance which comprised approximately 800 genes. It was observed a big component (when a significant proportion of the nodes in a graph are connected) created by some nodes as seen in Fig. 4a.
An interesting characteristic of the identified network by the machine learning algorithm was that some of the network gatekeepers identified by the DGE analysis were connected to this big component as seen in Fig. 4b. Some biological functions of some nodes in this network are well known to be relevant for cancer progression as PTEN which is a tumor suppressor altered in some types of cancer, as well as others like MCL1 which is an anti-apoptotic protein altered in some types of cancers. Also, MCL1 is being studied as a target for cancer patient treatment in small cell lung cancer 28,29 . FAR1 is observed to play an essential role in the production of ether lipids/ plasmalogens whose synthesis requires fatty alcohol. ABCA1 catalyzes the translocation of specific phospholipids from the cytoplasmic to the extracellular/luminal leaflet of membrane coupled to the hydrolysis of ATP. In cancer it was observed that its inhibition plays an important role for cancer survival due to an increase of mitochondrial www.nature.com/scientificreports/ cholesterol. The function of DMRT3 is not clear. It is thought to function as a transcription factor. In a study of lung cancer, the dysregulations of DMRT3 along with other two proteins were considered specific for lung squamous cell carcinoma 30 . AAK1 is a kinase that participates in the regulation of clathrin-mediated endocytosis. It was discovered that in β-Catenin-dependent WNT signal a negative feedback loop is created by its expression. ASF1B Is a histone chaperone which facilitates histone deposition, exchange, and removal during nucleosome assembly/disassembly; in cervical cancer it was observed that it functions as an oncogene accelerating cancer cells proliferation. APOC1 functions as an inhibitor of lipoprotein binding to the low-density lipoprotein (LDL) receptor. In gastric cancer it was proposed as a potential diagnostic and prognostic biomarker; in colorectal cancer evidence points out to have a promoting role in carcinogenesis. ADRA1B is an alpha-adrenergic receptor whose action is mediated by association with G proteins; in gastric cancer it was found a methylation promoter and it could be frequently involved in development and gastric cancer progression. These mentioned proteins are other examples of gene protein products whose functions are or could be involved with cancer disease [30][31][32][33][34][35][36][37] .

Discussion
Lung cancer is the deadliest type of cancer, most of the diagnosed cases are made in the last stages of the disease and there are little available treatment options which could have an important effect. Small cell carcinoma of the lung comprehends a great part of all lung cancers. Our present results provide a better comprehension of the underlying components of the disease. The detection of genes and proteins that could be implicated in the carcinogenic process is urgently needed to provide better options for treatments and diagnosis. Herein, we performed a thorough search of genes and proteins that could be used to offer better treatments and diagnosis options. We made a comprehensive analysis of all the carcinogenic process and observed that some set of genes could be used as an aid for small cell carcinoma of the lung stage classification. We employed two pathways to identify relevant genes for diagnosis. The first was based in a classic method as DGE analysis with the aid of more www.nature.com/scientificreports/ recent techniques as network analysis with a novel concept as the network gatekeepers which are encountered by using clustering centrality. DGE analysis was used as an exploratory analysis to look for possibly patterns in the gene expression for each stage. Although a general panorama of the carcinogenic process was obtained, we wanted to summarize it into a small meaningful set of genes with high involvement in cancer development. To make this possible we used the output data of the DGE as the input for the network analysis and then search for the network gatekeepers. The other pathway was based in another cutting-edge technique, machine learning algorithms. Hitherto, machine learning applications on cancer have been for assessing cancer prediction and prognosis 38 . These results are based on the analysis of a wide set of variables including biomarkers and clinical factors such as age, location and type of cancer, and size of tumor [39][40][41] . The results presented in here are not intended for cancer prevention or survivability directly, rather they provide a set of specific genetic biomarkers whose analysis can lead to an immediate diagnosis about the stage of development of small cell carcinoma in a patient. The analysis of the proposed genetic biomarkers can differentiate even the earliest stages of cancer development and lead a physician to administer the required treatment when the probabilities of survivability of the patient are higher. For each method we found a set of genes which we proposed to be useful as an aid for stage classification and due to the important biological roles in which they are involved they could also be useful for further validation as possible targets for treatment. The biological roles of the gatekeepers proposed set are marked as cell cycle regulation, DNA-repair breaks, nucleosome assembly and processes that occur in the mitotic phase. It was first observed in the gatekeeper network that most of them exhibit scarce connections but their first neighbor nodes which are directly connected are hubs (highly connected nodes). This along with the processes they are involved may permit an access for prior processes regulated by the hubs. It is known that network hubs are of high importance for network stability, but in this work, we are observing that it can be of great importance to use the network gatekeepers as a measure to find key components in a biological context. The machine learning algorithms are usually used in other fields to improve the understanding of a wide variety of processes. In the case of cancer its aim is to find new targets and possible key proteins that regulate cancer. We found a reduced set of genes that can be used for stage classification in a set of microarray data and this also can be done with the set of gatekeepers. The biological functions of each of the identified genes are relevant for normal stages and as previously observed for cancer development. For example, in the case of the reduced set Table 1. Gene list from network gatekeepers and machine learning algorithm. Some Id are labeled as ** which means is a Missing Id. The first column corresponds to Gatekeepers list with 26 genes and the second column to the probe tag id in the microarray chip for 15 genes found with the machine learning method. The third column are the HGCN tags for each probe id of the second column. Second and third columns list finished when blank fields were present.   www.nature.com/scientificreports/ obtained with machine learning, ARPC5 is a protein whose normal function is involved in EPH-Ephrin signaling and tight junction regulation and they are involved in cancer processes as adhesion, migrations, invasion or growth. This protein was recently proposed to be a prognostic biomarker for patients with multiple myeloma 42 .
In the case of ABCA1 is a protein whose inhibition promotes cancer progression 32 . Genes identified in the big component obtained from the machine learning first set was also analyzed and observed that some of them were previously studied in cancer and that their functions are involved with them. It is necessary to study these proteins in the context of squamous cell carcinoma of the lung, as it is known that the function is dependent of the type of tissue, microenvironment, and type of cancer. Our combined approach of DGE analysis plus the use of the metric of clustering centrality together with the application of machine learning algorithms, will facilitate the identification of relevant components in biological networks as the ones derived from cancer data.

Conclusions
We found a small set of genes possibly involved in the development of the disease. We propose two sets of genes which could help in the classification of tumor samples. These findings can increase the available options for experimental assays and molecular targets that could be used in novel treatment development. Although further experimental research is needed to validate their utility in the clinical setting.

Methods
A graphical flowchart that summarizes the methods and the data bases is shown in Fig. 5. Data collection was made by using various databases: Gene expression Omnibus for gene expression patients set, GEO accession: GSE33479, which comprises 122 patient samples representing the carcinogenic stages. Samples were divided as: 13 normal histology and normo-fluorescent, 14 with normal histology and hypo-fluorescent, those were grouped as the control group, 15 metaplasia samples, 13 mild dysplasia, 13 moderate dysplasia, 12 severe dysplasia, 13 carcinoma in situ, and 14 for squamous cell carcinoma of the lung. The gene expression platform was Agilent-014850 Whole Human Genome, Microarray 4 × 44 K G4112F. Processing and differential gene expression analysis were performed using R v3.5.1 software (http://www.R-proje ct.org). Processed data retrieval was performed by GEOquery R package. Hgug4112a.db R package was used to annotate each gene ID to the data 43 . Using limma package differential gene expression (DGE) analysis was used to compare each stage vs the normal. Limma package fits a generalized linear model before comparisons and then calculate a moderate t-statistic for each contrast 44,45 . A p-value is obtained which is adjusted based in Benjamini and Hochberg False Discovery Rate correction 44,46 . A list from the DGE was obtained for each comparison, results were merged to obtain a new list with all differentially genes. Full Human Interactome was downloaded from Mentha and APID database 47,48 . Protein-protein interactions (PPI) level 0 data (all reported proteins pairs) was obtained from APID. Cleaning process for both networks was made using Cytoscape software (Networks for Figs. 2 and 3 were created using this software) which comprised: deletion of repeated interactions, deletion of protein interactions detected in other organism, deletion of self-loop interactions in proteins 49 . Both databases are public and free to use. The merged list resulted from DGE (using a filter of p < 0.05 and Fold change < − 1.5 & > 1.5) from the microarray data were coupled with Mentha PPI dataset which allowed to create a new network of PPIs which were used as a template to identify the network gatekeeper's proteins using clustering centrality measure Eq. (1). Mentha PPI data was used due to the better coverage of the genes that appeared in list of DGE analysis. To calculate clustering centrality measure we used the following Eq. (1): where C i is the clustering coefficient of a node i and is defined as the fraction E i of existing connections among its k i nearest neighbors divided by the total number of possible connections. www.nature.com/scientificreports/