Machine Learning and Bioinformatics Models to Identify Pathways that Mediate Influences of Welding Fumes on Cancer Progression

Welding generates and releases fumes that are hazardous to human health. Welding fumes (WFs) are a complex mix of metallic oxides, fluorides and silicates that can cause or exacerbate health problems in exposed individuals. In particular, WF inhalation over an extended period carries an increased risk of cancer, but how WFs may influence cancer behaviour or growth is unclear. To address this issue we employed a quantitative analytical framework to identify the gene expression effects of WFs that may affect the subsequent behaviour of the cancers. We examined datasets of transcript analyses made using microarray studies of WF-exposed tissues and of cancers, including datasets from colorectal cancer (CC), prostate cancer (PC), lung cancer (LC) and gastric cancer (GC). We constructed gene-disease association networks, identified signaling and ontological pathways, clustered protein-protein interaction networks using multilayer network topology, and analyzed the survival function of the significant genes using the Cox proportional hazards (Cox PH) model and the product-limit (PL) estimator. We observed that WF exposure alters the expression of many genes (36, 13, 25 and 17, respectively) whose expression is also altered in CC, PC, LC and GC. Gene-disease association networks, signaling and ontological pathways, protein-protein interaction networks, and survival functions of the significant genes suggest ways that WFs may influence the progression of CC, PC, LC and GC. This quantitative analytical framework has identified potentially novel mechanisms by which WF exposure may lead to changes in tissue gene expression that affect cancer behaviour and, thus, cancer progression, growth or establishment.

Welders inhaling WFs in large quantities over a long period run a significantly elevated risk of developing certain types of cancer 1,2. These metastatic diseases involve uncontrolled or neoplastic growth of cancer cells that arise after the accumulation of genomic mutations, but other factors with powerful effects on cancer behaviour and growth include genetic factors and environmental factors to which the sufferer is exposed 4. Environmental factors include inhaled toxic fumes that affect the lungs and enter the circulation to reach many tissues, and which can affect the gene expression of cancer cells and thereby their behaviour, survival, growth and invasiveness. Thus, influences such as WF inhalation affect the progression of many types of cancers, including those focused on in this study, specifically CC, PC, LC and GC, which are among the cancers most commonly linked with WF exposure [5][6][7]. The aim of this study is therefore to identify mechanisms through which WFs may increase cancer incidence.
LC is one of the most lethal types of cancer and globally is a leading cause of death 1,2,8. WFs contain toxic metallic oxides and silicates that directly affect the sensitive tissues of the lung when inhaled; this manner of exposure (by inhalation) makes LC the cancer with the highest risk for welders 9. CC arises in the colon and the rectum and has a typical 5-year survival rate of about 60%. It damages the colon or rectum through uncontrolled and invasive cell growth 10. Iron, aluminum and magnesium oxides in welding fumes are known to affect the incidence of CC 9, although this is not well understood. PC affects the prostate, the gland that produces seminal fluid and controls the transportation of sperm 11. Nitrogen oxides, carbon dioxide and phosgene, all found in WFs, are risk factors for prostate neoplasms 9. GC (gastric or stomach cancer) 12 is linked to exposure to nickel, beryllium and cobalt oxides, which are all present in WFs 9.
In this study, we developed a systematic and quantitative network-based approach to investigate the effects of WFs on gene expression and how these effects may give a clue as to how they encourage the incidence and progression of cancers through affecting pathways and pathway genes that are also altered in these cancers. Thus, we compared gene expression effects of WF exposure with the altered pattern of gene expression seen in CC, PC, LC and GC. This involved, firstly, analyzing differentially expressed gene profiles, then filtering these genes through gene-disease association networks, signaling and ontological pathways, and protein-protein interaction networks. We also investigated the importance of genes and pathways thus identified by using the gold benchmark databases dbGaP and OMIM to identify evidence to support the involvement of these genes in pathological processes such as cancer development. Moreover, we analysed patient survival and its association with the genes that are dysregulated in both the WF-exposed tissue and the four types of cancers. The influence on cancer patient survival of these identified genes provides evidence for their involvement in WF effect on cancer progression.

Methods and Materials
Overview of the analytical approach. We applied an analytical approach to identify links between WF exposure and the incidence of the cancers using selected microarray datasets; a block diagram of the approach is shown in Fig. 1. This quantitative approach takes the genes differentially expressed in WF exposure and identifies those that are also differentially expressed in each cancer study. These shared differentially expressed genes were then used to construct a gene-disease (diseasome) association network, to identify signaling and ontological pathways, to build a protein-protein interaction (PPI) network and to perform survival function analysis. The approach also used the gold benchmark databases OMIM and dbGaP to validate the genes and pathways identified in our study as showing possible disease associations.

Datasets employed in this study.
To identify the gene expression dysregulation that is common to WFs and the four types of cancers under investigation, we analyzed gene expression microarray datasets from the National Center for Biotechnology Information (NCBI). We examined five different microarray datasets with accession numbers GSE62384, GSE25071, GSE55945, GSE10072 and GSE2685 13-17. Dataset GSE62384 was produced using human upper airway epithelial cells (RPMI 2650) exposed to spark-generated WFs. These data were generated from cells exposed to WFs for 6 hours continuously at low (85 μg/m³) and high (760 μg/m³) concentrations. The CC dataset (GSE25071) consists of microarray data taken from 17 colorectal cancer sufferers who had late-onset CC (mean age 79 years) and 24 patients with early-onset CC (mean age 43 years). The PC dataset (GSE55945) contains microarray data on RNA taken from radical prostatectomy tissue from prostate cancer patients at the Beth Israel Deaconess Medical Center, comparing tissue from PC sufferers (Gleason score 6 or 7) with normal prostate tissue. The LC dataset (GSE10072) contains microarray data comparing normal lung tissue and lung adenocarcinoma tissue collected from 26 former smokers, 20 non-smokers (who never smoked) and 28 current smokers; gene expression data are reported by comparing 49 non-tumor and 58 tumor lung tissues. The GC dataset (GSE2685) contains microarray data from 22 gastric cancer and 8 non-cancerous gastric tissues.
To analyze the patient survival association of the altered genes that are common to WFs and the four types of cancers under investigation, we retrieved clinical and RNAseq data for CC, PC, LC and GC from the cBioPortal 18,19 23. We employed six clinical factors (ethnicity, anatomical site of cancer, histological grade of cancer, primary tumour site, and neoplasm status with tumour) to analyze the survival association of these altered genes. A summarized description of the datasets is shown in Tables 1 and 2.
www.nature.com/scientificreports
Analysis methods. Microarray-based gene expression analysis is a global and sensitive method to identify and quantify possible molecular mechanisms that underlie human disorders 24. We used this approach to analyze the gene expression profiles of CC, PC, LC and GC to find the genetic effects of WFs that may influence the development of these cancers. To allow comparisons of the mRNA expression data generated using different platforms and to avoid complications arising from the different experimental systems employed in the original studies, we normalized the gene expression data by means of a Z-score transformation ($Z_{ij}$) for each type of cancer tissue gene expression profile using

$$Z_{ij} = \frac{g_{ij} - \mathrm{mean}(g_i)}{\mathrm{SD}(g_i)}$$

where SD denotes the standard deviation and $g_{ij}$ denotes the expression value of gene $i$ in sample $j$. After this transformation, gene expression values of different diseases measured on different platforms can be directly compared. We applied unpaired t-tests to find the genes differentially expressed in each disease relative to control data and selected the significantly dysregulated genes. We chose a threshold of at least 1 for the absolute $\log_2$ fold change and a p-value for the t-tests of $\leq 1 \times 10^{-2}$. We employed the neighborhood-based benchmark and the multilayer topological methods to find gene-disease associations.
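The normalization and selection rule described above can be illustrated with a minimal sketch. It uses only the Python standard library; the gene names and expression values are hypothetical, the choice of the sample standard deviation is an assumption, and the function names `z_score_rows` and `is_dysregulated` are introduced here for illustration only. The t-test itself is not reproduced; the sketch takes a p-value as given.

```python
from statistics import mean, stdev


def z_score_rows(expression):
    """Z-score each gene across its samples: (g_ij - mean(g_i)) / SD(g_i)."""
    normalized = {}
    for gene, values in expression.items():
        mu = mean(values)
        sd = stdev(values)  # sample SD; an assumption of this sketch
        normalized[gene] = [(v - mu) / sd for v in values]
    return normalized


def is_dysregulated(log2_fold_change, p_value, fc_cutoff=1.0, p_cutoff=1e-2):
    """Apply the stated thresholds: |log2 FC| >= 1 and p-value <= 1e-2."""
    return abs(log2_fold_change) >= fc_cutoff and p_value <= p_cutoff


# Hypothetical toy matrix: gene -> expression values across three samples.
toy = {
    "GENE_A": [2.0, 4.0, 6.0],
    "GENE_B": [10.0, 10.5, 11.0],
}
z = z_score_rows(toy)
```

After the transformation both toy genes are on the same scale even though their raw ranges differ, which is the property the cross-platform comparison relies on.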
We constructed a gene-disease network (GDN) using the gene-disease associations, where the nodes in the network represent either a gene or a disease. This network can also be regarded as a bipartite graph. The primary condition for a disease to be connected with other diseases in the GDN is that they share at least one significantly dysregulated gene. Let $D$ be a specific set of diseases and $G$ a set of dysregulated genes; gene-disease association analysis attempts to determine whether gene $g \in G$ is associated with disease $d \in D$. If $G_i$ and $G_j$ are the sets of significantly dysregulated genes associated with diseases $D_i$ and $D_j$ respectively, then the number of shared dysregulated genes $n_{ij}^{g}$ associated with both $D_i$ and $D_j$ is as follows 25:

$$n_{ij}^{g} = N(G_i \cap G_j)$$

Common neighbours are scored using the Jaccard coefficient method, where the edge prediction score for a node pair is 26:

$$E(i, j) = \frac{N(G_i \cap G_j)}{N(G_i \cup G_j)}$$

where $G$ is the set of nodes and $E$ is the set of all edges. We used the R software packages "comoR" 27 and "POGO" 28 to cross-check the gene-disease associations.
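The shared-gene count and the Jaccard score can be sketched with plain Python sets. The disease labels and gene identifiers below are hypothetical placeholders, and `disease_links` is an illustrative helper rather than part of comoR or POGO.

```python
from itertools import combinations


def shared_gene_count(genes_i, genes_j):
    """n_ij^g: number of dysregulated genes common to two diseases."""
    return len(genes_i & genes_j)


def jaccard(genes_i, genes_j):
    """Edge score: |G_i intersect G_j| / |G_i union G_j|."""
    union = genes_i | genes_j
    return len(genes_i & genes_j) / len(union) if union else 0.0


def disease_links(disease_genes):
    """Link two diseases only if they share at least one dysregulated gene."""
    links = {}
    for a, b in combinations(sorted(disease_genes), 2):
        n_shared = shared_gene_count(disease_genes[a], disease_genes[b])
        if n_shared >= 1:
            links[(a, b)] = (n_shared, jaccard(disease_genes[a], disease_genes[b]))
    return links


# Hypothetical gene sets for three diseases.
toy_sets = {
    "D1": {"g1", "g2", "g3"},
    "D2": {"g2", "g3", "g4"},
    "D3": {"g5"},
}
links = disease_links(toy_sets)
```

In this toy case only D1 and D2 are connected, with two shared genes out of four in the union, so D3 remains isolated in the network.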
To investigate how molecular determinants from the WF-exposed tissues relate to gene expression alterations in the cancers, we analyzed pathways and gene ontology using Enrichr 29,30. We used the KEGG, WikiPathways, Reactome and BioCarta databases for analyzing signaling pathways [31][32][33][34]. We used the GO Biological Process and Human Phenotype Ontology databases for ontological analysis 35,36. We also constructed a protein-protein interaction sub-network for each cancer type using the STRING database, a biological database and web resource of known and predicted protein-protein interactions 37. Furthermore, we examined the validity of our study by employing two gold benchmark databases, OMIM and dbGaP.
To determine the patient survival association of the altered genes that are common to WFs and the four types of cancers under investigation, we employed the Cox PH model for univariate and multivariate analysis 38,39. The Cox PH model can be written as follows:

$$h(t \mid X_i) = h_0(t) \exp\left(X_i^{T} \beta\right)$$

Here $h(t \mid X_i)$ is the hazard function conditioned on subject $i$ with covariate information given as the vector $X_i$, $h_0(t)$ is the baseline hazard function, which is independent of covariate information, and $\beta$ is the vector of regression coefficients corresponding to the covariates. We calculated the hazard ratio (HR) from the estimated regression coefficients of the fitted Cox PH model to determine whether a specific covariate affects patient survival. The HR for a covariate $x_r$ is $\exp(\beta_r)$; that is, the HR for any covariate is obtained by applying the exponential function to the corresponding coefficient $\beta_r$. The survival status of a patient can be estimated using the PL estimator 40 of the survival function, which can be defined as follows:

$$\hat{S}(t_j) = \prod_{k=1}^{j} \left(1 - \frac{d_k}{n_k}\right)$$

Here $\hat{S}(t_j)$ is the estimated survival function at time $t_j$, $d_j$ is the number of events that occurred at $t_j$, and $n_j$ is the number of subjects at risk at $t_j$. After estimating the survival function, two or more groups can be compared using a log-rank test. We used log-rank tests to detect the genes for which patient survival time differs most significantly between the altered and normal (non-altered) gene expression groups. The null hypothesis for this test can be symbolically explained as follows:

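$$H_0: S_{\text{altered}}(t) = S_{\text{normal}}(t) \qquad H_A: S_{\text{altered}}(t) \neq S_{\text{normal}}(t)$$

Here $H_0$ states that the survival functions are the same for the altered and normal groups of a gene, and $H_A$ states that the survival functions of these two groups differ.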
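A two-group log-rank comparison of this kind can be sketched as follows. This is a textbook formulation written with the standard library only, not the software used in the study; the follow-up times are hypothetical, and each subject is represented as a `(time, event)` pair where `event` is true when the event was observed and false when the subject was censored.

```python
import math


def log_rank_test(group_a, group_b):
    """Return (chi-square statistic, p-value) for a two-group log-rank test.

    Each group is a list of (time, event) pairs.
    """
    pooled = group_a + group_b
    event_times = sorted({t for t, event in pooled if event})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for time, _ in group_a if time >= t)  # at risk in A
        n_b = sum(1 for time, _ in group_b if time >= t)  # at risk in B
        d_a = sum(1 for time, event in group_a if time == t and event)
        d_b = sum(1 for time, event in group_b if time == t and event)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical follow-up times: the altered group fails early, the normal group late.
altered = [(1, True), (2, True), (3, True), (4, True)]
normal = [(10, True), (11, True), (12, True), (13, True)]
chi2, p_value = log_rank_test(altered, normal)
```

With these toy inputs the test rejects the null hypothesis at the 0.05 level, whereas comparing a group with itself gives a statistic of zero.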
If the survival function of a specific gene differs between the altered and normal groups, we include that gene in the combined Cox PH model. This approach is efficient for learning the effect of a specific gene of interest on patient survival in the presence of the clinical factors.
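The product-limit estimate and the hazard ratio described in this section can be sketched in a few lines of standard-library Python. The follow-up data are hypothetical, and `product_limit` and `hazard_ratio` are illustrative names; fitting the Cox PH coefficients themselves is outside the scope of this sketch.

```python
import math


def product_limit(samples):
    """Product-limit (Kaplan-Meier) estimate of the survival function.

    `samples` is a list of (time, event) pairs; event is False for censoring.
    Returns a list of (t_j, S_hat(t_j)) at each distinct event time.
    """
    survival = 1.0
    curve = []
    for t in sorted({time for time, event in samples if event}):
        n_j = sum(1 for time, _ in samples if time >= t)  # subjects at risk
        d_j = sum(1 for time, event in samples if time == t and event)
        survival *= 1.0 - d_j / n_j
        curve.append((t, survival))
    return curve


def hazard_ratio(beta_r):
    """HR for a covariate is the exponential of its Cox PH coefficient."""
    return math.exp(beta_r)


# Hypothetical cohort: three events and one censored subject (time 2).
cohort = [(1, True), (2, False), (3, True), (4, True)]
curve = product_limit(cohort)
```

A coefficient of zero gives a hazard ratio of one (no effect on the hazard), a positive coefficient gives a ratio above one, and a negative coefficient gives a ratio below one.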

Results
Gene expression analysis. To identify and investigate the gene expression effects of WFs that may influence the behaviour of various types of cancer, we analyzed the gene expression microarray data collected from the National Center for Biotechnology Information (NCBI). We observed 903 differentially expressed genes in WF-exposed cells after applying the adjusted significance threshold. We also employed a cross-comparative analysis to find the genes with altered expression that are common to WFs and each cancer type. We found that WF-treated cells share a number of differentially expressed genes with CC (36 dysregulated genes), PC (13 genes), LC (25 genes) and GC (17 genes). To identify the significant associations between these cancer types and the effects of WF exposure, we constructed two separate gene-disease association networks for up- and down-regulated genes using Cytoscape plugins 41, centered on the WF data, as shown in Fig. 2(a,b). The necessary condition for two diseases to be associated is that they have at least one differentially expressed gene in common. Notably, two significant genes, C2orf88 and IGFBP5, were differentially expressed in WF exposure, CC and PC, and three significant genes, FCGBP, IQGAP2 and HPGD, were common to WF exposure, CC and GC. One gene, FGFR3, was commonly dysregulated in WF exposure, CC and LC.
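The cross-comparison step can be pictured as set intersection followed by a grouping of genes by the cancers in which they recur. The sketch below is illustrative only: the gene sets contain just the six genes named in the paragraph above, not the full lists of 36, 13, 25 and 17 shared genes, and `common_genes` and `genes_by_cancers` are hypothetical helper names.

```python
from collections import defaultdict


def common_genes(wf_genes, cancer_genes):
    """Genes differentially expressed in both WF exposure and a cancer."""
    return set(wf_genes) & set(cancer_genes)


def genes_by_cancers(shared_with_wf):
    """Map each gene to the cancers sharing it with WFs (two or more only)."""
    membership = defaultdict(list)
    for cancer in sorted(shared_with_wf):
        for gene in shared_with_wf[cancer]:
            membership[gene].append(cancer)
    return {gene: cancers for gene, cancers in membership.items() if len(cancers) >= 2}


# Partial, illustrative sets built only from the genes named in the text.
shared_with_wf = {
    "CC": {"C2orf88", "IGFBP5", "FCGBP", "IQGAP2", "HPGD", "FGFR3"},
    "PC": {"C2orf88", "IGFBP5"},
    "GC": {"FCGBP", "IQGAP2", "HPGD"},
    "LC": {"FGFR3"},
}
multi_cancer = genes_by_cancers(shared_with_wf)
```

Genes that appear under more than one cancer are the ones that connect those cancers through the WF node in the gene-disease association network.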
Pathway and functional association analysis. Pathways consist of a series of molecular interactions in a cell and are key to understanding the internal changes of an organism. Pathway-based analysis can be used to identify molecular or biological mechanisms that underlie the development of complex diseases 42,43. We analyzed pathways of the genes with commonly altered expression in WF exposure and in the cancers using Enrichr, a comprehensive web-based gene set enrichment analysis tool 29,44. Signaling pathways of the genes commonly altered in WF exposure and each type of cancer examined were analyzed using four globally recognized databases: KEGG, WikiPathways, Reactome and BioCarta. We considered signaling pathways from these four databases and identified the most significant signaling pathways of each cancer type after applying several statistical analyses. Notably, we found that 6, 7, 5 and 7 signaling pathways are associated with CC, PC, LC and GC, respectively, as shown in Fig. 3.
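Pathway enrichment of a gene list is commonly assessed with an over-representation test. The sketch below computes a one-sided hypergeometric p-value with the standard library; it illustrates the general idea only, and whether it matches the exact scoring applied to the results reported here is an assumption. All counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_query, n_overlap):
    """P(overlap >= n_overlap) under random sampling without replacement.

    n_background: genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_query:      genes in the query list (e.g. shared dysregulated genes)
    n_overlap:    query genes that fall in the pathway
    """
    upper = min(n_pathway, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 3 of 3 query genes hit a 3-gene pathway in a 10-gene background.
p_small = hypergeom_enrichment_p(10, 3, 3, 3)
```

A small p-value indicates that the query list overlaps the pathway more than expected by chance; in practice such p-values are corrected for the number of pathways tested.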
Gene ontological analysis. The Gene Ontology (GO) is a universal conceptual model for representing gene functions and their relationships in the domain of gene regulation. It is constantly expanded by accumulating biological knowledge to cover the regulation of gene functions and the relationships between these functions in terms of ontology classes and semantic relations between classes 45. We analyzed ontological pathways of the genes with commonly altered expression in WF-exposed cells and each cancer type using two recognized databases, GO Biological Process and Human Phenotype Ontology. We considered ontological pathways from these two databases and identified the most significant ontological pathways for each cancer type after applying several statistical analyses. We found that 10, 11, 14 and 14 ontological pathways are associated with CC, PC, LC and GC, respectively, as shown in Tables 3-6.
Protein-protein interaction analysis. A protein-protein interaction network describes the binding of proteins in the cell that arises from biochemical or complex biological functions. Protein-protein interactions are essential to understanding cell physiology in health and disease states. We constructed and analyzed protein-protein interaction networks of the significantly altered genes of each cancer type using the STRING database. We clustered the protein-protein interactions of the cancer types into four different groups, as shown in Fig. 4.
Survival analysis. Patient survival analysis using both gene expression and clinical data is widely used in research to predict and characterize gene signatures in cancer 46. In this study, we estimated the survival function for the altered and normal groups of the significant genes that are common to WFs and the four types of cancers under investigation by employing the Cox PH model and the PL estimator. We fitted both univariate and multivariate Cox PH regression models. The significant genes of the four selected cancers, with estimated coefficients (β), hazard ratios (HR) and p-values from those analyses, are shown in Tables 7-10. After these analyses we selected the most significant genes for the four types of cancers by applying a p-value threshold (p ≤ 0.05). The survival curves of the most significant genes, comparing the altered and normal groups, were obtained using the PL estimator, as shown in Fig. 5. Fig. 5 shows that patients with altered expression of these genes have lower survival than the normal group.
(2020) 10:2795 | https://doi.org/10.1038/s41598-020-57916-9

Discussion
In this study we investigated how WF exposure may influence a number of types of cancer whose development and growth are greater with exposure to WFs or the components of WFs. We compared the gene expression alterations that result from WF exposure in cells with the genes that have dysregulated expression in several cancer types. The idea behind this is similar to studies of comorbidities, where dysregulated genes (or more usually gene pathways) that are common to two diseases give clues to how those diseases interact when co-occurring in the same individual, even if the reason for the altered expression of individual genes or pathways is unclear. Thus, genes or gene pathways altered both in response to WF exposure and in the cancers of interest can be means by which WF exposure encourages those cancers to develop. Note that WFs include components such as metal fumes that are absorbed by the lungs into the bloodstream and so reach many tissues around the body. Many of these fumes are carcinogenic, but cancer initiation is only one of a number of stages of cancer development and progression, and welders commonly have regular exposure to fumes over long periods. Unlike in other morbidities, some altered gene expression may arise in individual cancer cells due to mutations which will affect survival of those cells; if such altered expression is then detected in whole cancer tissue across many individuals (as in our studies) then the alteration may be affecting pathways that encourage survival and growth. Thus we have applied a systematic approach to identify pathways through which WFs may affect cancer behaviours.
[Figure panels: pathways associated with the significantly common differentially expressed genes of (a) CC, (b) PC, (c) LC and (d) GC with WFs.]
For our analysis we employed gene regulation analysis, gene-disease association networks, signaling and ontological pathways, and protein-protein interaction networks. To identify pathways and genes that are important in WF interactions in the cellular processes that influence cancer progression, we examined gene expression microarray data from WF exposed cells, CC, PC, LC and GC, each with control datasets. This identified a large number of significant genes that were commonly dysregulated between WF-exposure and cancer profiles, as was evident from simple gene expression comparisons. The dysregulated genes common to WF exposure responses and the cancer types suggest that WF exposure may cause gene expression changes that could affect the behaviour of cancers. It should be noted that cancer transcriptome datasets, such as those employed here, contain transcripts from both cancer cells and the supporting stromal cells found in the tumors themselves. Thus, WFs may exert their effects on cancers either indirectly (through tumor stroma) or on the cancer cells themselves.
Table 3. The most significant ontological pathways common to the WFs exposed cells and CC.
Table 6. The most significant ontological pathways common to the WFs exposed cells and GC.

The two separate gene-disease association networks that we constructed for up- and down-regulated genes showed strong evidence that WFs may indeed influence these cancers, as indicated in Fig. 2(a,b). Pathway-based analysis is a technique to better understand the molecular or biological mechanisms underlying complex diseases by determining common pathways through which a stimulus (such as WFs) may influence cells of interest. We identified significant signaling and ontological pathways of the commonly dysregulated genes of each cancer. These identified pathways indicated how WFs may affect these cancer types. Similarly, protein-protein interaction sub-networks of the commonly altered genes suggest that WFs affect several types of cancers. Note that if a pathway is a conduit for the effects of an important risk factor for a disease, this points to that pathway being particularly important to the pathogenesis of the disease, and reducing that pathway's effects could be a way to attack the disease progression itself. It should be noted that these findings only point to possible ways that WF exposure may affect the cancers and cannot prove causation. However, when we investigated whether the gene expression patterns that we observed could be associated with reduced survival of the patients (pointing to the importance of those gene expression levels either directly or indirectly), that is what we observed for several of the significant genes that are common to WF exposure and the cancer profiles under investigation, as shown in Fig. 5.
Figure 5. Survival function for the altered and normal groups of the most significant genes that are common to WFs and the four types of cancers under investigation. These include significant genes common to WFs exposed cells and CC (a-e), PC (f,g), LC (h,i) and GC (j-l). Here, the cyan line in the survival graphs indicates the altered gene expression group and the red line indicates the normal gene expression group.
It should be noted that the datasets employ a number of different cell types, which is commonly the case in this type of study. While gene expression patterns are, by definition, different in different cell types, here we were only concerned with expression alterations. Certain responses to WFs may not occur in all cells, so although our approach cannot identify all pathways affected by WFs in nascent tumour cells, it will find some. Indeed, our data provide evidence to suggest the involvement of a number of genes in cancer behaviours that are linked to the noxious effect of WFs on cancer.
We used the gold benchmark databases OMIM and dbGaP to cross-check the validity of our results and found that some genes were shared between WF exposure and the cancer types, as shown in Fig. 2(c). For validation purposes, we collected diseases and their associated genes from the dbGaP, OMIM Disease and OMIM Expanded databases using the differentially expressed genes of WFs. After several steps of statistical analysis we selected only cancer-related diseases. Interestingly, we found our four selected cancers among the list of cancers collected from these databases, as shown in Fig. 2(c).
Moreover, we found our identified genes in Fig. 2

Conclusions
In this study, we considered gene expression microarray data from WF exposure, CC, PC, LC, GC and control datasets to analyze and investigate the genetic links between WF exposure and the effects that it has on cancers. We analyzed gene expression, constructed gene-disease association networks, identified signaling and ontological pathways, and analyzed protein-protein interaction networks and survival functions of WF-exposed cells and cancers. The outcome of our study indicated that WFs can exert a strong influence on cancers. This kind of study will be useful for making more accurate disease predictions and for identifying potentially better therapeutic approaches. This study will also be useful for assessing the dangerous effects of welding on the human body.