Network-based integrated analysis for toxic effects of high-concentration formaldehyde inhalation exposure through the toxicogenomic approach

Formaldehyde is a colorless, pungent, highly reactive, and toxic environmental pollutant used in various industries and products. Inhaled formaldehyde is a human and animal carcinogen that causes genotoxicity, such as reactive oxygen species formation and DNA damage. This study aimed to identify the toxic effects of inhaled formaldehyde through an integrated toxicogenomic approach utilizing database information. Microarray datasets (GSE7002 and GSE23179) were collected from the Gene Expression Omnibus database, and differentially expressed genes were identified. The network analyses led to the construction of the respiratory system-related biological network associated with formaldehyde exposure, and six upregulated hub genes (AREG, CXCL2, HMOX1, PLAUR, PTGS2, and TIMP1) were identified. The expression levels of these genes were verified via qRT-PCR in 3D reconstructed human airway tissues exposed to aerosolized formaldehyde. Furthermore, NRARP was newly found as a potential gene associated with the respiratory and carcinogenic effects of formaldehyde by comparison with human in vivo and in vitro formaldehyde-exposure data. This study improves the understanding of the toxic mechanism of formaldehyde and suggests a more applicable analytic pipeline for predicting the toxic effects of inhaled toxicants.

www.nature.com/scientificreports/ High-throughput technologies, such as microarray and RNA-sequencing, allow the analysis of large amounts of gene expression data. Gene Expression Omnibus (GEO), established by the National Center for Biotechnology Information (NCBI), is the largest public database that archives and distributes several types of high-throughput genomic data 9 . A comprehensive analysis of existing experimental data using bioinformatic tools can improve the understanding of biological phenomena. Various network strategies can be applied to information gathered from numerous independent studies for performing bioinformatic analysis to elucidate complex biological interactions 10,11 .
To synthetically explore the toxic mechanisms of formaldehyde inhalation, two microarray datasets (GSE7002 and GSE23179) derived from formaldehyde inhalation studies in rats under identical in vivo experimental conditions were collected from the GEO database. By integrating and expanding the interpretation of the original datasets 12,13 , we investigated the toxic effects of formaldehyde exposure on the respiratory system by the toxicogenomic approach. After data preprocessing, differentially expressed genes (DEGs) were obtained, and hub genes associated with the respiratory effect of formaldehyde exposure were identified through the networkbased bioinformatics approaches. To resemble airway in vivo exposure, we employed a three-dimensional (3D) reconstructed human airway mucosa model using a custom-made aerosol exposure system. The cytotoxicity of aerosolized formaldehyde was estimated, and the expression levels of selected hub genes were validated via air-liquid interface (ALI) culture. Our integrated analysis could improve understanding of the toxic effects of formaldehyde exposure on the respiratory system. An overview of the methodology applied in this study is shown in Fig. 1.

Materials and methods
Data collection and preprocessing. Formaldehyde inhalation gene expression profile datasets were obtained from the GEO database (https:// www. ncbi. nlm. nih. gov/ geo/). Two gene expression profiles (GSE7002 and GSE23179) derived from the GPL1355 platform [Rat230_2] Affymetrix Rat Genome 230 2.0 Array were downloaded. Among all datasets, samples with the high-exposure concentration (15 ppm) and different exposure periods (6 h, 5 days, 4 weeks, and 13 weeks) were selected to explore the pattern of time-dependent changes induced by formaldehyde. The selected dataset consisted of 37 control samples and 16 formaldehyde-exposed samples. The raw data were normalized using the robust multi-array average method in the affy package (version 1.62.0) in R software 14 . To remove non-biological batch effects generated in the data merging process, the ComBat function in the sva package (version 3.38.0) was used 15 . Clustering analysis and DEGs screening. After preprocessing, Principal component analysis (PCA) was conducted to examine the pattern of gene expression profiles derived from different datasets. DEGs between control samples and formaldehyde-exposed samples were filtered using the limma R package (version 3.40.6) 16 . The threshold criteria were |fold change|> 1.5 and false discovery rate adjusted p-value < 0.05. The expression patterns of the common DEGs in all exposure groups were visualized through the hierarchical clustering heatmap.  Network analysis and hub gene selection. Pathway Studio version 12.3, a commercial text miningbased biological network analysis software, was used to identify hub genes associated with the toxic effects of formaldehyde inhalation exposure 18 . Pathway Studio enables researchers to explore biological interactions extracted from a vast number of studies and visualize these interactions through the entity and connectivity of each relation. In Pathway Studio, "entity" signifies a node, such as a gene, disease, and cell process, and "relation" signifies a biological interaction between two entities. Each relation is expressed as a connectivity (edge) on the network and supported by the number of reference studies 19 . "Direct Interactions," "Expand Pathway," and "Shortest Path" algorithms were applied to construct the functional network of the common DEGs. Topological parameters between gene-gene interactions were calculated using the NetworkAnalyzer in Cytoscape software (version 3.7.2) 20 .
3D reconstructed human airway mucosa model and aerosolized formaldehyde exposure. A 3D reconstructed human airway mucosa model, SoluAirway, and its media were purchased from Biosolution Co., Ltd. (Seoul, Korea). SoluAirway was placed in a 6-well plate filled with culture media (0.9 mL/well) and pre-incubated at 37 ℃ and 5% CO 2 overnight. Formaldehyde (16%, methanol-free, ultrapure) was purchased from Polysciences, Inc. (Warrington, PA, USA) and diluted in distilled water. According to a previous study's protocol 21 , aerosol exposure to SoluAirway was conducted for 2 min using a medical nebulizer NE-U150 (OMRON Healthcare, Japan) in the experimentally designed chamber to treat 100 μL of formaldehyde and distilled water (control). Then, the tissues were incubated for 24 h. After the aerosol exposure, the apical surfaces of the tissues were washed four times with PBS (400 μL). Tissue viability was measured by the 3-(4,5-dimethylthiazol-2-yl)-2,5 diphenyltetrazolium bromide (MTT) assay. The tissues were transferred to 24-well plates filled with MTT solution (1.0 mg/mL, 200 μL), and the tissues were filled with MTT solution (100 μL). After incubation at 37 °C and 5% CO 2 for 3 h, MTT solution was eliminated, and the tissues were submerged in isopropanol (2.0 mL) for 3 h, protected from light. Extracted formazan was transferred to a 96-well plate (200 μL/well), and absorbance was measured at 570-nm wavelength.

Quantitative real-time reverse transcription PCR (qRT-PCR). The membranes of SoluAirway were
cut with a surgical blade, placed on a 6-cm plate, soaked in PBS, and the tissues were harvested using scrapers. Total RNA was extracted using an RNeasy kit (Qiagen, Germany), and the RNA quality was evaluated using the Agilent 2100 Bioanalyzer (Agilent Technologies, USA). cDNA was synthesized from extracted RNA (500 μg) using the ImProm-II Reverse Transcription System (Promega, USA), according to the manufacturer's instructions. qRT-PCR was conducted using TB Green Premix Ex Taq (Takara Bio, Japan) in Rotor-Gene Q (Qiagen, Germany) under the following thermal cycling conditions: initial denaturation at 95 °C for 5 min, 40 cycles of 3 steps, including 95 °C for 15 s, 60 °C for 15 s, and 72 °C for 30 s. Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) was used to normalize the gene expression levels according to the 2 −ΔΔCT method 22 . The sequences of primers for qRT-PCR are described in Supplementary Table S1.
Statistical analysis. Statistical analyses were performed using R software (version 3.6.3). All experimental data are presented as the mean ± standard error of the mean (SEM). Significant differences between the formaldehyde exposure group and control were estimated by Student's t-test at p-value < 0.05.

Results
Clustering patterns of transcriptomic profiles and DEGs identification. The experimental conditions of the two GEO datasets (GSE7002 and GSE23179) were designed almost identically: 8-week-old male rats, whole-body inhalation exposure, 6 h/day of formaldehyde exposure, nasal epithelial cells. To acquire more robust gene expression profiles, we merged 37 control and 16 formaldehyde-exposed samples in the two GEO datasets and corrected batch effects arising from different sampling dates. PCA result indicated that formaldehyde-exposed samples were distinct from control samples, regardless of the GEO datasets (Fig. 2a). DEGs were identified for each formaldehyde exposure group compared to the control group (Fig. 2b). There were 815 genes significantly expressed at 6 h exposure (upregulated: 354, downregulated: 461), 2,308 genes at 5 days (upregulated: 1,342, downregulated: 966), 2,095 genes at 4 weeks (upregulated: 1,118, downregulated: 977), and 3,621 genes at 13 weeks (upregulated: 1,712, downregulated: 1,909). Among all exposure groups, 258 genes were identified as common DEGs (Fig. 2c). These common DEGs showed a mostly consistent up/downregulated expression pattern at each exposure period, except for 7 genes (Fig. 2d).
Enriched canonical pathways associated with formaldehyde exposure. To interpret the meaning of the observed gene expression changes, IPA was used to identify canonical pathways for the common DEGs filtered by human genes. The top 10 canonical pathways were selected based on the − log(p-value), and the activation z-score pattern of the pathways was considered (Fig. 3, Table 1). Among the canonical pathways, p53 Signaling (z-score = 1, p = 4.07 × 10 -4 ), ErbB Signaling (z-score = 1.13, p = 5.37 × 10 -4 ), and NRF2-mediated Oxidative Stress Response (z-score = 0.378, p = 4.47 × 10 -3 ) exhibited positive activity patterns, and Ferropto- Respiratory-related network associated with formaldehyde exposure. To identify the hub genes related to formaldehyde exposure, we used Pathway Studio to perform network analysis on the common DEGs that were most sensitive to formaldehyde exposure. We focused our analysis on the respiratory effects of inhaled formaldehyde exposure, so we filtered the genes with expression information in the major respiratory organs (from nose to lung) among the common DEGs (Fig. 4a). From these genes, we selected the top 24 genes with edge degree number ≥ 10 based on the betweenness centrality of gene-gene interactions derived from the Pathway Studio database information (Fig. 4b, Table 2). In addition, we extracted major cell processes and respiratory diseases associated with formaldehyde exposure based on the minimum number of references (≥ 3) (Fig. 4c). A biological network for respiratory effects of formaldehyde exposure was constructed with comprehensive consideration of the results of gene expression values, the centrality of gene-gene interactions, and chemical-genedisease associations (Fig. 4d). These genes were highly associated with diseases, including lung cancer, asthma, and pneumonia, and with oxidative stress, inflammatory response, and immune response, as cell processes. Finally, six hub genes (AREG, CXCL2, HMOX1, PLAUR , PTGS2, and TIMP1) and major entities were selected www.nature.com/scientificreports/ through careful examination of reference studies (Fig. 5a). Moreover, functional changes related to formaldehyde exposure were predicted; activation of cell signaling (NF-κB, ERK1/2), cytokine release, histone crosslinks, and degradation of antioxidant functions (superoxide dismutase, glutathione peroxidase).

Validation of hub genes in 3D air-liquid interface (ALI) system. To simulate in vivo inhalation
exposure, we employed the 3D reconstructed human airway model SoluAirway and exposed it to aerosolized formaldehyde using a commercially available medical nebulizer. The concentration of formaldehyde exposure was set by tenfold serial dilutions (14.7 mg/mL (≈ 500 mM) to 50 μM), based on a previous study 21 . The MTT assay result showed that the tissue viability was > 90% at concentrations up to 5 mM and dropped significantly at concentrations above 50 mM (Supplementary Fig. S1). Among the sub-cytotoxic concentrations (> 90% viability), 0.5 mM (500 μM) was determined for qRT-PCR, considering that the 15-ppm data of the GEO dataset used is almost equivalent to 500 μM. The expression levels of the six hub genes were upregulated in the DEG results of all exposure periods and were validated using qRT-PCR. The mRNA expression of these genes examined in the 3D ALI system tended to be upregulated except HMOX1 (Fig. 5b).

Discussion
Formaldehyde is widely used in various fields and poses a high toxicity concern. In the early 1980s, the carcinogenicity of 5.6 and 14.3 ppm formaldehyde was confirmed in rats and mice for 2 years of exposure 23 . These concentrations are very high and unlikely to be encountered in daily life 24 . Therefore, our results using 15-ppm exposure data reflect high-exposure scenarios, such as chemical accidents and chronic occupational exposure to formaldehyde. Inhaled formaldehyde mainly affects the upper respiratory system, but a certain amount can be deposited directly into the lower respiratory system during oronasal breathing 2 . In animal, epidemiologic, and pulmonary function test studies, inhaled formaldehyde caused respiratory symptoms, with asthma identified as one of the most common diseases to arise due to exposure in indoor and occupational environments [25][26][27][28][29] . However, detailed knowledge of the mechanisms of respiratory disease onset associated with genetic changes in response to formaldehyde exposure is still lacking. This study explored integrated transcriptomic changes resulting from formaldehyde inhalation exposure. Utilizing public genomic data, we acquired significant genetic profiles of formaldehyde exposure. We identified hub genes related to respiratory effects through network analyses and qRT-PCR validation via the 3D ALI aerosol exposure system (Figs. 4, 5). AREG, a ligand of epidermal growth factor receptor (EGFR, ErbB1), was the most strongly expressed, with a fold change > 20 in all exposure groups, and was enriched in ErbB Signaling. AREG activates cell signaling pathways, including Ras/Raf/MEK/ERK and PI3K/AKT, and cellular events, such www.nature.com/scientificreports/ as cell proliferation and apoptosis 30 . Elevated AREG expression is associated with airway inflammatory diseases, such as chronic obstructive pulmonary disease (COPD), asthma 31,32 , and several types of cancer 33 . HMOX1 is the major downstream antioxidant protein of the NRF2-mediated Oxidative Stress Response pathway and contributes to inhibiting Ferroptosis Signaling, an iron-dependent cell death accompanied by the accumulation of lipid reactive oxygen species 34,35 . HMOX1 is generally known to defend against oxidative and inflammatory damages in diverse diseases, and there are results of overexpressed HMOX1 in some respiratory patients [36][37][38][39] . However, the qRT-PCR result did not match the expression tendency of the microarray DEG results. This may be due to inherent differences in the tissues used in the bioassays as well as the exposure environment 40 . PTGS2 is not expressed in most tissues in the normal state, but its expression is induced by diverse stimuli, including cytokines, such as epidermal growth factor, interleukin-1, and tumor necrosis factor, and during an inflammatory response 41 . PTGS2 expression and cytokines were increased in the respiratory epithelium and alveolar macrophages of asthmatics 42,43 . Persistent increased PTGS2 expression is thought to be induced by the inflammatory response after formaldehyde exposure 44,45 . PLAUR has been identified as an asthma, pneumonia, and COPD susceptibility gene 46,47 . PLAUR expression was elevated in the bronchial biopsy of asthmatics and is required for epithelial wound repair 48 . TIMP1 mainly functions as an important endogenous inhibitor in regulating matrix metalloproteinases. An imbalance of the MMP9/TIMP ratio is associated with the pathogenesis of asthma and lung diseases 49 . Sputum of asthmatics showed increased mRNA expression of TIMP1 50 . CXCL2 regulates normal and asthmatic airway smooth muscle cell migration 51 . During pulmonary inflammation, the mRNA expression of Cxcl2 was increased in rats 52 . Taken together, selected hub genes could be important markers for respiratory diseases caused by inhaled toxicants. Furthermore, people with defects in these genes will be more vulnerable to formaldehyde exposure even if they are not exposed to high concentrations 53 . Additionally, we compared our common DEGs with in vivo (human) and in vitro formaldehyde exposure data (GSE27263: nasal biopsy of volunteers exposed to formaldehyde up to 0.7 ppm for 4 h/day over 5 days; GSE21477: nasal epithelial cells exposed to 200 μM formaldehyde for 4 h). Interestingly, NRARP was identified as the only gene that showed increased expression in all formaldehyde-exposed data used, and 9 and 39 common genes were identified in human in vivo and in vitro data, respectively. Apart from the network analysis results, the up-regulation of the NRARP gene was also validated ( Supplementary Fig. S2). Although the functions of NRARP and its association with formaldehyde have not been fully understood, it interacts with Notch and WNT signaling in angiogenesis 54 and presents dual pro-or antitumor activity 55 . The Notch1 gene positively regulates Nrarp 56 and is upregulated in the long-term exposure (5 days to 13 weeks) DEG results. NRARP stimulates cell proliferation by negatively modulating p21/Rb-dependent cell cycle arrest 57 . Although the significance between the NRARP gene and respiratory diseases was not analyzed in the network analyses of our study, respiratory effects and cell cycle regulation-related carcinogenic effects of NRARP would be valuable in future formaldehyde studies.
Traditionally, inhalation toxicity studies are based on a strict animal inhalation test that is costly, timeconsuming, technically challenging, and unethical. Many in vitro studies have relied on two-dimensional (2D) monocultures in submerged conditions that cannot reflect the real in vivo cellular environments and exposure conditions. Efforts to reduce the gap between animal testing and 2D cell culture models and reflect in vivo physiological conditions have led to advanced in vitro studies, such as 3D cell culture and microfluidics/microengineering technologies 58 . Additionally, studies incorporating a sophisticated aerosol exposure system (e.g., VITROCELL) and 3D reconstructed airway tissue to reflect actual exposure are considered ideal in studying inhalation toxicity 59 . For this reason, we employed a 3D reconstructed human airway model and exposed it to aerosolized formaldehyde to investigate the toxic effects of inhaled formaldehyde. Although consideration of interspecies differences and further studies on the reliability of 3D in vitro data will be needed, our integrated approach suggests an applicable analytic pipeline for predicting the toxic effects of inhaled toxicants.
This study performed a toxicogenomic in silico analysis based on in vivo data and validated the result in the 3D ALI system. However, our approach should be viewed as a prioritization method for further toxicity testing due to the limitations of data mining. For instance, the quality of data mining analysis is dependent on the databases of the software used 60 . Moreover, there could be false-positive results and missing interactions due to potential bias toward well-studied interactions 40,61 . Finally, further studies are needed to identify all possible www.nature.com/scientificreports/ molecular changes by formaldehyde exposure, considering other types of omics data and other factors, such as detailed dose-response relationships, duration, and sensitivity of the exposed individuals [62][63][64] .

Conclusion
In summary, through a series of bioinformatic analyses of DEGs derived from public in vivo studies, hub genes related to respiratory diseases associated with formaldehyde exposure were identified. Several studies have evaluated lung damages caused by formaldehyde inhalation in rats 65,66 . Formaldehyde toxicity is considered to affect the entire respiratory system, not just the upper respiratory system. Considering these points, the significance www.nature.com/scientificreports/ of our toxicogenomic approach is that it can identify the respiratory effects of inhaled toxicants, serve as an early predictive alarm before serious lung diseases occur, and provide the possibility for identified genes to be used as biomarkers for clinical diagnosis and therapy. In addition, our results will contribute to improving the understanding of the toxic mechanism of formaldehyde, such as the construction of an adverse outcome pathway: suggestion of putative key events or adverse outcomes at diverse molecular levels, based on canonical pathways, cell processes, and diseases; inference of causal interaction; predictive assessment of carcinogenicity 67,68 .

Data availability
The datasets used during the current study are available in the NCBI Gene Expression Omnibus (accession number: GSE7002, GSE23179, GSE27263, and GSE21477). The other data generated or analyzed during the current study are available from the corresponding author upon reasonable request.