Comparative transcriptome analyses reveal genes associated with SARS-CoV-2 infection of human lung epithelial cells

During 2020, understanding the molecular mechanism of SARS-CoV-2 infection (the cause of COVID-19) became a scientific priority due to the devastating effects of the COVID-19. Many researchers have studied the effect of this viral infection on lung epithelial transcriptomes and deposited data in public repositories. Comprehensive analysis of such data could pave the way for development of efficient vaccines and effective drugs. In the current study, we obtained high-throughput gene expression data associated with human lung epithelial cells infected with respiratory viruses such as SARS-CoV-2, SARS, H1N1, avian influenza, rhinovirus and Dhori, then performed comparative transcriptome analysis to identify SARS-CoV-2 exclusive genes. The analysis yielded seven SARS-CoV-2 specific genes including CSF2 [GM-CSF] (colony-stimulating factor 2) and calcium-binding proteins (such as S100A8 and S100A9), which are known to be involved in respiratory diseases. The analyses showed that genes involved in inflammation are commonly altered by infection of SARS-CoV-2 and influenza viruses. Furthermore, results of protein–protein interaction analyses were consistent with a functional role of CSF2 and S100A9 in COVID-19 disease. In conclusion, our analysis revealed cellular genes associated with SARS-CoV-2 infection of the human lung epithelium; these are potential therapeutic targets.


Scientific Reports
| (2021) 11:16212 | https://doi.org/10.1038/s41598-021-95733-w www.nature.com/scientificreports/ on corona virus sequences have been deposited in public repositories such as NCBI Gene Expression Omnibus (GEO) and EMBL Array Express 11,12 . Meta-analysis and mining of such data can aid in a) understanding the molecular impact of COVID-19, b) elucidating differences and similarities between SARS-CoV-2 and other respiratory virus infections, and c) identifying targets for drug development. In the current study, we performed comparative analysis of publicly available gene expression data related to human lung epithelial cells infected with a respiratory virus. The analyses identified genes specifically expressed by SARS-CoV-2 infections and those that are commonly altered due to infection of coronovirus-2 and/or other respiratory viruses. In particular, expression of CSF2 (colony-stimulating factor 2) appears to be involved in COVID-19 disease. Several COVID-19 clinical trials are currently focusing on inhibition of CSF2 [GM-CSF] [13][14][15] .

Comparative analysis of DEGs resulted in identification of SARS-CoV-2 infection-specific genes and those commonly affected by most of the viral infections.
In order to identify genes that are exclusively affected on SARS-CoV-2 infection, we compared common DEGs (98 protein-coding genes of 100) from RNA-seq studies with GEO2R analysis results of GSE47962, GSE17400, GSE48575, GSE49840, and GSE71766 (Supplementary Table S1). GEO2R results from each microarray study include lists of differentially expressed probes, gene symbols, fold change, and adjusted p-values. If multiple probes related to the same gene were differentially expressed, we considered the probe with highest fold change value. Comparative analysis showed 7 genes exclusively altered on SARS-CoV-2 infection, including 5 up-regulated genes (CSF2, S100A8, MRGPRX3, S100A9 and MAB21L4) and two down-regulated genes (CXCL14 and PCDH7) (Fig. 2b). However, microarray platforms showed the absence of probes related to MAB21L4 in all microarray studies (Table 1). Inflammation-related genes (such as IL6, IL1A, IL1B, CXCL2, CXCL6, CCL20, TNIP1, VNN1, TNFAIP3 and NFKBIZ) were commonly affected due to infection of SARS-CoV-2 and other SARS or avian/human influenza viruses (Supplementary Figure S1c).

Validation of SARS-CoV-2 exclusive genes in human bronchial organoid RNA-seq data. Pro-
cessing and analysis of RNA-seq data related to SARS-CoV-2-infected human bronchial organoids (HBO) led to identification of 1532 differential expressed genes (861 up-regulated and 671 down-regulated genes) (Supplementary Figure S1a). With the exception of MAB21L4, all SARS-CoV-2 exclusive genes from comparative transcriptome analysis showed the same expression pattern in HBO cells (Supplementary Figure S1b).

Protein-protein interaction analysis of altered genes on SARS-CoV-2 infection revealed hub genes.
The common 98 protein-coding, differentially expressed genes on SARS-CoV-2 infection were first queried in the STRING database to identify known/predicted interactions among them. The database returned a PPI network of 333 interactions (edges) between 72 genes (nodes) (Fig. 3). The PPI network was downloaded as a simple interaction format (SIF) file, visualized with Cytoscape, and analyzed with Cytohubba plugin to identify hub genes. The top 50 genes were obtained based on three network parameters: degree, closeness, and betweenness separately. The 43 genes featured in all three lists were considered as hub genes (  . 4); CSF2 and S100A9 were exclusive to SARS-CoV-2 infection.

Discussion
Microarray meta-analysis and comparative transcriptome analysis have been useful bioinformatic approaches for maximum utilization of publicly available gene expression data 16,17 . Since the advent of high-throughput technologies such as microarrays and RNA-seq, researchers have performed in-depth transcriptome analyses of various biological conditions, leading to various discoveries 18   www.nature.com/scientificreports/ of such data can aid researchers in understanding the molecular mechanisms of a disease and in discovering biomarkers [19][20][21] . We observed the availability of various microarray studies related to human respiratory viral infection and sensed the opportunity to compare the effect of SARS-CoV-2 and other respiratory viral infections on the human lung transcriptome.
The comparative transcriptome analysis led to identification of genes that were altered exclusively after SARS-CoV-2 infection. Among these genes, S100 calcium-binding protein A9 (S100A9) and S100 calcium-binding protein A8 (S100A8) are calcium-and zinc-binding proteins that are elevated in inflammatory lung disorders 22 . S100A8 and S100A9 form a heterodimer complex called Calprotectin (CLP). Elevated levels of CLP is found in bronchoalveolar lavage fluid (BALF), serum and lung tissue of pneumonia patients 23 . Serum CLP could be potential biomarker for COVID-19 and further research is required to test this hypothesis 24 . Colony stimulating factor 2 (CSF2) is a cytokine-coding gene associated with respiratory diseases such as pulmonary alveolar proteinosis 25 . CSF2 (GM-CSF) is known to be pro-inflammatory cytokine produced by wide variety of cells such www.nature.com/scientificreports/ as macrophages, T-cells, fibroblast, tumor cells, endothelial cells, and with primary production at the inflammation site 26,27 . GM-CSF influences activation and proliferation of immune cells such as macrophages, monocytes, dendritic cells, neutrophils, eosinophils 28 . Although GM-CSF plays important role in maintaining immune homeostasis, its over-expression in lung could lead to fibrotic reactions and severe immune cell infiltrations 29 . Recent reports show that a) COVID-19 patients requiring intensive care unit (ICU) have increased level of GM-CSF in China and b) drugs targeting CSF2 (GM-CSF) or its receptor (such as lenzilumab, namilumab, gimsilumab, and otilimab) are being evaluated in clinical trials with COVID-19 patients [13][14][15]30 . MAS-related GPR family member X3 (MRGPRX3), a member of the mas-related/sensory neuron specific subfamily of G protein coupled receptors, is down-regulated in human airway epithelial cells exposed to smoke from electronic cigarettes 31 . In oral cancers, lung cancers, and head and neck cancers, C-X-C Motif chemokine ligand 14 (CXCL14) functions as a tumor suppressor; it also induces growth of prostate and breast cancers [32][33][34][35][36] . Protocadherin 7 (PCDH7) is involved in cell-cell recognition and adhesion 37 . Mab-21-like 4 (MAB21L4) has no known association with lung disorders or respiratory virus infections. The PPI analysis of genes differentially expressed by SARS-CoV-2 infection identified 43 hub genes. CSF2 and S100A9 were the only hub genes to show a SARS-CoV-2-exclusive gene expression pattern. Almost all other hub genes were affected by infection of other respiratory viruses. In conclusion, both PPI analysis and comparative transcriptome analysis point to a role of CSF2 in the molecular mechanism of SARS-CoV-2 infections of human lung epithelium. The current study also highlights the exclusivity of known lung inflammation disorder genes such as S100A8 and S100A9 with respect to SARS-CoV-2 infection.
Our current study utilizes exclusively the transcriptomic data. Similar comparative analysis as well as integrative analysis with proteomics and other "omic" data will be useful as more data becomes public. The current bioinformatics approach makes use of available data to identify potential molecular targets for treatment of COVID-19. Future studies will focus on experimental validation to establish the exclusive association of CSF2, S100A8 and S100A9 with SARS-CoV-2 infections in lung epithelial cells.  Raw sequencing data related to selected samples were downloaded from Sequence Read Archive (SRA) using fastq-dump of sratoolkit v2.9.6 (http:// ncbi. github. io/ sra-tools/). First, raw sequencing reads were trimmed to remove adapter sequences and low-quality regions using Trim Galore! (v0.4.1) (http:// www. bioin forma tics. babra ham. ac. uk/ proje cts/ trim_ galore/). Trimmed reads were subjected to quality control analysis using FastQC (https:// www. bioin forma tics. babra ham. ac. uk/ proje cts/ fastqc/). Tophat v2.1 was used to map trimmed raw reads to the human reference genome (hg38) 39 . All bam files from multiple runs related to the same samples were merged and sorted using SAMtools (Version: 1.3.1) 40 . Finally, raw read counts were enumerated for each gene in each sample using GTF (gene transfer file) from Ensembl [Homo.sapiens_GRCh38.82.gtf] and HTSeq-count 41 .
Gene ontology enrichment analyses of the common Differentially Expressed Genes (DEGs) were accomplished by use of the Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.8 online tool 43 . Gene ontology (GO) biological processes with P-values < 0.05 and gene counts > 2 were considered as significantly enriched.

OR (Human bronchial epithelial) AND "Homo sapiens"[porgn] AND ("gse"[Filter] AND "Expression profiling by array"[Filter]))) AND (viral infection AND ("gse"[Filter] AND "Expression profiling by array"[Filter])) AND ("gse"[Filter] AND "Expression profiling by array"[Filter]) AND ("Expression profiling by array"[Filter]
). This led to 38 search results, three of which (GSE49840, GSE71766, and GSE48575) were selected for analysis. Table 3 provides sample, platform, and cell line details for all five studies. Since SARS-CoV-2 RNA-seq data included transcriptome profiling after 24 h of infection, in all microarray studies, we considered only samples after 24 h of viral infection.
Protein-protein interaction analysis. STRING, a database of known or predicted protein-protein interactions (PPIs) was used to obtain interactions between genes altered on SARS-CoV-2 infection 49 . Output from the STRING database was uploaded to Cytoscape v3.7.2 in simple interaction format, and the Cytohubba app was employed to identify hub genes [50][51][52] . The top 50 genes were obtained separately from the PPI network based on three network parameters (closeness, degree, and betweenness) then common genes among these were selected as hub genes. We also checked the DrugBank database to determine if a drug is available to target them 53 .