The Overlap of Lung Tissue Transcriptome of Smoke Exposed Mice with Human Smoking and COPD

Genome-wide mRNA profiling in lung tissue from human and animal models can provide novel insights into the pathogenesis of chronic obstructive pulmonary disease (COPD). While 6 months of smoke exposure are widely used, shorter durations were also reported. The overlap of short term and long-term smoke exposure in mice is currently not well understood, and their representation of the human condition is uncertain. Lung tissue gene expression profiles of six murine smoking experiments (n = 48) were obtained from the Gene Expression Omnibus (GEO) and analyzed to identify the murine smoking signature. The “human smoking” gene signature containing 386 genes was previously published in the lung eQTL study (n = 1,111). A signature of mild COPD containing 7 genes was also identified in the same study. The lung tissue gene signature of “severe COPD” (n = 70) contained 4,071 genes and was previously published. We detected 3,723 differentially expressed genes in the 6 month-exposure mice datasets (FDR <0.1). Of those, 184 genes (representing 48% of human smoking) and 1,003 (representing 27% of human COPD) were shared with the human smoking-related genes and the COPD severity-related genes, respectively. There was 4-fold over-representation of human and murine smoking-related genes (P = 6.7 × 10−26) and a 1.4 fold in the severe COPD -related genes (P = 2.3 × 10−12). There was no significant enrichment of the mice and human smoking-related genes in mild COPD signature. These data suggest that murine smoke models are strongly representative of molecular processes of human smoking but less of COPD.

in vitro and in vivo studies are required to disentangle mechanism and establish causality. Of the in vivo models, mice are commonly used to determine the effects of cigarette smoking in the pathogenesis of COPD (reviewed in references [9][10][11][12][13][14][15] ). Generally, 6 months of smoke exposure is used to induce histological and functional abnormalities in murine lungs that mimic those of human disease including emphysema, airway remodeling and pulmonary hypertension, though the changes are relatively mild compared with those observed in long-term human smokers 16 . However, more recent methods can replicate these features as well as the impairment of lung function in 8 weeks with nose-only exposure 17,18 . Shorter exposure times are generally used to model inflammatory mechanisms 11,19 . How gene expression profiles compare between short term and long term smoke exposure is currently not well understood. Moreover, although mice are commonly employed to model COPD, the extent to which murine experiments mimic the human condition is uncertain.
The availability of genome-wide transcriptomic signatures in lung tissue enables comparisons between human and murine models following short-and long-term cigarette smoke exposures. The aims of this study were to compare and contrast the molecular changes in murine models following short and long term exposures with the molecular changes in human lungs induced by cigarette smoke. Most importantly, we sought to determine, if the human "COPD" lung tissue gene expression signature is captured in murine lungs exposed to cigarette smoke.

Results
Murine gene expression signatures following short-term smoke exposure. The six murine studies involving short term smoke exposure are summarized in Table 1.
Principal component analysis (PCA) performed on the 10,634 common genes led to the exclusion of one sample of air exposed mice from study GSE55127, which was a clear outlier. The resulting PCA plot shows that the 6 studies were homogenous in terms of expression changes and demonstrated clustering based on smoke-exposure status (Fig. 1).
Gene expression analysis of the pooled dataset from the 6 murine studies identified 3,723 genes that were differentially expressed at an FDR cutoff of 10%. Of the differentially expressed genes, 3,519 genes had 3,687 human gene orthologs. The use of a more stringent FDR cutoff of 5% or 1% reduced the number of differentially expressed genes to 3,051 and 2,021, respectively. The most significant differentially expressed genes were cxcl1 (C-X-C motif chemokine ligand 1), gpnmb (glycoprotein nmb) and cd84 (Table 2).
Murine gene expression signatures following long-term smoke exposure. The gene expression analysis in the pooled dataset from the two murine studies which involved 6 months smoke exposure identified 3,106 genes that were differentially expressed at an FDR cutoff of 10%. Of these, 2,989 genes had 3,116 human gene orthologs. Table 2 shows the top 20 differentially expressed genes in the short and the 6 month smoke exposure experiments. Comparison of human versus murine lung gene expression profiles related to cigarette smoke exposure or smoking status. We next sought to evaluate the extent of overlap between murine and human smoking signatures. A total of 184 genes, representing 48% of human smoking signature genes, were shared between the human and short-term mice smoking exposure. Of those, 148 and 14 genes were up-and down-regulated in both datasets, respectively and 22 had an opposite direction of effect between the two datasets. A circos plot comparing the human lung tissue smoking signature of 386 genes (current vs. never smokers) with the 3,687 genes related to short term exposure in mice is shown in Fig. 2. When compared to the long-term exposure murine models (6 months), 168 genes demonstrated overlap with the human signature and of these, 146 and 9 genes were up-and down-regulated in both datasets, respectively and 13 had an opposite direction of effect. Comparing the human smoking signature to both short and long-term smoking exposure in mice, 139 genes overlapped in all three studies, and of these 121 and 8 genes were up and down-regulated in all three studies and 10 genes had an opposite direction of effect in the human dataset. A list of the top 20 (based on the P values in the human data) overlapping genes is shown in Table 3. The list of overlapping genes included aryl-hydrocarbon receptor repressor (AHRR), CYP1B1 cytochrome P450 family 1 subfamily B member 1 (CYP1B1), C-X-C motif chemokine ligand 16 (CXCL16), NAD(P)H quinone dehydrogenase 1 (NQO1) and serpin family D member 1 (SERPIND1). The 139 overlapping genes were enriched in numerous gene ontology processes related to defense and immune response, glycosphingolipid and ceramide catabolic processes ( Table 4).
Comparison of murine and human smoking signature with COPD lung-tissue signature. To gain insights into the translational potential of the smoking gene signature, we tested for overlap with published human COPD signatures in lung tissue. The Faner et al. dataset included 70 former smokers with COPD from GOLD grades 1 to 4 4 . Using this dataset we identified 4,071 genes that were differentially expressed between patients in GOLD 3/4 vs. GOLD 1/2. A total of 1,003 "smoking" genes (27%) from the short-term murine smoking experiments overlapped with the "severe COPD signature" from the Faner et al. Comparison of the human smoking and COPD signature showed that 116 "smoking" genes (30%) from the lung eQTL dataset overlapped with the "severe COPD signature" of Faner et al. (Fig. 3). Of the 3,116 "smoking" genes derived from the 6 month exposure model in mice, 1,958 (53%) overlapped with the "smoking" genes derived from the short-term smoke exposure in mice and 168 genes (44%) overlapped with the human smoking signature from the eQTL study, and 914 genes (22%) overlapped with the "severe COPD" signature in the Faner et al. study.
Overall, 48 genes were common to both smoking signatures in mice (short and long term exposure) as well as the human smoking and severe COPD signatures (Supplementary Table 1). All of these genes except one showed the same direction of effect across studies i.e. up-regulated in smoking and in COPD or vice versa. These 48 genes were enriched in a number of gene ontology processes that are summarized in Table 4 including antigen processing and presentation, pyridine and nicotinamide nucleotide metabolic process, catabolic processes, transmembrane transport, oxidoreduction coenzyme metabolic process, and carbohydrate and glucose catabolic processes.
An additional relevant question to this work was whether or not smoking signature of mice and human will show enrichment in mild COPD as opposed to severe COPD signature. To answer this question, we analyzed the transcriptome of lung tissue eQTL study comparing mild COPD cases to controls. At and FDR <0.1 cutoff,  this analysis identified 7 genes differentially expressed between mild COPD cases and controls (Supplementary Table 2).
To quantify the extent of overlap among the different studies, we used a Fisher's exact test to determine whether there was significant enrichment of the human smoking or disease signatures in murine smoking signatures (Table 5). Differentially expressed genes from all the studies showed an over-representation in the mice data. The strongest enrichment was observed between the short and long-term mice smoking signatures (5.5 fold enrichment, p = 1.6 × 10 −309 ). The results also showed almost 4-fold enrichment of human smoking genes in the mice smoking signature (P = 6.7 × 10 −26 ). Of the lung tissue disease signatures, the severe COPD signature from the study of Faner et al. was over-represented in the short term murine smoking signature (1.2 fold enrichment, P = 4.2 × 10 −05 ) and was also over-represented in the 6 month smoking models (1.4 fold, p = 2.3 × 10 −12 ). Interestingly though, the mild COPD signature was not over-represented in the mice or human smoking signatures.
Integrative genomics of smoking related genes common to both human and murine lungs. To extend the gene expression findings to large scale human genetic studies of lung function we investigated whether any of the genes whose expression was related to smoking in both human and murine lungs were under genetic control in human lung tissue (i.e. were lung expression quantitative trait loci [eQTLs]). We found that 60 of the 139 genes have significant eQTLs (10% FDR) with a total of 7834 eQTLs.
Next, we restricted the analysis to the most significant eQTL per probeset (based on the eQTL p values) which led to a final SNP list of 73 (some SNPs were top eQTLs for more than one probeset). The 73 eQTLs were tested for associations with lung function in publically available large-scale genome-wide association studies (GWASs) datasets: SpiroMeta 20 and the UKBilEVE studies 21 . The results for SNPs with p < 0.05 are shown in Supplementary  Table 3 for eQTLs that had p value < 0.05 for association with lung function in SpiroMeta or UKBiLEVE. The only SNP that was associated with lung function at FDR < 0.05 was rs1081512, which was an eQTL for CTSS (cathepsin S) gene. It was also strongly associated with FEV 1 in the SpiroMeta GWAS (P = 6.07E-05, FDR = 0.004).

Figure 2.
Circos plot of smoking related genes overlapping between human and murine lungs. Genes are shown based on their chromosomal positions (in the human genome) in the outer most circle. The first circle from the inside represents genes from the short-term smoke exposed mouse while the second circle represents genes from the long (24 weeks) term smoke-exposed mouse and the outer most circle represent the human smoking-related genes. Each line represents a gene: inward lines labeled in orange represent down-regulated genes while outward lines in red represent up-regulated genes. Gene symbols are colored accordingly with down and up-regulated genes depicted as orange and red, respectively. The length of the line is proportional to the -log10 p values for differential expression in human and for the -log10 FDR values in murine data. Gene symbols in black are genes that showed opposite direction of effect between mice and humans.

Discussion
Pre-clinical animal models represent a valuable tool for understanding the pathogenesis of COPD and identifying novel therapeutics and biomarkers. However, to date, there has been a scarcity of data that have directly compared molecular profiles in the lungs of smoke-exposed mice that have been used to model COPD against those of human lungs in order to determine how (and if) 'disease' in these animals is representative of the human condition. A recent study by Yun et al. reported the overlap of mice and human smoking signatures and identified many overlapping genes, but very few that were shared with COPD signature 22 . Earlier, Morissette et al. also investigated the overlap of genes differentially affected by smoking in both mice and human lung tissues 8 . They found an enrichment of genes that were significantly modulated by cigarette smoke in humans and in mice, and that the majority of biological functions modulated by cigarette smoke in humans were also affected in mice 8 . Both studies, however, did not compare short vs. long term smoke exposures of mice and did not identify genetic variants relating to the expression of genes of interest.
By directly comparing and contrasting the gene expression profiles of smoke-exposed (both long and short-term) murine lungs against a large number of human lungs of current and ex smokers across the full spectrum of COPD severity (and also versus former smokers), we have made several important observations. They include: (1) the identification of overlapping 3,723 and 3,106 genes that were differentially expressed in short and long-term smoke exposure in murine lungs, respectively (5.5-fold enrichment of short term signature in the long-term signature, P = 1.6 × 10 −309 ), suggesting that acute transcriptomic changes in the lungs related to cigarette smoking are largely retained over longer term, when morphologic appearances of emphysema, airway remodeling and mild pulmonary hypertension become measurable in mice; (2) a significant overlap of genes in smoke-exposed murine lungs (48% from short-term exposure and 44% from long-term exposures) with those of human lungs explanted from current smokers. There was a 3.8 fold enrichment of the human "smoking" lung signature in the murine lungs (p = 1.6-6.7 × 10 −26 ); and (3) a 1.4 fold enrichment of severe COPD gene expression signature in the human "smoking" lung signature (p = 3.5 × 10 −3 ), with a 1.2 fold enrichment in short-term smoke-exposed murine (p = 4.2 × 10 −5 ) and a 1.4 fold enrichment in long-term smoke-exposed murine lungs (p = 2.3 × 10 −12 ). There were 48 genes that were common to the lungs of both smoked-exposed mice and current smokers and severe COPD, suggesting that the long term smoking exposure of mice results in transcriptomic changes that are also found in severe COPD patients even following smoking cessation. Of these the association of the gene encoding for cathepsin S was also supported in large scale human genetics studies of lung function. Finally, the murine and human smoking signatures were not over-represented in mild COPD signature, suggesting that overall mice models are better representative of smoking but less so of COPD in humans.
The smoking genes that overlapped between murine and human lung tissue included aryl hydrocarbon receptor repressor (AHRR), cytochrome P450 family 1 subfamily B member 1 (CYP1B1), C-X-C motif chemokine ligand 16 (CXCL16), NAD(P)H quinone dehydrogenase 1 (NQO1) and serpin family D member 1 (SERPIND1), all of which were up-regulated in the lung tissues of smokers. The AHRR gene has been well studied and a large number of publications have reported changes in its methylation and expression related to smoking 23,24 . AHRR encodes a ligand-activated transcription factor that inhibits the aryl hydrocarbon receptor pathway, which, in turn, increases the expression of xenobiotic-metabolizing enzymes that break down environmental pollutants, such as polycyclic aromatic hydrocarbons contained in cigarette smoke 25 . CYP1B1 is a phase I enzyme that is  Table 3. Top 20 smoking-related genes overlapping between human and murine lungs.
involved in the conversion of procarcinogens in cigarette smoke to carcinogenic intermediates 26 . The expression of CYP1B1 was found to be up-regulated in a number of tissues including the lungs following cigarette smoke exposure 27 . NQO1 is an enzyme involved in the detoxification of mutagenic and carcinogenic quinones, by preventing electron transfer and the generation of free radicals and reactive oxygen species 28 and converting the intermediates to the less toxic hydroquinones 29 . SERPIND1 encodes the heparin cofactor II (HCII), which is an endogenous thrombin inhibitor that protects against vascular remodeling and atherosclerosis via its inhibition of thrombin in the vascular wall 30 . It may also play a role in enhancing cell motility and promoting metastasis in non-small cell lung cancer 31 . Almost half (48%) of genes making up the human smoking signature overlapped with those differentially expressed in the murine smoked lung. The overlapping genes were enriched in processes related to host defense and immune responses including those that involve glycosphingolipid and ceramide catabolic pathways. These processes are well known to be affected by smoking [32][33][34] . The significant enrichment of human smoking signatures in the murine lung following short and long-term smoke exposure suggests that mice models of smoking do, in fact, reflect molecular changes that occur with smoking in humans. There are some caveats, however. For instance, we found that for ~6% of the human "smoking" lung genes the change in gene expression in the murine lungs was in the opposite direction. This may be due to different molecular responses to smoking between human and mice lungs. Alternatively, it may reflect the duration of cigarette smoke exposure between humans and mice. The duration of smoke exposure for mice ranged from 6-24 weeks compared to years of smoking in humans. However, this can be considered representative in a mouse that have an average life span of 1.5-2 years.
Using integrative genomics we showed that 43% of the overlapping smoking signature genes were under genetic control in lung tissue. The SNP that showed the strongest association with lung function in large scale genetic studies was an eQTL for the cathepsin S gene (CTSS). The CTSS gene encodes an elastin-degrading proteinase which is highly expressed by macrophages and dendritic cells 35 and plays an active role in adaptive immune responses 36 . The major inhibitor of cathepsin S is cystatin C which was recently identified as a COPD causal gene using an integrative genomics approach 37 .
Our current study has some limitations. First, the sample sizes for the studies included may have led to false negative results. Second, the unit of analysis in this study was gene expression, yet translation and post translational modifications of proteins in lung tissue may also be similar or different between mice and human and between smokers with and without COPD. Third, mice have different pathophysiology compared to humans. For example, studies have shown that in humans, the loss of small airways proceeds the development of emphysema  before COPD is detectable with spirometry 38 . Finally, the cellular heterogeneity of murine and human lung tissue samples may have limited our ability to detect overlapping signatures.
In conclusion, the current study uncovered a strong similarity between short and long term smoking effects on lung transcriptome in mice and a strong overlap with the human smoking signature. The study additionally uncovered genes common to smoking and COPD signatures in mice and humans which warrants further study.

Methods
Data sources. Human Lung tissue eQTL and smoking signature study. To compare murine lung smoke exposure induced gene expression against human smoking gene expression signatures, we used a large human dataset that has been previously described. The lung expression quantitative trait loci study (eQTLs) profiled 1,111 human lung tissue from current and ex-smokers and non-smokers [39][40][41][42] . Briefly, non-tumour lung specimens were collected from patients undergoing lung surgery at three different sites: Institut Universitaire de Cardiologie et de Pneumologie de Québec (IUCPQ), Laval University (Quebec, Canada), University of British Columbia (UBC, Vancouver, Canada) and University of Groningen (Groningen, the Netherlands. Gene expression profiling was performed using an Affymetrix custom array (GPL10379), which contained 51,627 non-control probesets and data were normalized using RMA 43 . Genotyping was performed using the Illumina Human1M-Duo BeadChip array. Genotype imputation was undertaken using the 1000 G reference panel. Following standard microarray and genotyping quality controls, data from 1,111 patients were available including 409 from Laval, 339 from UBC and 363 from Groningen. Association testing for each variant with mRNA expression in either cis (within 1 Mb of transcript start site) or in trans (all other combinations) was undertaken separately for each study sample, after which the results were meta-analyzed using inverse variance weighting. A genome-wide 10% false discovery rate (FDR) was applied to this analysis. The smoking gene signature in the eQTL study has been previously published 7 and consisted of 386 genes that were differentially expressed between current vs. never smokers (henceforth referred to as "human smoking signature").  The lung eQTL study was also used to identify mild COPD signature. We performed differential expression analysis between mild COPD (FEV1 ≥80% predicted and FEV1/FVC <0.7) and controls (FEV1 ≥80% predicted and FEV1/FVC >0.7). The analysis was adjusted for age, sex and smoking status and the sample sizes were 58 mild COPD patients (12 from Laval and 46 from UBC) and 107 control subjects (11 from Laval and 96 from UBC). Results were combined using meta analysis using inverse variance weighting fixed effect model.
All methods were carried out in accordance with relevant guidelines and regulations. Study participants informed consent was obtained from all subjects, and data access and analyses protocols were approved by the University of British Columbia Office of Research Ethics.
Mouse gene expression data. Lung gene expression profiles of six publically available datasets (n = 48 samples) were obtained from the Gene Expression Omnibus (GEO) (accession numbers GSE33561, GSE33512, GSE52509, GSE17737, GSE55127, GSE18344) 44 . GSE33512, GSE55127, GSE52509 and GSE33561 datasets were pre-processed as described in the corresponding source publications. GSE17737 and GSE18344 datasets contained samples profiled on Affymetrix Mouse Genome 430 2.0 arrays. These arrays were normalised with frozen Robust Multi-array Analysis (fRMA), a procedure that allows microarrays to be pre-processed individually or in small batches and allows data to be combined into a single dataset for further analyses 45 . Since different profiling platforms contain different numbers of genes, we included 10, 634 genes in the analysis that were common to all platforms. A more detailed description is provided by Dvorkin-Gheva et al. 44 To enable comparisons with smoking signatures from longer duration of smoking, we included samples from two additional GEO datasets (GSE52509 and GSE17737) that evaluated murine lung tissue expression changes following 24 weeks of smoking exposure. GSE52509 dataset was preprocessed as described in the corresponding publication, while the samples from GSE17737 were normalized with fRMA 44 . Samples from both datasets were combined and the technical variation was removed by using Distance-Weighted Discrimination (DWD) method 46 .
Differential gene expression analysis. We used the "limma" package 47 to compare gene expression profiles of smoke-exposed mice from each dataset with those of control mice pooled across all experiments. T-statistics were followed by Benjamini-Hochberg adjustment for multiple testing 48 .
Lung tissue transcriptome signature of COPD severity. We used data from Faner et al. to determine which genes were differentially expressed across COPD disease severity 4 . Briefly, lung tissue samples were obtained from 70 former smokers with COPD who required thoracic surgery because of cancer or lung transplant. RNA samples were loaded onto an Affymetrix GeneChip Human Genome U219 Array Plate (Santa Clara, CA). The microarray data have been deposited in GEO (GSE69818) 4 . We identified 4,071 genes whose expression in lung tissue was different in patients with moderate or severe COPD (i.e. Global Initiative for Chronic Obstructive Lung Disease (GOLD) grades 3, 4) and those with mild COPD (defined by GOLD grades 1, 2). These differentially expressed genes will henceforth be referred to as "severe COPD signature".
Overlap of murine and human genes. In order to compare results across murine and human studies, we restricted the analyses to murine genes that had a human ortholog using the BioMart-Ensembl database (release 88, March 2017 http://www.ensembl.org/info/about/publications.html). We retained only those human genes on chromosome 1 to 22, or on chromosome X or Y, based on the position information from the BioMart-Ensembl database.
Enrichment of gene signatures. A hypergeometric (Fisher's exact) test was used to test for significant over or under-representation of common genes from two different studies.