Gender specific airway gene expression in COPD sub-phenotypes supports a role of mitochondria and of different types of leukocytes

Chronic obstructive pulmonary disease (COPD) is a destructive inflammatory disease and the genes expressed within the lung are crucial to its pathophysiology. We have determined the RNAseq transcriptome of bronchial brush cells from 312 stringently defined ex-smoker patients. Compared to healthy controls there were for males 40 differentially expressed genes (DEGs) and 73 DEGs for females with only 26 genes shared. The gene ontology (GO) term “response to bacterium” was shared, with several different DEGs contributing in males and females. Strongly upregulated genes TCN1 and CYP1B1 were unique to males and females, respectively. For male emphysema (E)-dominant and airway disease (A)-dominant COPD (defined by computed tomography) the term “response to stress” was found for both sub-phenotypes, but this included distinct up-regulated genes for the E-sub-phenotype (neutrophil-related CSF3R, CXCL1, MNDA) and for the A-sub-phenotype (macrophage-related KLF4, F3, CD36). In E-dominant disease, a cluster of mitochondria-encoded (MT) genes forms a signature, able to identify patients with emphysema features in a confirmation cohort. The MT-CO2 gene is upregulated transcriptionally in bronchial epithelial cells with the copy number essentially unchanged. Both MT-CO2 and the neutrophil chemoattractant CXCL1 are induced by reactive oxygen in bronchial epithelial cells. Of the female DEGs unique for E- and A-dominant COPD, 88% were detected in females only. In E-dominant disease we found a pronounced expression of mast cell-associated DEGs TPSB2, TPSAB1 and CPA3. The differential genes discovered in this study point towards involvement of different types of leukocytes in the E- and A-dominant COPD sub-phenotypes in males and females.


Contents Tables and Figures
Cases were classified as emphysema dominant, airway disease dominant, mixed and mild based on quantitative CT analysis, and each group was subdivided in those without or with the most extreme features of the CT measures (see Subramanian et al for details). The noCT validation samples were not used for discovery of DEGs but only for validation controls > 0.75 are controls who have an FEV1/FVC ratio of greater than 0.75 such that they lung function is clearly separated from the COPD specific cut-off level of 0.70. Only these controls were used.          All FASTQ files produced from sequencing of RNAseq samples were aligned to a human reference genome using CASAVA software version 1.7 to assess sequence quality. Bronchial brushing and bronchoalveolar lavage samples coming from the same patient were compared based on SNPs called from their RNAseq data in order to ensure proper assignment. Next we selected the samples based on the number of reads (a minimum of 10M paired-end reads mapped to genes) and the average quality in order to avoid biases due to sample pooling.
The selected RNAseq samples were then aligned with the GEMTools RNAseq pipeline v1.7 (http://gemtools.github.io), which is based on the GEM mapper 1 . The pipeline aligns all reads in a sample in three phases, mapping against the reference genome, against a reference transcriptome, and against a de-novo-transcriptome, which was generated from the input data to detect new junction sites. The transcriptome was generated from version 15 of the Gencode annotation. After mapping, all alignments were filtered to increases the number of uniquely mapping reads. The filter criteria contained a minimum intron length of 20, a minimum exon overlap of 5 and a filter step against the reference annotation checking for consistent pairs and junctions where both sides align to the same annotated gene. Quantifications and read counts were calculated using the Flux Capacitor 2 . This tool quantifies RNAseq reads on isoform level and the resulting counts were summed up to create gene-level read counts that were used for the differential expression analysis.

Supplement: LASSO method
For LASSO regression the glmnet R package with family='binomial' was used. The input expression dataset contained the 40 DE genes between control and cases for males and the 73 DE genes between control and cases for females. Cross-confirmation was performed with controls and cases with CT scan.
For prediction (male dataset), 36 controls and 36 cases without CT scan were used. We run 100 iterations each time randomly selecting 36 controls for prediction and taking out these 36 controls from the crossconfirmation. The predictions were done with lambda 1 sd, For the female dataset, the same methodology was applied but with 17 no CT cases and 17 randomly selected controls for prediction. The negative and positive predictive values were calculated for each iteration and then averaged for all iterations. Genes with coefficients >0 and appearing >= 90 times out of 100 iterations were selected in the male dataset for plotting. For the female dataset, genes with coefficient >0 and appearing >= 65 were selected for plotting. Supplement: RT-PCR Method RNA isolated using the AllPrep DNA/RNA Mini Kit (#80204, Qiagen, Hilden, Germany and 100 ng RNA was reverse-transcribed using MuLV Reverse Transcriptase (#N808-0018, Thermo Fisher Scientific), RNase Inhibitor (#N808-0019, Thermo Fisher Scientific), and oligo(dT) (N808-0128, Thermo Fisher Scientific) as primer according to the manufacturer's instructions.
Using the LightCycler 2 system (Roche Diagnostics, Mannheim, Germany), semiquantitative PCR was performed with primers given in Table S7. In brief, 2µl cDNA was used with the Fast Start SYBR Green I Master Mix (#04707516001, Roche) with the following settings: denaturation for 10 min at 95°C; 40 cycles of annealing for 10 sec at 60°C, elongation for 25 sec at 72°C, melting for 5 sec at 95°C; and a final melting curve from 65 to 95°C with an increase of 0.2°C/sec. Performing genomic DNA for PCR, 9 ng DNA was used with the following settings: denaturation for 10 min at 95°C; 50 cycles of annealing for 5 sec at 58°C, elongation for 5 sec at 72°C, melting for 10 sec at 95°C; and a final melting curve from 65 to 95°C with an increase of 0.2°C/sec. (1994)) were treated with H2O2 (# H1009, Sigma-Aldrich) for 4 hs, cells were harvested by using trypsin-EDTA buffer (Sigma #T4049) and RNA was isolated from Tri-reagent (Sigma #T9424) lysates using chloroform extraction according to manufacturer's instruction and RNA reverse transcribed and amplified as above. RNA was isolated and RT-PCR for CXCL1 versus alpha-enolase was performed. Results are normalized to alpha enolase and expressed as fold increase over untreated cells, n=3, *p<0.05, Mann-Whitney U Test. Supplement: Analysis of differential genes versus controls in male and female COPD.
For males the gene expression levels for the 40 genes in the individual samples are illustrated in Figure  S1A and this indicates that many cases show a strong increase of the upregulated genes and simultaneously a strong decrease of the down-regulated genes. For confirmation we analysed a separate group of n=36 male COPD patients, who were also recruited under EvA but without CT data and who were not included in the discovery group of n=173 (see Table S1). The cases from the confirmation group (blue top bar in figure 1A) also showed the same up-and downregulation pattern of the respective genes.  Table S3 cases and controls were grouped as red: male cases n= 173, green: male controls n= 145, blue: male confirmation group n= 36. The confirmation group consisted of patients, who have not had a chest CT and had not been included in the discovery group of samples.
When using the LASSO approach (least absolute shrinkage and selection operator) to define uncorrelated genes with the strongest predictive value, we found CA12, CCDC81 and SCGB1A1 as top predictors and these were able to separate cases and controls (Fig S1C). Here the confirmation cases (blue) were congruent with the discovery group of cases.
For females the levels of gene expression for the 73 genes in the individual samples is illustrated in Figure  S1B and this indicates that many cases show a strong increase of the upregulated genes and simultaneously a clear decrease of the down-regulated genes when compared to controls. For confirmation we analysed a separate group of n=17 female COPD patients, who were also recruited under EvA but without CT data and who were not included in the discovery group of n=86 (see Table S1). The cases from the confirmation group (blue top bar in Figure S1B) also showed the same pattern of up-and downregulation of the respective genes. We did not obtain a significant p-value (Spearman corr=0.01, p-value =0.27 for the rank preservation test between DEGs for male and female COPD. Hence, the pattern of DEGs is clearly different between males and female also by this approach. Of note, CYP1B1 is one of the top DEGs in female COPD. While this cytochrome is involved in metabolism of steroids we found no association of glucocorticoid use (51.5% of females on inhaled glucocorticoids) and CYP1B1 expression (p=0.39, Wilcoxon rank test).  Table S3 cases and controls were grouped as red: female cases n= 86, green: female controls n= 73, blue: female confirmation group n= 17. The confirmation group consisted of patients, who have not had a chest CT and had not been included in discovery group of samples. Figure S1A, B was produced using the R package pheatmap, pheatmap: Pretty heatmaps [Software] R Kolde, URL https://CRAN. R-project. org/package= pheatmap. When using the LASSO approach to define genes with the strongest uncorrelated predictive value for the female DEGS, we found Serpin B7, VGLL3 and CYP2A13 as top 3 predictors and these were able to clearly separate cases and controls (Fig S1D). Here the confirmation cases (blue) showed a pattern similar to the discovery group of cases. Supplement: RNAseq transcript levels for CEACAM5 and CA12 for males and females and RT-PCR confirmation of gene expression data in COPD We then took a closer look at the genes with increased expression in COPD and analysed the top genes shared between males and females, i.e. CA12 and CEACAM5. For CA12 the RNAseq expression values in females were slightly lower but the increase of the median in COPD compared to controls was similar in both males and females at around 3-fold (see Figure S2A). A similar pattern was seen for the expression of CEACAM5 (p<0.0001, for all case control comparisons). In the separate confirmation group of patients, we confirmed the significantly higher levels compared to controls for CA12 and CEACAM5 in both males and females (blue boxes in Figure S2A). For experimental confirmation of the RNAseq expression pattern, we used RT-PCR on the RNA of the very same male samples that had been subjected to RNA sequencing. As shown in the supplementary material in Figure S2B there was a positive correlation between the two types of assessment both for CA12 and for CEACAM5. B (1-4) Samples from cases and controls that had been used for RNA sequencing analysis were used for reverse transcription and PCR amplification using primers given in Table S2. Expression levels for CEACAM5 (A,B) and for CA12 (C,D) were normalized to the levels of alpha-enolase. Spearmann's R test, , p <0.001, rs =0.733 for CEACAM5, p <0.001, rs =0.915 for CA12. 3 Among the males 20 and among the females 22 genes were also found by Steiling et al ( 3 ) and there was an overlap of altogether 27 genes (see Figure S3). Of note there are substantial differences in the patient cohorts in that the EvA patients contained no current smokers and current smoking may induce additional inflammatory genes. When analyzing our data set with males and females combined, i.e. comparing 259 cases to 265 controls then we identify 41 differential genes. Of these, 22 genes were also found by Steiling et al 3 (see Table S3). Hence, also in the combined male female analysis we detect 19 additional differential genes. Conversely, Steiling et al found 76 DEGs not detected in our combined analysis (see Table S3) Supplement: Strategy for identification of DEGs unique for emphysema-dominant and airway disease dominant COPD in male brush samples We compared gene expression in controls (n= 145) to the sub-phenotypes using permutation statistics. With a cut-off of at least a 1.5 fold change of the median expression level we detected 45 DEGs with a significantly different median for controls versus E and 92 DE genes for control versus A (see Table S4A). When we subtract the genes in common for controls versus E and controls versus A at the median (Table  S4B) then we have at the median 4 DEGs unique for controls versus E and 21 DEGs unique for controls versus A (Table S4C ctl vs E and A male female unique DE genes, column M) and here the delta casecontrol is at least 1.5-fold higher in E compared to A. The same approach was applied to the analysis at the lower and upper quartile in order to detect additional genes, which are differentially down or up-regulated in a subset of patients. When combining all of these genes then we found a total of 15 unique DEGs for E and 43 unique DEGs for A (Table S4C ctl vs E and A male female unique DE genes, column AK). Also, we compared gene expression in controls to the Eex and Aex sub-phenotypes. While this reduces the power by 50% it has the potential to identify additional genes because of the more pronounced and more specific pathology in these cases. Here we found at the median 18 unique DE genes for Eex and 17 unique DE genes for Aex (see Table S4C, column AZ). Combining DEGs identified at lower quartile, upper quartile and median for the extreme samples we find a total of 43 DEGs for Eex and of 54 DEGs for Aex (column BX, Table S4C) and this includes 35 additional genes found for Eex and not for E and 47 additional genes found for Aex and not for to A. These data illustrate that the analysis of the extreme phenotypes generates additional informative genes that are characteristic of sub-phenotypes of COPD.

Supplement comparison to Steiling et al
When combining all of the data then we find at total of 51 unique DEGs for controls versus E and Eex (Column CB, Table S4C) and of these 39 DEGs were upregulated (Column CE, Table S4C). For controls versus A and Aex there was a total of 90 unique DEGs with 50 upregulated in cases.
Also in females, 18 DEGs were detected only with the Eex approach and 10 only with the Aex approach demonstrating that the analysis of the extreme phenotypes can lead to detection of additional genes also in females.
This study has focused on the impact of COPD and its sub-phenotypes on the airway transcriptome in males and females. Differences in airway gene expression in healthy males versus females have not been analyzed at this point but are likely to exist 4 .
Additional variables like age and packyears may impact on the DEGs in COPD and it sub-phenotypes, but our analysis did not detect for high and low age and packyears a difference in expression of top DEGs DEGs CA12, MT-CO2, CSF3R, CD36, CPA3 and BST2 (Wilcoxon rank test, p>0.05). Still, a more in depth analysis based on GLM might be able to detect a contribution of these parameters. Supplement Impact of statistical power: With the comparison of controls versus E-dominant and controls versus A-dominant cases we have been able to detect 51 DE genes unique for the emphysema-dominant phenotype and 90 DE genes for the airway-disease dominant phenotype in males. For the genes found unique for E we have asked whether there are many among them for which the p-value between controls versus E is just below the cut-off while for control versus A it is just above the cut-off such that an impact of statistical power would be likely.
However, we found only 2 of the 51 E-dominant genes for which the FDR-value for A was less than 10-fold lower in E (CTXN1, IGFBP5) genes marked "y" in Table S4D). Hence, for 49 of the 51 DEGs the significant level of E was more than 10-fold lower compared to the non-significant p value in A. This argues that the difference in power has little impact on the genes identified. Also, the control versus A analysis has revealed also twice as many DE genes compared to controls versus E, in spite of the lower power with only 32 A samples compared to 50 E samples for males.
Likewise, when looking at female samples only 3 of 72 genes unique for E had a less than 10-fold lower pvalue in E compared to A, such that in females power appears to have little impact on the DEG discovery as well (Tables S4D). On the other hand, the failure to detect significant differences by direct comparison of the E and A subphenotypes using permutation analysis may be explained by the low statistical power based on the low number of samples with the two CT phenotypes. When analysing data from the perspective of a larger control population we found many genes only differential in comparison to A or only differential in comparison to E. These genes show a differential gene expression pattern between E and A, but here they are not significantly different in direct comparison after correction for multiple testing. Supplement: DEGs shared by emphysema-dominant and airway disease dominant sub-phenotypes In males we found n=86 DEGs to be differential for both controls vs E and controls vs A sub-phenotypes (see Table S4B) and of these 29 (34 %, tagged yellow in Table S4B) had also been detected in the case versus control analysis (Table S3). For the 86 DEGs shared by E-dominant and A-dominant cases the GO term "regulation of response to stress" is apparent with altogether 16 DE genes and there is "regulation of response to stimulus" with 25 genes. The latter term includes leukocyte associated genes like FCGR2A, IL8 and CXCR4 (see Table S5).
In females there were 37 genes found for both E and A sub-phenotypes (Table S4B) and 24 (65% tagged yellow in Table S4B) had also been detected in the case versus control analysis (Table S3). The dominant biological process found among the 37 DEGs in common for the female E-and A-dominant COPD cases was "cell migration" (see Table S5).

Supplement: haemoglobin
We have taken a closer look the E-associated top gene haemoglobin-beta (HBB), which showed a 5.2-fold higher median expression level in cases compared to controls. For confirmation of the RNAseq derived gene expression data we performed RT-PCR on the very same RNA samples, which initially had been used for RNA sequencing. Here we found a significant correlation of expression levels between RT-PCR and RNAseq in the airway brush samples ( Figure S4).

Figure S 4 Confirmation of HBB RNAseq expression levels by RT-PCR.
Samples from controls and case with either E-dominant or A-dominant COPD that had been used for RNA sequencing analysis were used for reverse transcription and PCR amplification using primers given in Table S2. Expression levels for HBB were normalized to the levels of alpha-enolase. (Spearmann's R test, p < 0.001,rs =0.822) While this upregulation of HBB in bronchial brush samples of E-dominant COPD is surprising, HBB in cells other than the erythrocyte lineage has been reported earlier and this includes macrophages, lens cells, alveolar type II cells, club cells and bronchial epithelial cells 5 , 6 , 7 , 8 . Also we have detected HBB transcripts in the A549 human lung cell line by RT-PCR (data not shown). In our set of cases there was no correlation between bronchial HBB and pO2 level in blood and the role of this gene in the pathophysiology of Edominant COPD remains to be elucidated.
The finding of HBB expression in lung epithelial cells and cell lines and in bronchial brush samples, supports the notion that HBB is a genuine product of bronchial cells. On the other hand, it is unlikely that the HBB transcripts are derived from contaminating erythrocytes, because only immature red blood cells express low level mRNA and because the red cell marker gene glycophorin A was not among the DEGs in male emphysema in our study.
Supplement: Interaction analysis of DEGs upregulated in male brush samples for COPD sub-phenotypes Figure S5 Interaction analysis for genes upregulated in E and A sub-phenotypes of male COPD. DEGs unique for A-dominant (S5A) and E-dominant (S5B) COPD were analyzed for physical and functional protein interactions were constructed using the string 10.5 database.

Supplement: RT-PCR confirmation of gene expression data for MT-CO2
For confirmation of the RNAseq-derived mitochondrial gene expression data, we performed RT-PCR on the very same RNA samples, which initially had been used for RNA sequencing. For this, we focussed on MT-CO2 and as shown in Figure S6, the pattern for RT-PCR was similar to the results obtained with RNAseq and there was a strong correlation between the two types of gene expression analyses. Figure S6 Confirmation of RNAseq expression levels by RT-PCR. Samples from controls and case with either Eex or Aex COPD that had been used for RNA sequencing analysis were used for reverse transcription and PCR amplification using primers given in Table S2. Expression levels were normalized to the levels of alpha-enolase. (Spearman's test, p < 0.005, rs =0.619) Supplement: Un-supervised hierarchical clustering of MT-genes, macrophage associated genes and neutrophil associated genes in COPD sub-phenotypes Genes either selected based on string analysis (MT-genes) or using GO profiler analysis (neutrophil and macrophage associated "stress" genes) where analysed in hierarchical clustering of male samples. Eex and Aex samples gave a clear pattern ( Figure S7), while samples from the noCT confirmation group was not informative (data not shown) Unsupervised hierarchical clustering of Eex and Aex samples revealed a group of samples with low MT genes, which goes along with increased macrophage genes, while high neutrophil genes are associated with high MT genes (see Figure S7). Figure S7 Un-supervised hierarchical clustering of MT-genes, macrophage associated genes and neutrophil associated genes in Eex and Aex COPD sub-phenotypes. A group of samples (mainly Eex) showed high MT-gene levels and high neutrophil associated DEGs CSF3R, S100A8 and MNDA, while a group of mainly Aex samples showed low MT-gene expression levels and high macrophage associated gene expression. The figure was produced using the R package pheatmap, pheatmap: Pretty heatmaps [Software] R Kolde, URL https://CRAN. R-project. org/package= pheatmap.
The association of neutrophil genes and MT genes in these samples (see also Figure 5) suggests that combining these genes might improve the predictive power for molecular identification of emphysemadominant disease. However, the product of the normalized values for MT-CO2 and CSF3R did not correlate with TLCO/Va in the no CT confirmation group (data not shown). When evaluating the top 5 samples with the highest product of MT-CO2 and CSF3R expression levels then this gave no significant difference (p= 0,178, Mann-Whitney-U-Test) for TLCO/Va, when compared to the bottom 10 samples for this value, quite in contrast to the findings for the set of 8 MT genes alone (see Figure 3C). A larger cohort of patients will be required to address the question of predictive use of neutrophil genes.
Supplement String interaction analysis of DEGs in female sub-phenotypes Figure S8 Interaction analysis for genes upregulated in E and A sub-phenotypes of female COPD. DEGs unique for E-dominant (S8A) and A-dominant (S8B) COPD were analyzed for physical and functional protein interactions were constructed using the string 10.5 database. Figure S9 Venn diagram of male versus female E and A associated genes Supplement: DEGs in mixed and normal CT sub-phenotypes of male COPD Patients with both high wall area and low lung density in chest CT were defined as mixed cases (see 9 ). When comparing gene expression in such cases with a mixed phenotype then we found 14 DEGs (Table  S4A) and none of these was unique to either E or A, (Table S4C). Analysis of mixex cases revealed no DE genes.
Furthermore, cases with low wall area and high lung density were defined as normal-CT cases. These formed the largest group of 81 male cases in our cohort. For these we found 119 differential genes for normal-CT plus normal-CT-ex analysis (Table S4A) and this included 34 genes that had been found to be unique to E-or to A-COPD sub-phenotypes (Table S4C). Looking at the most extreme normal-CT cases, i.e. those clearly separated from cases with abnormal CT phenotype than we picked up 24 of these 34 DEGs unique to either E or A. This indicates that patients with mild disease and normal CT measures of COPD can already express genes, which are characteristic of more advanced dichotomized COPD. Supplement: DEGs in mixed and normal-CT sub-phenotypes of female COPD For females we had only 7 mixed cases but we still could detect 38 DE genes (Table S4A) and of these 1 was unique to A and 9 were unique to the E sub-phenotype (Table S4C). Analysis of mixex cases revealed no DE genes. When looking at normal-CT cases then we found a total of 195 DEGs compared to controls (Table S4A) and of these 51 were also detected among the E-and A-sub-phenotype (Table S4C). Of these 51 genes only 7 were shared with males. Finally, the normal-CT-ex female cases, i.e. those with a clearly normal CT phenotype, also expressed 29 DEGs found to be unique to either E or A (TableS4C) and of these only 2 were shared with males. Taken together similar to males, females with mild disease and normal CT measures of COPD can already express genes, which are characteristic of more advanced dichotomized COPD.
It is, however, a matter of debate as to whether xCell is adequate for this type of analysis, since it covers 64 cell types and excludes genes expressed by carcinoma cells. Among the 64 cell types are -in addition to various immune cells -23 other cell types including adipocytes, hepatocytes, skeletal muscle cells, chondrocytes, melanocytes and astrocytes. To this set of cells, the following strategy was applied: "For each data source independently we identified genes that are overexpressed in one cell type compared to all other cell types." (Aran et al. Genome Biology (2017) 18:220). With this approach, typical macrophage genes like CD36 are associated with melanocytes and not with macrophages. Also, typical genes induced in inflammation like IL-2 and TNF are not covered, such that information on cytokines characteristic for T cells and macrophages are not considered. We speculate that an amended version of xCELL, dedicated to leukocytes only, might result in significant results for all of the differential cell types found in our study.
In any event, additional studies, including single cell sequencing, are required to confirm the presence of neutrophils in E-dominant and not in A-dominant disease and the presence of macrophages in A-dominant and not in E-dominant disease in male COPD patients.