Introduction

Chronic obstructive pulmonary disease (COPD) is characterised by persistent respiratory symptoms (shortness of breath, cough, sputum production) and airflow limitation originating from a mixture of small airways disease and parenchymal destruction (emphysema)1. The disease is caused by the exposure to noxious particles and gases with cigarette smoking as the main risk factor, especially in high-income countries1,2. In 2015, 3.2 million people died of COPD, making it the 3rd leading cause of death worldwide after ischaemic heart disease and stroke3,4. Despite the global burden of COPD, which is predicted to increase further in the following years5,6, knowledge regarding its pathogenesis and underlying molecular mechanisms remains limited. A better understanding of these mechanisms could potentially lead to new targets for prevention, treatment and prognosis of COPD.

One possibility to investigate differences in molecular mechanisms between COPD patients and a control population without COPD is to perform gene expression analyses of clinically collected lung samples. Such studies have previously been performed both on samples from the whole lung tissue (WLT)7,8,9,10,11,12 and the airway epithelium (AE)13,14,15 of COPD and control subjects, revealing different sets of differentially expressed genes (DEGs) associated with the disease. One difficulty of expression analysis is the large resulting number of DEGs, which normally ranges between 100 and 300, making an in-depth analysis of every single gene too challenging. Common ways to select key important COPD-associated DEGs are to focus on the strongest regulated genes or to select them based on an association with biological functions which potentially are disease-relevant. A recent study by Mostafaei et al. pre-selected 90 genes associated with the progression of COPD. To further reduce the set of genes, different machine learning-based and statistical methods were tested and reduced the set to a manageable 44 genes associated with COPD or lung function14.

Another common drawback of gene expression studies in COPD is a limited access to a large number of lung samples. Some of the reported COPD studies had access to only a low number of patients (i.e. lung samples)7,8,9, originating from the highly invasive nature of collecting such samples. Gene expression analyses based on low numbers of patient samples are at risk in underestimating the heterogeneity of such samples, partially leading to the poor replicability of the resulting sets of genes between studies16. One manner to increase the sample size is by combining several independent gene expression datasets as has been done by previous studies12,15. Careful scrutiny of the different datasets should then be taken to account for variations in age, health status, smoking status, the presence of comorbidities, etc. Nevertheless, this approach can remove variability which stems from different sampling methods, geographical and ethnical differences, among others.

Here, we combined several public gene expression datasets to create a large transcriptome dataset containing WLT and AE samples from 405 and 411 subjects, respectively. Each dataset was tested for DEGs, which were only included for further analysis if they showed a consistent occurrence and the same trend of regulation in several datasets of the same lung sample type. These two criteria increase the likelihood that the remaining genes are associated with COPD in all the tested populations. We further showed that the resulting sets of 436 and 663 DEGs can be used to distinguish COPD from control subjects using principal component analysis (PCA). To decrease the number of COPD-associated DEGs, we developed a novel approach combining PCA with a random depth-first search (RDFS). This approach randomly removes single genes while testing, using the first principle component, for an improvement in distinction between COPD and control subjects. In contrast to other black-box, machine-learning based methods, our gene set reduction method is based on a computational search-tree approach that locally maximises the distinction between COPD and control subjects using the first principal component, making the results easily interpretable. The method identified two sets of 10 and 15 genes (for the WLT and AE, respectively), which persistently appeared in all the resulting sets of discriminatory DEGs. Most of these 25 genes have not been previously associated with COPD and therefore reveal new potential players in the underlying molecular mechanisms of the disease.

Results

Included datasets and subject characteristics

We included five different GEO gene expression datasets in our analysis. These microarray datasets contained in total 405 and 411 subjects sampled from the WLT and the AE (bronchial and small AE (SAE)), respectively. We matched all COPD and control subjects of a dataset for the type of microarray used and their current smoking status (former smoker (FS) or current smoker (CS)). The latter means that statistical comparison was only conducted between COPD and control subjects with the same smoking status as this should ensure that DEGs emerge based on the disease status and not the differences in smoking status. This matching resulted in a list of seven comparison groups (WLT1-3 and AE1-4) (Table 1). Statistical testing showed that all comparison groups were matched for age and gender (pā€‰>ā€‰0.05) except for AE3 (age: pā€‰<ā€‰10ā€“4) and AE4 (age: pā€‰<ā€‰10ā€“5). The comparison groups AE1-3 were also matched for pack-years (pā€‰>ā€‰0.05), while WLT1 and AE4 showed a significant difference (pā€‰<ā€‰0.01) in smoking history between COPD subjects and controls. For WLT2 and WLT3, no information regarding smoking history was available. The forced expiratory volume in 1Ā s (FEV1) % predicted was for all comparison groups lower in COPD subjects compared to controls.

Table 1 Overview of the included GEO gene expression data sets and their subject characteristics.

Selecting COPD-associated DEGs based on consistent occurrence and same sign of log2 fold change

Each comparison group was processed and tested for significant DEGs separately. In total, 17,249 genes were measured and tested in each group. The statistical testing revealed different numbers of significant (pā€‰<ā€‰0.05) DEGs for each comparison group varying between 1841 (AE3) and 6236 (WLT1) genes (Fig.Ā 1a). To increase our confidence of a DEG being associated with COPD in all the populations we tested, we selected DEGs based on the combination of the following two criteria:

  1. 1.

    Significant (pā€‰<ā€‰0.05) DEG in three or more comparison groups of the same lung sample type (white outlined in Fig.Ā 1b).

  2. 2.

    Same log2 fold change sign over three or more comparison groups of the same lung sample type.

Figure 1
figure 1

Selection of COPD-associated DEGs for the WLT and the AE. (a) Bar diagram showing the number of significant (pā€‰<ā€‰0.05) DEGs found in each comparison group. (b) Venn diagrams showing the overlap of DEGs between different comparison groups. We only compared groups of the same lung sample type. White-outlined sections mark DEGs that fulfil the first selection criteria (see Results) for COPD-associated DEGs. (c) Heatmap showing fold changes of the 436 and 663 COPD-associated DEGs from the WLT and the AE, respectively. The DEGs were sorted by the mean fold change over all comparison groups.

Selection criterion one assures that the selected DEGs reproducibly emerge in several independently tested populations. Criterion two assumes that DEGs are affected by COPD in a similar manner across all the populations. This excludes genes that are up-regulated in one comparison group, but down-regulated in another and vice versa. 436 DEGs and 663 DEGs satisfied these two criteria in the WLT and the AE, respectively. In both sample types, about two third of the DEGs were upregulated (WLT: 279/436 DEGs, AE: 446/663 DEGs) in COPD subjects compared to controls, while one third was downregulated (WLT: 157/436 DEGs, AE: 217/663 DEGs) (Fig.Ā 1c). 34 of these DEGs satisfied the two criteria in both lung sample types.

Distinguishing COPD from control subjects by applying PCA to expression values of COPD-associated DEGs

Based on our two selection criteria, we assumed that the two sets of DEGs are strongly associated with COPD and should distinguish COPD subjects from controls. To test this assumption for each lung sample type, we first combined the COPD and control subjects of the respective comparison groups into a single dataset (see Methods). When running a PCA on the rescaled expression values of the 436 DEGs from the WLT, we found that principal component 1 (PC1) described 29% of the variance in the data (Supplementary Figure S1). Importantly, there was a significant (pā€‰<ā€‰10ā€“17) difference between the rescaled PC1 scores of COPD (mean: 0.52) and control subjects (mean: 0.69) (Fig.Ā 2a, left). This distinction does not appear in the results of the PCA applied to all the 17,249 tested genes. Similarly, running the PCA on the 663 selected DEGs of the AE, PC1 described 25% of the variance in the data (Supplementary Figure S1) with a significant (pā€‰<ā€‰10ā€“23) difference between the PC1 scores of the COPD subjects (mean: 0.51) and the controls (mean: 0.33) (Fig.Ā 2a, right). We did not observe that the distinction between COPD and control subjects in PC1 emerged from the different comparison groups (e.g. WLT1 vs. WLT2 or AE2 vs. AE3), indicating that the rescaling of the data and the two selection criteria were successful in reducing larger batch effects in PC1 (Supplementary Figure S2). Nonetheless, some variance in gene expression remained from the differences in comparison groups (see PC2 in Supplementary Figure S2, middle column). The differences in PC1 scores led to a good performance in distinguishing COPD from control subjects for both sets of COPD-associated DEGs compared to all tested genes (Fig.Ā 2b). The area under the ROC curve (AUC) for the two sets of 436 and 664 DEGs was 0.79, while it remained close to a random guess for all tested genes (WLT: AUCā€‰=ā€‰0.55, AE: AUCā€‰=ā€‰0.56). Thus for the COPD-associated DEGs, PC1 is an indication for the distinction between COPD and control subjects.

Figure 2
figure 2

Distinction between COPD and control subjects using PC1. (a) Beeswarm plots comparing rescaled PC1 scores of control and COPD subjects for both lung sample types (WLT and AE). The PC1 scores are computed from either the gene expression values of all tested genes (17,249) or the corresponding COPD-associated DEGs. Only the COPD-associated DEGs lead to a significant difference (pā€‰<ā€‰0.05) between COPD and control subjects. Red lines with black error bars show the meanā€‰Ā±ā€‰SEM. (b) ROC curves comparing the performance of the computed PC1 scores (see (a)) in distinguishing COPD from control subjects.

Reducing the number of COPD-associated DEGs while improving distinction between COPD and control subjects using a RDFS

To test if a reduced number of COPD-associated DEGs can improve the distinction between control and COPD subjects, we used a combination of PCA and a RDFS approach. By randomly removing single genes, the RDFS searches for subsets of DEGs that improve the distinction between COPD and control subjects using PCA (Fig.Ā 3a). The searches were performed on 90% of the total number of subjects (training), while the validation of the performance was conducted on the remaining 10% of subjects (test).

Figure 3
figure 3

Search for small sets of discriminatory DEGs. (a) Visualisation of one iteration of our RDFS approach. The search starts with the full number of COPD-associated DEGs (=ā€‰N) on the left and continues to the right (black arrows) by randomly removing single genes. After a removal, the PC1 scores for the two subject groups are computed from the expression values of the remaining subset of DEGs (e.g. Nā€‰āˆ’ā€‰1) and the p value using a t-test is calculated. If the newly calculated p value is smaller than the previous one (e.g. pNāˆ’1ā€‰<ā€‰pN), the gene is removed entirely, and the search continues on that branch of the search tree by randomly removing another gene. If the p value is equal to or bigger than the previous one (e.g. pNāˆ’1ā€‰ā‰„ā€‰pN), the gene is returned to the set of DEGs and another random gene is removed and tested. The search ends if no removal of a gene leads to a decrease in p value and the remaining subset of DEGs (Nā€‰āˆ’ā€‰Lā€‰+ā€‰1, with Lā€‰=ā€‰depth of search tree) is the smallest set of discriminatory DEGs for this iteration. (b) Smoothed histograms showing the mean frequency of rescaled PC1 scores for COPD and control subjects over all ten search runs. The top ones consider only training subjects, while the lower ones consider only test subjects. The dashed line represents the mean threshold with the highest F-scores in distinguishing COPD from control subjects. (c) ROC curves showing the performance of the different sets of discriminatory DEGs in distinguishing COPD from control subjects. The dashed line represents the performance of a random guess.

Our RDFS approach resulted for both lung sample types in a strong reduction of the number of COPD-associated DEGs. For the WLT, the number over all ten search runs decreased from 436 to 35 (Ā±ā€‰5.62) DEGs on average (Supplementary Table S1). For the AE, the number of DEGs was reduced from 663 to 66 (Ā±ā€‰4.20) (Supplementary Table S2). These sets of discriminatory DEGs all resulted in a clear distinction between COPD and control subjects using PC1 for both the training and test subjects (Fig.Ā 3b and Supplementary Figures S3, S4, S5). Using again ROC curves to assess the performance, the sets of discriminatory DEGs for both lung sample types achieved a mean AUC above 0.9 for the training subjects (Fig.Ā 3c and Supplementary Tables S1, S2). In the case of the WLT, a similar performance was achieved on the test subjects, while for the AE, the AUC were on average slightly lower (0.87ā€‰Ā±ā€‰0.07) (Fig.Ā 3c and Supplementary Table S2). Comparing these performance measures to the initial sets of COPD-associated DEGs, the small sets of discriminatory DEGs for both lung sample types indeed improved the distinction between COPD and control subjects.

Persistent set of 10 and 15 DEGs remains during reduction process

During our search for smaller sets of COPD-associated DEGs, 10 and 15 DEGs persistently remained in the sets of discriminatory DEGs after each search run and were never removed during the reduction process for the WLT and the AE, respectively (see Tables 2, 3 for an overview). Their persistent appearance in the final sets of discriminatory DEGs made them a core set of DEGs that seemed indispensable when distinguishing COPD from control subjects. Notably, the number of persistent DEGs is exponentially reduced with the number of search runs and reaches a number close to 10 or 15 after 7ā€“8 runs (Supplementary Figure S6). We showed that they distinguish COPD from control subjects using PC1 (Fig.Ā 4a, selection subjects) with a resulting AUC of 0.85 and 0.79 for the WLT and the AE, respectively (Fig.Ā 4b). Similar to the 436 and 663 COPD-associated DEGs, no batch effects were observable in PC1 as there is no clear distinction between different comparison groups (Supplementary Figure S2). Moreover, the apparent batch effects we observed in PC2 for the COPD-associated DEGs were further reduced for the persistent genes. Interestingly, only a few of the 25 persistent DEGs have been associated with COPD before (see Discussion and Tables 2, 3).

Table 2 10 persistent DEGs of the WLT. Genes are sorted by the mRNA regulation in COPD and gene symbol (alphabetic).
Table 3 15 persistent DEGs of the AE.
Figure 4
figure 4

Study of 10 and 15 persistent DEGs. (a) Beeswarm plots comparing rescaled PC1 scores computed for COPD and control subjects using only the expression values of the two persistent sets of COPD-associated DEGs. For subjects used to select the persistent DEGs, the scores of the COPD subjects are significantly (pā€‰<ā€‰10ā€“20) different compared to the control subjects. The PC1 scores computed for the validation subjects show their distribution in comparison to the subjects used for selection. Red lines with black error bars show the meanā€‰Ā±ā€‰SEM (computed only for selection subjects). (b) ROC curves comparing the performance of the persistent DEGs in distinguishing COPD from control subjects using PC1 scores (see (a)). (c) Scatter plots showing the relation between the FEV1% predicted and the gene expression of DPYS in the WLT and URB1 in the AE of COPD and control subjects. The linear regression models the decrease in expression of DPYS and URB1 based on the increase in lung function, adjusted for comparison group, age and gender for DPYS or adjusted for smoking status (FS or CS), age, gender and pack-years for URB1.

To validate if the distinction between COPD and control subjects using the 10 and 15 persistent DEGs is reproducible, we tested their predictive capability in three independent gene expression datasets. The first dataset contained WLT samples from a Spanish population comparing 23 CS with COPD to 6 without COPD (GSE10317419), while the second one contained SAE samples from 8 FS with COPD and 14 without taken from a Canadian population (GSE5634120). The third dataset contained 135 SAE samples taken from an US American population (GSE2025721), but as many of them were already included in our comparison group AE3, we only considered 8 CS with COPD and 21 without COPD for validation. All 10 and 15 persistent genes except FRAS1 related extracellular matrix 3 (FREM3, from WLT) were measured in the corresponding validation datasets. To be able to compute the PC1 scores, gene expression values needed to be available for all persistent genes. We therefore imputed values for FREM3 based on the mean expression values of the three WLT gene expression datasets (see Methods). When computing the PC1 scores of the validation subjects, we observed a similar separation between COPD and controls (Fig.Ā 4a, validation subjects) with improved AUC values for both lung sample types (WLT: AUCā€‰=ā€‰0.91, AE: AUCā€‰=ā€‰0.85) (Fig.Ā 4b). Of the 25 persistent genes, 20 showed the same trend in regulation as found in our comparison groups of the corresponding lung sample type (Supplementary Figure S7). When adjusted for multiple testing, only one of the observed differences in gene expression was significant (qā€‰<ā€‰0.05) which presumably resulted from the low numbers of subjects.

To investigate possible associations between the persistent DEGs and lung function (i.e. FEV1% predicted) or lung damage (i.e. % emphysema), we applied multiple linear regression analysis to the total study population including both COPD subjects and controls. We analysed the association between persistent gene expression values and FEV1% predicted for both lung sample types, while values for the % emphysema were only available for two WLT comparison groups (WLT2 & 3). As FEV1% predicted and % emphysema exhibited moderate multicollinearity (variance inflation factor (VIF)ā€‰>ā€‰2.0), we ran separate regression models for these factors. Overall, most of the 10 and 15 persistent DEGs showed a significant association (pā€‰<ā€‰0.05) between their expression and FEV1% predicted when adjusted for age, gender, comparison group (only WLT), smoking status (only AE) and smoking history (i.e. pack-years; only AE) (Supplementary Table S3). From the AE, peroxisomal biogenesis factor 5 like (PLCG2) did not display any association with lung function, while WD repeat domain 4 (WDR4) and eva-1 homolog C (EVA1C) showed a significant association (pā€‰<ā€‰0.05) with lung function only in men and MOB kinase activator 3C (MOB3C) only in CS (Supplementary Table S3). The expression of dihydropyrimidinase (DPYS) from the WLT and URB1 ribosome biogenesis homolog (URB1) from the AE showed the strongest association (lowest p value) with lung function when adjusted for confounders (Fig.Ā 4c). Their expression both increased with decreasing lung function. The regression analysis on the 10 persistent DEGs from the WLT and % emphysema revealed similar results with 8 genes showing a significant association (pā€‰<ā€‰0.01) when adjusted for age, gender and comparison group. Pyroglutamylated RFamide peptide receptor (QRFPR) did not show any association, while autophagy related 3 (ATG3) was only associated with lung damage in men from WLT2.

Discussion

In this study, we aimed to identify a small set of robustly COPD-associated genes to extend the knowledge on molecular mechanisms underlying this chronic respiratory disease. To base our analysis on a large sample size, we combined several gene expression datasets containing in total 405 WLT and 411 AE samples from COPD and control subjects. Our focus on transcriptional data from microarray studies restricted the analysis to genes which were measured by the different arrays. The testing for significant DEGs (pā€‰<ā€‰0.05) revealed different numbers for each dataset. To limit their effect on the differences in expression, factors such as gender, age, current smoking status and smoking history were accounted for (see Results). Only DEGs with a consistent expression pattern over all datasets of one lung sample type were selected for further analysis. We showed that these selection criteria resulted in two lists of 436 and 663 COPD-associated DEGs. When examined by PCA, the first principle component accounted for about 30% of the variability in the data and significantly (pā€‰<ā€‰10ā€“10) distinguished COPD from control subjects. There remained an overlap between the two subject groups leading to an incomplete separation, likely resulting from uncertainties in the data. Uncertainties affect the expression of genes and arise for instance from differences in the time since smoking cessation or different lifestyles. Such information was unavailable to us and therefore not accountable for in the analysis. Based on our findings, we developed a novel computational approach combining PCA with RDFS to select subsets of COPD-associated DEGs which improve the distinction between COPD and control subjects. In contrast to the study of Mostafaei et al. which uses partly black-box models to select important subsets of genes14, our method selects genes based on their performance in distinguishing COPD from control using the first principal component. Our approach reduced the number of 436 and 663 COPD-associated DEGs to about 35 and 66 genes that showed an improved performance in distinguishing COPD from control subjects compared to the initial sets. This confirms that our approach managed to detect smaller subsets of COPD-associated DEGs which improve the distinction. Importantly, the average AUC values between training and test subjects were comparable, showing that the reduced sets performed well also on independent test data. While comparing the different reduced sets of DEGs from each lung sample type, we identified two sets of 10 and 15 DEGs which persistently appeared in all of them. These persistent DEGs not only performed well in distinguishing COPD from control subjects in the original datasets (used to identify the genes), they also showed an improved performance in three independent gene expression datasets including about 20ā€“30 subjects from a Spanish, a Canadian and an US American population, respectively19,20,21. This result suggests that our approach was able to select a core set of DEGs with high predictive capability for COPD in different populations without overfitting to the data used for the identification of the genes. When compared to the study of Mostafaei et al.14, only 1 of our 15 persistent DEGs from the AE was among the 44 candidate genes selected for their importance in predicting COPD. This small overlap between the two studies can be explained by differences between the initial sets of genes fed into the selection algorithms (our 663 COPD-associated DEGs vs. 90 genes associated with the progression of COPD), differences in methods used to identify the small sets of genes (our approach combining PCA with a RDFS vs. the combination of several different machine-learning based methods) and differences in sample size (411 in our study vs. 133). Finally, while our study largely rules out overfitting to one specific dataset, this might have contributed to the gene selection of Mostafaei et al. who did not perform external validation14. Although only one of the persistent DEGs was significantly differentially expressed (qā€‰<ā€‰0.05) in one of the three independent datasets presumably due to the low numbers of subjects, we showed that similar expression patterns (up- or downregulation) of the persistent DEGs are observable in COPD patients not included for the identification of the DEGs. Additionally, most of the persistent DEGs showed a significant (pā€‰<ā€‰0.05) association with lung function and lung damage when adjusted for confounding factors like age and gender using multiple linear regression models. These findings indicate that our small sets of persistent DEGs could be possible players in the pathogenesis of COPD.

For surfactant protein B (SFTPB, downregulated in AE of COPD patients), a role in COPD pathogenesis was previously established based on genetic polymorphisms 50,51,52, decreased protein concentrations in bronchoalveolar lavage fluid (BALF)53 of COPD patients and in vitro experiments41. SFTPB promotes alveolar stability by decreasing surface tension of epithelial lining fluid and has been shown to repress macrophage nitric oxide (NO)-mediated inflammation47. Thus, it appears to be a protective factor in healthy AE, while being transcriptionally repressed in COPD. Another AE candidate, glutamate decarboxylase 1 (GAD1, upregulated in AE of COPD patients) was among the 44 genes selected by Mostafaei et al. based on its importance in predicting COPD using an AE gene expression dataset14. Xiang et al. have shown that expression of GAD1 is increased in AE of persons with asthma and associated with goblet cell hyperplasia and mucus hyperproduction54. In another study, in vitro exposure of AE cells to tobacco smoke resulted in increased expression of GAD1 and the mucin MUC5AC, suggesting that GAD1 may also contribute to mucus hyperproduction in smoking-associated COPD33. None of the other genes identified here have previously been associated with COPD, although some are known to be involved in biological functions that are relevant for COPD pathogenesis, such as transmucosal zinc absorption (CRIP127), fibrosis (SCGB3A241,42), epithelial-to-mesenchymal transition (RAC139) and inflammation (also RAC136). External validation of changes in protein expression of these genes in lung samples from COPD patients, along with experimental studies, are required to corroborate their role in COPD pathogenesis.

Notably, our analysis is based on gene expression datasets originating only from populations in North America (USA and Canada). We validated the predictive capability and the expression of the 25 persistent DEGs in three independent gene expression datasets from Spain, Canada and the USA, but their number of sampled subjects was limited and a validation in bigger populations from different countries would be necessary to further strengthen their role and the evidence of their consistent change in expression in COPD patients. Additional data would be particularly valuable if linked to subject attributes such as time since smoking cessation. This would enable studying associations between these attributes and the expression of the 25 persistent DEGs.

To conclude, our novel computational approach identified both previously established and novel COPD-associated genes. Identification of previously identified genes supports suitability of our approach for identifying pathophysiologically relevant genes, whereas the novel genes (not previously associated with COPD) extend the available knowledge on molecular mechanisms contributing to lung damage in COPD. For many of these novel genes, a role in COPD pathophysiology is supported by their established cellular functions. Nevertheless, experimental follow-up studies are required to validate the differential expression and potential pathobiological involvement of these genes in the pathogenesis of COPD.

Methods

Study design

The aim of this study was to identify a small set of DEGs associated with the observed lung damage in COPD patients. To base our analysis on a large number of subjects, we combined several public gene expression datasets containing two different lung sample types (WLT and AE). We searched the gene expression datasets for COPD-associated DEGs using two selection criteria (see Results) and validated their ability in distinguishing COPD from control subjects using PCA. To afterwards reduce the number of COPD-associated DEGs, we developed a novel approach based on PCA and RDFS. This approach searches for subsets of DEGs that improve the distinction between COPD and control subjects using PCA while randomly removing single genes. The search was run and validated on different compilations of training and test subjects respectively to remove influences of population compositions on the resulting sets of discriminatory DEGs. The sets of DEGs that persistently appeared in all resulting sets of discriminatory DEGs of one lung sample type were considered as the final small set of COPD-associated DEGs. We further validated their expression and their ability in distinguishing COPD from control subjects using PCA in three independent gene expression datasets. We also investigated if their expression associates with lung function or lung damage when adjusted for factors like age, gender and smoking status using multiple linear regression.

Selection of gene expression datasets

We searched the NCBI Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) for human (Homo sapiens) datasets matching the search terms ā€œCOPDā€ or ā€œchronic obstructive pulmonary diseaseā€. We included datasets if they (1) contained clinically collected lung samples, (2) compared subjects with no confirmed COPD (controls) to subjects with COPD and (3) contained information about the age, gender, disease status and smoking history of the subjects. We further excluded datasets that had sampled only a small number of subjects (<ā€‰9) for at least one of the subject groups. Summary statistics for each included dataset are shown in Table 1.

Microarray data processing

We downloaded the raw microarray files and the corresponding clinical information file from NCBI GEO. For microarrays from Affymetrix (Human Genome U133 Plus 2.0 Array (GPL570), Human Gene 1.0 ST Array (GPL6244) or Human Genome U219 Array (GPL13667)), we additionally downloaded the CDF files from the Affymetrix Support by Product webpage (https://www.affymetrix.com/support/technical/byproduct.affx) and normalised the intensity values using the Robust Multi-array Average (RMA) procedure. For microarrays from Agilent (Whole Human Genome Microarray 4ā€‰Ć—ā€‰44Ā k (GPL6480) or SurePrint G3 Human GE 8ā€‰Ć—ā€‰60Ā K Microarray (GPL14550)) or Illumina (HumanHT-12 v4.0 Gene Expression BeadChip (GPL10558)), intensity values were normalised by applying RMA background adjustment, quantile normalisation and log2 transformation consecutively. We performed a quality assessment of all normalised microarrays using boxplots. We did not exclude any microarray. Before further analysis, probes were removed from each dataset if they (1) acted as microarray controls or (2) could not be matched to any Entrez Gene ID. If several probes matched the same Entrez Gene ID, only the one with the highest mean intensity values calculated over all microarrays was considered. Log2 fold changes were calculated by subtracting the mean log2 intensity values of the control from the mean log2 intensity values of the COPD subjects. Significant (pā€‰<ā€‰0.05) differences in the log2 gene expression values between control and COPD subjects were assessed using a two-sample t-test. p values were adjusted for multiple testing using the Benjaminiā€“Hochberg procedure. We implemented all microarray processing steps using MATLAB (R2019b, The MathWorks, Inc.), its Bioinformatics Toolbox and its Statistics and Machine Learning Toolbox.

Principal component analysis (PCA)

We used PCA to assess the distinction between COPD and control subjects. For the combined analysis of different datasets, the log2 gene expression values of each comparison group were rescaled to a range between 0 and 1 by the minā€“max normalisation. The rescaled gene expression values of all subjects were then combined in one dataset for each lung sample type. The PCA was run on the rescaled expression values of the genes mentioned in the corresponding Results parts. For the validation of the persistent DEGs, the log2 gene expression values of the three validation datasets were rescaled as described before. Values for missing genes were imputed by computing the mean of the rescaled log2 gene expression values from the corresponding comparison groups used to select the persistent genes. The means were computed for COPD and control subjects separately. To transform the rescaled persistent gene expression values of the validation dataset using the PCA computed on the corresponding comparison groups, we used the following formula:

$${S}_{val}=\left({X}_{val}-\mu \right)*L$$

with \({S}_{val}\) = PC scores for validation subjects, \({X}_{val}\) = rescaled log2 gene expression values of validation subjects, \(\mu \) = estimated mean for gene expressions and \(L\) = PC loadings computed on the original comparison groups. To check for a distinction between COPD and control subjects, we only considered PC1. The PCA was applied using MATLAB (R2019b, The MathWorks, Inc.) and its Statistics and Machine Learning Toolbox.

Random depth-first search (RDFS)

We implemented a RDFS55 approach to reduce the number of COPD-associated DEGs while improving the distinction between control and COPD subjects. During the search, random single genes get removed and the performance of the remaining set of DEGs in distinguishing COPD from control subjects is assessed using PCA and the p value resulting from a two-sample t-test. We used the t-test since it considers both the mean and the variance of the PC1 scores when computing the test statistics. A decrease in the p value can originate from an increase in distance between the two means or a decrease in variance (or both), two events that improve the distinction between COPD and control subjects. The process of our search approach is shown in Fig.Ā 3a and a pseudo-code of our recursive implementation can be found in the Supplementary Information. We implemented the RDFS using MATLAB (R2019b, The MathWorks, Inc.) and its Statistics and Machine Learning Toolbox.

We ran the RDFS for each lung sample type 10 times with 200 iterations. For each of these 10 runs, a new set of test and training subjects was compiled. The test subjects were randomly selected (10% of all subjects) and only used to assess the performance of the resulting sets of discriminatory DEGs. The remaining 90% of all subjects served as training subjects and were used during the search. For each of these runs, we obtained as many sets of discriminatory DEGs as we ran iterations. The one with the lowest p value was then chosen as the final set of discriminatory DEGs for this run. Due to the random nature of our approach (random selection of training and test subjects and random removal of single genes), not every run led to the same set of discriminatory DEGs.

Receiver operating characteristic (ROC) curves

The ROC curves were computed on the rescaled PC1 scores and compared the true positive rate (TPR) with the false positive rate (FPR) using variable PC1 thresholds. ROC curves and the resulting area under the ROC curve (AUC) give a general estimate of the performance of a set of DEGs in distinguishing COPD from control subjects. For choosing a fixed threshold to discriminate, we used the F-score. The F-score represents the harmonic mean between the TPR and the rate of true positives over all predicted positives known as positive predictive value (PPV). A value of 1 indicates perfect TPR and PPV values using the current threshold. The F-scores were calculated on the training subjects only. All performance measures were computed using MATLAB (R2019b, The MathWorks, Inc.) and its Statistics and Machine Learning Toolbox.

Multiple linear regression analysis

We tested if the gene expression of the two persistent sets of DEGs can be associated to lung function (FEV1% predicted) or lung damage (% emphysema) using multiple linear regression. The different subject characteristics (e.g. age, gender, FEV1% predicted) served as independent variables and gene expression values as dependent variable for the regression models. Categorical variables were coded as one or several dichotomous variables (e.g. gender: maleā€‰=ā€‰1, femaleā€‰=ā€‰0). Before we ran the multiple linear regression analyses, independent variables were investigated for multicollinearity using the VIF and for effect modification using stratified analysis. Collinear variables are not independent of each other, and we only included one of them in our regression models. If categorical variables were effect modifiers (i.e. the different categories led to different effects in association between lung function or lung damage and gene expression), we ran separate regression analyses for the different categories. We only considered a categorical variable as effect modifier if the association between lung function/lung damage and gene expression was significant (pā€‰<ā€‰0.05) for at least one category. We performed the multiple linear regression analyses using MATLAB (R2019b, The MathWorks, Inc.).

Statistical analysis

Differences in age, pack-years and FEV1% predicted distributions between COPD and control subjects of a comparison group were tested using a two-sample t-test, while differences in gender distributions were tested using a Fisherā€™s exact test. If the age or gender of subjects in a dataset were unknown, we used the statistical values reported in the original publication. A Lilliefors test was used to assess if the PC1 scores of COPD and control subjects followed a normal distribution. If this hypothesis had to be rejected (pā€‰<ā€‰0.05), a Mannā€“Whitney U-test was used to further assess if there was a significant difference (pā€‰<ā€‰0.05) between the two groups. If not, a two-sample t-test was used. We made an exception for the RDFS where we assumed normal distribution and always used a two-sample t-test to standardise the comparison. All statistical analysis was performed using MATLAB (R2019b, The MathWorks, Inc.) and its Statistics and Machine Learning Toolbox. If not stated differentially, values are shown as meanā€‰Ā±ā€‰standard deviation (SD).