Integrated transcriptomic correlation network analysis identifies COPD molecular determinants

Chronic obstructive pulmonary disease (COPD) is a complex and heterogeneous syndrome. Network-based analysis implemented by SWIM software can be exploited to identify key molecular switches - called “switch genes” - for the disease. Genes contributing to common biological processes or defining given cell types are usually co-regulated and co-expressed, forming expression network modules. Consistently, we found that the COPD correlation network built by SWIM consists of three well-characterized modules: one populated by switch genes, all up-regulated in COPD cases and related to the regulation of immune response, inflammatory response, and hypoxia (like TIMP1, HIF1A, SYK, LY96, BLNK and PRDX4); one populated by well-recognized immune signature genes, all up-regulated in COPD cases; one where the GWAS genes AGER and CAVIN1 are the most representative module genes, both down-regulated in COPD cases. Interestingly, 70% of AGER negative interactors are switch genes including PRDX4, whose activation strongly correlates with the activation of known COPD GWAS interactors SERPINE2, CD79A, and POUF2AF1. These results suggest that SWIM analysis can identify key network modules related to complex diseases like COPD.


Results
COPD correlation network. The network-based analysis implemented by the SWIM software (see Materials and Methods section) was exploited to identify disease genes and network modules associated with COPD status by using the GSE47460 dataset (training set) containing microarray gene expression profiling of lung tissue samples from 219 COPD cases and 108 controls 17,18 .
In order to check if the number of GWAS genes included in the DEGs is more than expected by chance, we randomly selected 2097 genes (i.e., the number of DEGs) from the original list of 17530 genes and repeated this procedure 1000 times. Then, the number of the 145 GWAS genes included in the DEGs was zscore-normalized and the p-value for the given z statistics was calculated; the p-value of 0.2 indicates that the number of differentially expressed GWAS genes is equal to what expected by chance. This observation is in accordance with the results obtained in 7 , where the authors showed that COPD GWAS genes were not differentially expressed in lung tissue samples.
The DEG matrix of 2097 rows (DEGs) and 327 columns (219 COPD cases + 108 control samples) was used as input to SWIM in order to build the COPD gene correlation network based on the Pearson correlation coefficient, where a threshold is set for the absolute value of the minimum correlation coefficient necessary to draw an edge (see Materials and Methods section). For the COPD correlation network, we set the correlation threshold equal to 0.57, which corresponded to the 98 th percentile of the entire correlation distribution ( Supplementary Fig. 1).
The obtained COPD correlation network encompasses 1665 nodes and 52513 edges. The most highly connected hub is EMP2 that codes a tetraspan protein of the PMP22/EMP family regulating cell membrane composition. It is down-regulated in COPD cases (Fold-change = 0.68, FDR = . ⋅ − 1 1 10 7 ) and is located in a genome-wide significant COPD GWAS region with a COPD-associated top SNP about 86 Kb from the transcription start site.
Module identification in the COPD correlation network. In order to detect the community structure of the network, SWIM used the k-means clustering algorithm, which partitions n objects (here, network nodes) into a predefined number N of clusters (modules). The quality of clustering was evaluated by minimizing the Sum of the Squared Error (SSE), depending on the distance of each object to its closest centroid. As a distance measure, SWIM used: x y ( , ) 1 ( , ) where ρ x y ( , ) is the Pearson correlation between expression profiles of nodes x and y. A reasonable choice of the number of clusters is suggested by the position of an elbow in the SSE plot (named "scree plot") computed as a function of the number of clusters (see Materials and Methods section). The COPD correlation ne twork consisted of 3 modules or clusters, varying in size from 190 genes in module 1, 1411 genes in module 2, and 64 genes in module 3 (Fig. 2a).
In order to check the quality of the k-means clustering algorithm implemented by SWIM, we grouped genes with correlated expression profiles into modules by using complete linkage hierarchical clustering coupled with the correlation-based dissimilarity dist x y xy ( , ) 1 ( , ) ρ = − . We compared detected modules with the ones obtained by SWIM with the k-means method, and we found that cluster 1 and cluster 3 are well separated meaning that their cluster detection is highly robust with respect to the clustering algorithm used ( Supplementary  Fig. 2).
Summarizing the profiles of the COPD modules. To summarize the overall expression profile of a given module in the COPD co-expression network, we exploited the module eigengene (Fig. 2b) defined as the first principal component of a given module 13 . We found that the first principal component across all modules is able to explain more than 85% of the data variance in each module, i.e. 96.7%, 95.4%, 88.8%, in module 1, module 2, and module 3, respectively (Fig. 2c). Thus, the module eigengene can be considered a representative gene able to condense each module into one profile. In light of this, we found that the eigengenes of module 1 and module 2 are both down-regulated in COPD cases (p-value = . ⋅ − 5 8 10 14 and p-value = . ⋅ − 3 6 10 16 , respectively), whereas the eigengene of module 3 is up-regulated in COPD cases (p-value = 2 9 10 13 . ⋅ − ), providing a general idea of the overall expression trend of each module.
Then, for each gene in a given module, the module membership can be computed as the correlation between its expression profile and the module eigengene 13 . We found high correlations within modules 1 and 3 with the mean module membership, equal to 0.71 and 0.67, respectively. However, the mean mo dule membership of mod ule 2 is lower, confirming the result obtained with the hierar chical clustering algorithm (Supplementary Table 3).
The first two genes with the highest membership in module 1 are CAVIN1 and AGER, both down-regulated in COPD cases (Fold-change = 0.8 and FDR = . ⋅ − 1 8 10 5 ; Fold-change = 0.65 and FDR = 1 8 10 5 . ⋅ − , respectively). CAVIN1 encodes a protein that enables the dissociation of paused ternary polymerase I transcription complexes (b) Heatmap represents DEGs clustered according to genes (rows) and samples (columns) by using one minus the Pearson correlation as distance. Colors represent different expression levels increasing from blue to yellow.
from the 3′ end of pre-rRNA transcripts. This protein regulates rRNA transcription by promoting the dissociation of transcription complexes and the reinitiation of polymerase I on nascent rRNA transcripts. This protein also localizes to caveolae at the plasma membrane and is thought to play a critical role in the formation of caveolae and the stabilization of caveolins. AGER, one of the most down-regulated genes in COPD cases, encodes the advanced glycosylation end product (AGE) receptor and interacts with several molecules implicated in homeostasis, development, and inflammation (Fig. 3). Interestingly, AGER is one of the most well-known candidate genes located in a significant COPD GWAS region with a non-synonymous SNP (located about 2 Kb from the transcription start site), which has been associated with multiple COPD-related phenotypes and COPD affection status 20,21 .
The first gene with the highest module membership in module 2 is CAPN2 that is down-regulated in COPD cases (Fold-change = 0.8 and FDR = . ⋅ − 7 3 10 7 ) and encodes the large subunit of the calcium-activated neutral protease calpain 2. It is worth emphasizing that smoking activates macrophages to produce several inflammatory mediators including proteases 22 . Increasing evidences indicate that chronic inflammatory and immune responses play a crucial role in COPD development and progression 22,23 . The chronic inflammatory process in COPD involves both innate (e.g., neutrophils, macrophages, T cells, innate lymphoid cells, and dendritic cells) and adaptive immune response (i.e., T and B lymphocytes) 24 . In particular, patients affected by COPD show a lung inflammation pattern characterized by an increased number of neutrophils, macrophages, and T and B lymphocytes. Consistently, we found that module 2 is more enriched in cell type-specific gene markers (i.e., immune gene signatures) known in literature 25 , with a total of 67 marker genes representative of six immune populations, all up-regulated in COPD cases (see Materials and Methods section, Fig. 3 and Supplementary Table 2).
The first two genes with the highest membership in module 3 are PRDX4 and KCND3, both up-regulated in COPD cases (Fold-change = 1.2 and FDR = 4 8 10 4 . ⋅ − ; Fold-change = 1.2 and FDR = . ⋅ − 9 8 10 6 , respectively). PRDX4 codes a protein that is an antioxidant enzyme with a key regulatory role in the activation of the transcription factor NF-kappaB. Instead, the gene KCND3 codes for potassium channel subfamily D member 3 and is located in a genome-wide significant COPD GWAS region, although the COPD-associated top SNP is located about 575 Kb from the transcription start site. The expression of PRDX4 is strongly positiv ely correlated in the COPD correlation network with SERPINE2, CD79A, and POUF2AF1, which were previously considered as putative interactors of genes at COPD GWAS loci 7 . The expression of KCND3 is strongly positively correlated with two of them (SERPINE2 and CD79A). Among KEGG pathways and GO Biological Processes enriched in module 3, we found annotations related to the regulation of the immune system and inflammatory response (see Materials and Methods section and Fig. 3). Gene expression data are log2-transformed and z-score normalized.
[BOTTOM] Bar plot of the expression levels of module 3 eigengene (y-axis) across samples (x-axis). Gene expression data are log2-transformed and z-score normalized. (c) The percent variability explained by each principal component (PC) computed for module 3, known as a Pareto chart, contains both bars and a line graph, where individual values are represented in descending order by bars, and the line represents the cumulative total value. The left y-axis represents the percentage of the data variance explained by each PC, the right y-axis represents the cumulative distribution, and the x-axis represents the PCs that are able to explain 100% of the cumulative distribution. PC1 represents the module eigengene and explains about 90% of the data variance.
Identification and characterization of switch genes. We classified nodes in the COPD correlation network using the date/party/fight-club hub classification system 11 , based on the Average Pearson Correlation Coefficients (APCCs) between the expression profiles of each hub and its nearest neighbors (see Materials and Methods section). Then, we assigned a topological role to each node based on their inter-and intra-cluster interactions and thus drew the heat cartography map for the COPD dataset, where party, date, and fight-club hubs were easily identified by red, orange, and blue coloring, respectively ( Fig. 4a and Supplementary Table 2). Through the heat cartography map, we are able to identify switch genes as a special subclass of fight-club hubs (APCC <0) characterized by having more links outside than inside their own cluster, while not being hubs in their own cluster (i.e., switch genes are fight-club hubs falling in the R4 region of the heat cartography map).
In the heat cartography map drawn for COPD randomized network, we observed a predominance of positive correlations and an absence of switch genes (Fig. 4b). To assess statistical significance to this observation, we repeated this procedure 1000 times and we calculated the number of switch genes in each COPD randomized network. We found that the number of switch genes in each randomized network was always less than three, with a mean of 0.6 and a standard deviation of 0.8. Then, the number of 62 switch genes found in the original COPD correlation network (Fig. 4a) was zscore-normalized and the p-value for the given z statistics was calculated; the p-value ∼ 0 indicates that the observed heat cartography map in the COPD gene expression dataset (Fig. 4a) is not a random event.
We found 62 switch genes in the COPD correlation network all resulting in gene up-regulation in COPD cases ( Fig. 5a and Supplementary Table 4). Most switch genes (74%) fall in module 3 ( Fig. 5b) and, mutually, module 3 is almost entirely populated by switch genes (73%), thus conferring to this module a well-characterized and defined signature as switch module.
COPD switch genes are all protein-coding, among which 4 transcription factors -including E2F3, HIF4, TAF10 of module 3 and RUNX1 of module 2 -and five other genes located within previously identified genome-wide significant COPD GWAS loci 5 (Fig. 5c).
Functional annotation analysis of switch genes reveals that they are mainly involved in the regulation of several functionalities related to the immune and inflammatory response, mirroring the enrichment results obtained for module 3 ( Supplementary Fig. 3).   . Switch genes are colored according to their associated cluster. (d) Robustness of the COPD correlation network. Blue curve corresponds to the cumulative deletion of non-switch hubs (i.e., the first 62 hubs that are not switch genes, sorted by decreasing degree); red curve corresponds to the cumulative deletion of the 62 switch genes, sorted by decreasing degree; the green curve corresponds to the cumulative deletion of randomly selected nodes. The x-axis represents the cumulative fraction of removed nodes with respect to the total number of 1655 network nodes (i.e., x-maximum is 62/1655 = 0.04), while the y-axis represents the average shortest path. www.nature.com/scientificreports www.nature.com/scientificreports/ Note that PRDX4 and KCND3, the first two genes with the highest membership in module 3, are also switch genes.
Removal of switch genes. Scale-free networks show a surprising tolerance against errors and the ability of nodes to interact is unaffected even by very high node failure rates. However, this error tolerance is paid at a high price in that these networks are extremely vulnerable to attacks, i.e., to the removal of a few nodes that play a vital role in maintaining the network's connectivity 26 .
We studied the tolerance of the COPD network against the removal of the 62 switch genes by comparing to the impact of the removal of the first 62 hubs that are not switch genes (called "non-switch hubs"). Both switch genes and non-switch hubs were sorted by decreasing degree and selected to be removed (Fig. 5d). Then, the effect on the average shortest path (i.e., the mean of the shortest paths for all possible pairs of nodes in the network) of the cumulative node deletion is evaluated.
We found that the removal of switch genes produces a drastic increase of the average shortest path, mirroring the effect caused by the deletion of the first 62 non-switch hubs (Fig. 5d). This means that switch genes play a vital role in maintaining the network's connectivity while not being the primary hubs. In fact, the first 62 nodes sorted by decreasing degree include only two switch genes (Fig. 6). On the contrary, the removal of 62 randomly selected nodes does not affect the integrity of the network (Fig. 5d).

Validation of switch gene identification.
To further assess the validity of the SWIM analysis in identifying disease genes and modules associated to COPD status, we applied the SWIM software on the GSE76925 dataset (test set), which contains microarray gene expression profiling of lung tissue samples from 111 COPD cases and 40 control smokers with normal lung function 7 . In this case, starting from 22631 genes, we obtained 887 significantly DEGs at 15% FDR, of which 493 (56%) were down-regulated in COPD cases, while the remaining 394 (44%) were up-regulated (Supplementary Table 1). To build the COPD correlation network, we selected a correlation threshold equal to 0.55 (corresponding to the 95 th percentile of the entire correlation distribution), which roughly guarantees to preserve the network integrity. A higher correlation threshold would cause a drastic drop in network connectivity.
The obtained COPD correlation network encompassed 667 nodes and 22595 edges, including 103 date hubs, 348 party hubs, and 71 fight-club hubs. From the COPD correlation network, SWIM extracted 61 switch genes all resulting in gene up-regulation in COPD cases ( Supplementary Fig. 4a, Supplementary Table 4). By studying the tolerance of the COPD correlation network against the removal of the 61 switch genes, similar to the other lung tissue gene expression dataset, we found that the removal of switch genes produces a drastic increase of the average shortest path, even overcoming the effect caused by the deletion of the first 61 non-switch hubs www.nature.com/scientificreports www.nature.com/scientificreports/ ( Supplementary Fig. 4b). This strongly supports the hypothesis of their putative key role in preserving the network's connectivity, while not being the primary network hubs (the first 61 nodes sorted by decreasing degree do not include any switch gene).
Notably, the list of 61 switch genes includes SPP1 that encodes adiponectin, which has been suggested as a protein biomarker for COPD 27 and TUFM, which is probably the best candidate within a strong COPD GWAS region on chromosome 16 28 .
The two analyzed datasets shared only one switch gene, i.e. SSR4, which appears in the top-ten switch genes of the training set (GSE47460 dataset) and it is up-regulated in COPD cases in both datasets. Previous analyses of gene expression differences in COPD have noted the challenges of finding consistent results across studies 29 . There is marked heterogeneity in the development of COPD even among people with similar cigarette smoking histories, which is likely partially explained by genetic variation making the functional understanding of the disease a formidable challenge. Network-based approaches enable modeling of the complex molecular interactions involved in COPD pathogenesis aiding translational understanding of the complex mechanisms underlying the disease. These approaches start from the assumption that complex diseases are rarely a consequence of an abnormality in a single gene, but are likely influenced by a network of interacting genes and proteins where diseases can be identified with localized perturbation within a certain neighborhood or module 30 . Therefore, the identification of these modules is a prerequisite of a compelling investigation of a certain pathophenotype. In order to investigate the extent to which the two lists (S1, S2) of switch genes are in close proximity in the Human Interactome (i.e., the cellular network of all physical molecular interactions), we used a network proximity measure and interactome database obtained from 31 : where the closest distance p S S ( 1, 2) is the average shortest path length between switch genes s 1 of the list S1 and the nearest switch gene s 2 of the list S2 (Fig. 7).
To evaluate the significance of the observed network proximity across the two lists of switch genes, we built a reference distance distribution corresponding to the expected distance between two randomly selected groups of proteins with the same size and degree distribution of the original two sets of switch genes in the human in teractome. This procedure was repeated 5000 times (Fig. 7).
Then, the network proximity measure across the two lists of switch genes was zscore-normalized by using the mean and the the standard deviation of the reference distance distribution. Subsequently, the p-value for the given z statistics was calculated. The obtained p-value <0.05 indicates that the proximity in the human interactome of the two lists of switch genes is lower than the mean of the network distances between any two sets of randomly selected nodes of the same size and degree.
To further investigate the extent to which the two lists (A, B) of switch genes are in the immediate vicinity of each other in the human interactome, we used a network separation measure that tests if the two lists form modules that are separated or overlap and is defined as follows 32 : is the proximity measure above-defined. A value for the separation measure s > =0 means that the two lists of genes map to proteins that in the human interactome are topologically separated, otherwise a va lue for separation measure s < 0 means that two gene sets are located in the same network neighborhood. By computing the separation measure between the two lists of switch genes obtained from the COPD training and test set, we obtained s = −0.074, meanifng that means that two gene sets are located in the same ne twork neighborhood (i.e., they overlap). www.nature.com/scientificreports www.nature.com/scientificreports/ To demonstrate the specificity of COPD switch genes with respect to another lung disease with an inflammatory component, we applied SWIM on a dataset from the acute respiratory distress syndrome (ARDS) available through the GEO public repository at accession number GSE76293 33 . This dataset collects microarray gene expression profiling of 12 lung samples from ARDS patients and 12 samples from paired (i.e., age and gender-matched) healthy volunteers (HVTs). The obtained ARDS switch genes were completely different from the ones found in the COPD training and test set. By computing the proximity and separation measurements between the list of switch genes obtained from the training set and from the ARDS dataset, we found that their proximity is not statistically significant (p-value = 0.2) ( Supplementary Fig. 5) and their separation is positive, meaning that the two lists of switch genes form two modules that are topologically separated in the human interactome (i.e., they do not overlap) (Fig. 8).
Interestingly, by studying the functional annotations of the two lists of switch genes from the COPD training and test set, we found that they share COPD-related KEGG pathways (Table 1) and GO biological processes (Table 2), namely the NF-κB and toll-like receptor signaling pathways, regulation of immune and inflammatory response and key processes in cell development. Among switch genes involved in the NF-κB pathway, we found SYK, BLNK, and LY96 in the training set (GSE47460 dataset) and IRAK4 in the test set (GSE76925 dataset), all up-regulated in COPD cases. Thus, despite the apparent discrepancy in the lists of switch genes across the two datasets, the observed proximity in the human interactome, as well as the shared COPD related functionalities suggest they are working together in determining the COPD pathophenotype.
Interestingly, performing the functional enrichment analysis on the list of switch genes obtained from the ARDS dataset, we found the same pathways affected as in Table 1.
Switch genes interacting with genes at COPD GWAS loci and with SERPINE2, CD79A and POUF2AF1. Looking at the COPD GWAS regional genes that are nearest network neighbors of switch genes in the training set (GSE47460 dataset), we found that switch genes are highly correlated and anti-correlated with 24 and 36 GWAS genes, respectively (Supplementary Table 5). Since a liberal region around 82 genomic loci associated with COPD was included (about +/− 1 Mb), approximately 5% of the genome was encompassed by those COPD GWAS regions. Thus, in order to check if the number of GWAS genes included in the positive/negative    www.nature.com/scientificreports www.nature.com/scientificreports/ nearest neighbors of COPD switch genes (i.e., 24 and 36 GWAS genes, respectively) is more than expected by chance, the nearest neighbors of COPD switch genes were randomly shuffled 1000 times preserving the degree of each switch gene and the interaction weights. Then, the original values (non-random values) of GWAS positive and negative nearest neighbors were z-score-normalized and the p-values for the given z statistics were calculated, that are 6.63 × 10 −76 and 3.93 × 10 −205 , respectively. This suggests that the observed number of GWAS genes included in the positive/negative nearest neighbors of COPD switch genes (i.e., 24 and 36 GWAS genes, respectively) is not a random event.
Interestingly, the list of 36 GWAS genes that are negative nearest neighbors of switch genes encompasses AGER and EMP2, which negatively correlate with 26 and 33 switch genes, respectively, of which 24 switch genes are in common (Fig. 9 left and Supplementary Table 5). Note that this signature in mainly due to the switch genes falling in module 3. In fact, AGER and EMP2 negatively correlates in module 3 with 25 and 30 switch genes, respectively, of which 23 switch genes in common, including KCND3, SSR4, LY96, and TIMP1. Overall, the 70% of the negative interactors of AGER and/or EMP2 in module 3 are switch genes.
Looking at genes that have been previously considered as putative interactors of genes at COPD GWAS loci 7 , we found that 22 switch genes are strongly positively correlated with SERPINE2, or with CD79A, or with POUF2AF1 ( Fig. 9 right). Among them, we found PRDX4 and FKBP11 that are positively correlated with all three GWAS interactors POUF2AF1, SERPINE2, and CD79A.

WGCNA network analysis.
To test the SWIM performance, we applied the commonly used Weighted Gene Coexpression Network Analysis (WGCNA) framework on the COPD training set (GSE47460 dataset) to identify gene modules associated with COPD case-control status 13 . An unsigned network was built by using the Pearson correlation metrics and a soft thresholding power equal to 5 was set in order to guarantee a scale-free topology ( Supplementary Fig. 6a). The final network consisted of 12 modules (labeled by color), ranging in size from 43 to 720 genes, each containing a set of unique genes ( Supplementary Fig. 6b). The grey module is a grouping of genes with outlying gene expression profiles and was not considered further. Tests of association between phenotype variables of interest and the module eigengenes were performed for each model and the results were summarized in a heatmap ( Supplementary Fig. 6c). The phenotype variables predicted dlco, predicted fev1 (post-bd and pre-bd), predicted fvc (post-bd and pre-bd), and smoker status (i.e, current, ever, or never) decrease with COPD disease status, while emphysema increases with COPD disease. To identify groups of correlated module eigengenes, a hierarchical clustering methods was exploited quantifying the module similarity by eigengene correlation (Supplementary Fig. 6d). The purple and the brown modules were the most significantly associated with COPD case-control status (FDR < 0.05). Driver genes in these two modules were identified using the module membership measure and the intra-module node degree. We found that driver genes for the brown module include EMX1, VWA7, LCE1A; while driver genes for the purple module encompass RPL17, RPL5, CRACR2B ( Supplementary Fig. 7). None of these driver genes are known to be related to COPD. Moreover, the functional enrichment analysis performed on these two network modules did not display statistically enriched annotations related to COPD (Supplementary Table 6), whereas SWIM identified the module of switch genes that was statistically enriched in COPD-related pathways, like B cell receptor signaling pathway ( Supplementary Fig. 8).
We then compared this WGCNA analysis on the training set with a recently published WGCNA analysis on the COPD dataset we used as test set 7 , where the authors found that only one module, the cyan one, was most significantly associated with COPD case-control status (FDR < 0.05) and significantly enriched in B cell related processes. We found that driver genes of cyan module were not the same as the driver genes of the brown module from the COPD training set. However, the shared pathways affected in these two modules included NF-κB signaling pathway. We then checked if these two modules were or were not in close proximity in the human interactome. We found that their proximity was significantly higher than the mean of the random distribution (p-value = 0.03) (Supplementary Fig. 9a) and their separation was positive, meaning that the two modules are topologically well-separated in the human interactome (i.e., they do not overlap). The same results were found for the purple module from the COPD training set with respect to the cyan module from the COPD test set ( Supplementary Fig. 9b).

SWIM performance evaluation. To test the SWIM performance, we computed the Receiver Operating
Characteristic (ROC) probability curve and the corresponding Area Under the Curve (AUC) that represents how much a model is capable of distinguishing between classes. Actually, a reliable truth table for COPD specific genes is hampered by variable definitions of COPD, incomplete consideration of past and current smoking status, failure to consider quantitative traits and COPD heterogeneity. To overcome this limitation, we classified genes as COPD-specific if they are annotated for COPD specific-pathways (i.e., pathways that were enriched in the two lists of switch genes from training set and test set) and we used this definition as the "real association" to COPD status for each switch gene: 0 means that the switch gene is not annotated for a COPD-specific pathway, while 1 means that the switch gene is annotated for a COPD-specific pathway and thus it can be considered as COPD-specific gene. Then, we calculated the capability of SWIM to predict COPD-specific genes by considering for each switch gene the number of COPD-GWAS gene falling in its interactors in the correlation network: greater is the number of COPD-GWAS interactors greater is the probability that SWIM does not fail to consider it as COPD-specific gene. According to these criteria, we built a truth table and we calculated the ROC. We found that the AUC is 0.7 both for the training set ( Supplementary Fig. 10a) and the test set ( Supplementary Fig. 10b) switch genes, meaning that there is 70% chance that SWIM will be able to distinguish between positive class (COPD-specific genes) and negative class (genes that are not COPD-specific). (2020) 10:3361 | https://doi.org/10.1038/s41598-020-60228-7 www.nature.com/scientificreports www.nature.com/scientificreports/

Discussion
We analyzed lung tissue gene expression data from two well-characterized COPD case-control populations to study the differences between lung samples from normal subjects (represented by smokers with normal spirometry) and COPD cases. We used one dataset as a "training set" to perform the analysis and the other dataset as a "test set" to validate the results. We built a COPD correlation network, and we exploited the module-centric approach to identify putative COPD molecular determinants that we called "switch genes". COPD is characterized by an inflammatory component persisting even after smoking cessation. This inflammatory status is heterogeneous, but the key inflammatory cells involved are T cells, B cells, neutrophils, and macrophages. Macrophages play a crucial role in orchestrating chronic inflammation in COPD patients and their number markedly increases in both the airways and lung parenchyma 22 . Neutrophils provide powerful proteases and are broadly present in acute exacerbations in the lung airways 34 . The role of B cells and T cells in COPD pathogenesis has growing support from the basic science studies of COPD 7,35-37 . In fact, recent evidence shows a strong positive correlation between the COPD severity and the size and number of B cell-rich lymphoid follicles, as well as between the amount of alveolar destruction and severity of airflow obstruction and the number of T cells 24,37 .
Several aspects of innate and adaptive immune functions are regulated by the transcription factor NF-κB that is a pivotal mediator of inflammatory responses. NF-κB induces the activation of various pro-inflammatory genes and serves as a key regulator of the survival, activation and differentiation of innate immune cells and inflammatory T cells 38 . Therefore, the NF-κB deregulation contributes to the pathogenic processes of several diseases with inflammatory components. Recently, various therapeutic strategies that target the NF-κB signaling pathway have been considered for treatment of inflammatory diseases, such as asthma and COPD 39 . Among the causes responsible for the activation of this pathway, there is the binding of the advanced glycosylation end-products (AGEs) to RAGE, the protein encoded by the gene AGER that is one of the most well-known candidate genes located in a significant COPD GWAS region. RAGE is a membrane receptor, but also has soluble forms (sRAGE) generated mainly by alternative splicing mechanism of the AGER gene. Reduced sRAGE levels are associated with heightened inflammation in various chronic conditions, and they are also associated with increased emphysema and COPD status 20 . sRAGE is one of the most promising biomarkers for emphysema 40 .
In addition to inflammation, our analyses highlighted hypoxia-related pathways. Inflammation shares an interdependent relationship with hypoxia. In fact, oxygen passes from the lung tissue to the blood via the lung alveoli. COPD damages the lungs, and if they get seriously damaged, hypoxia may occur since the blood does not deliver enough oxygen to the alveoli in the lungs. Patients affected by inflammatory diseases show elevated levels of hypoxia-inducible factors (HIF), a transcription factor that is stabilized during conditions of hypoxia, and the activation of HIF1 signaling pathway has been shown to correlate with a decrease of lung function, reduced quality of life and progression of COPD [41][42][43] . Thus, while hypoxia can elicit tissue inflammation, inflammatory disease states are frequently characterized by tissue hypoxia, supporting the hypothesis that hypoxia and inflammation are two sides of the same coin 44 . Besides inducing inflammation, hypoxia causes also the disappearance of caveolae in the epididymal adipose tissue and inhibits the expression of CAVIN1 through HIF1 45 . Caveolae dysfunction is implicated in various pathologies, such as muscular dystrophies and pulmonary hypertension in COPD 45,46 . www.nature.com/scientificreports www.nature.com/scientificreports/ Consistent with all these observations, we found that the COPD correlation network built by SWIM software consists of three well-characterized modules: one populated by switch genes all up-regulated in COPD cases and related to the regulation of immune and inflammatory response; one populated by well-recognized immune signature genes all up-regulated in COPD cases; and one where the GWAS gene AGER and CAVIN1 are the most representative module genes, both down-regulated in COPD cases. Interestingly, 70% of the negative interactors of AGER are switch genes.
Among switch genes involved in NF-κB signaling pathway, we found LY96, BLNK, and SYK ( Supplementary  Fig. 3). In particular: LY96 codes a protein which is associated with toll-like receptor 4 on the cell surface; BLNK codes for an adaptor protein that plays a crucial role in B cell development and activation; SYK encodes for a tyrosine protein kinase that is involved in coupling activated immunoreceptors to downstream signaling events mediating diverse cellular responses, like proliferation, differentiation, and phagocytosis. Among switch genes linked to the NF-κB signaling pathway, we found PRDX4 that is associated with neutrophil activation and degranulation and with I-kappaB phosphorylation that is an important step in the NF-κB activation. Moreover, we found that the gene expression of PRDX4 strongly correlates with the activation of known COPD GWAS interactors SERPINE2, CD79A, and POUF2AF1 [47][48][49][50][51][52][53] .
Among switch genes involved in the inflammatory response and hypoxia, sharing an interdependent relationship 44 , we found TIMP1 and the transcription factor HIF1A (Supplementary Fig. 3). TIMP1 is the tissue inhibitor of metalloproteinase-1 related to airway hyperresponsiveness (AHR) in smokers 43 . AHR is associated with airway inflammation and is a predictor of future risk of COPD among smokers 43 . HIF1A encodes the alpha subunit of hypoxia-inducible factor-1 (HIF1) and has been shown to be an essential regulator of the response to hypoxia 54 . Recent data have also suggested that HIF1A plays a major role in COPD, indicating that its high expression may be associated with decreased lung function and reduced quality of life, contributing to disease progression 41,42 .
Among switch genes related to the regulation of immune response, we found the gene CYBA associated with the nucleotide-binding oligomerization domain-like (NOD-like) receptor signaling pathway. NOD-like receptors are a group of key sensors for lung microbiota and damage and might also indirectly regulate immune responses. Thus, they play a key role in multiple infectious as well as acute and chronic sterile inflammatory diseases, such as pneumonia and COPD 55 . CYBA codes for light chain (alpha subunit) of the cytochrome b protein, which has been proposed as a primary component of the microbicidal oxidase system of phagocytes and shows selective cytoplasmic expression in immune cells.
Furthermore, we found also switch genes involved in other mechanisms beyond chronic inflammation that are implicated in the development and the progression of the COPD, such as cellular senescence, apoptosis, and oxidative stress (Supplementary Fig. 3) 23,24 .
In order to evaluate the predictive power of SWIM tool, we computed the ROC probability curve and the corresponding AUC for the results obtained studying both COPD training and test set. We used the pathways affected in the list of switch genes from the first dataset as predictive for the identification of COPD-specific switch genes from the second dataset, and vice versa. In both cases we found that the AUC is 0.7, meaning that there is 70% chance that SWIM will be able to distinguish between positive class (COPD-specific genes) and negative class (genes that are not COPD-specific).
In order to estimate the capacity of SWIM tool in identifying network disease modules, we compared the results of SWIM analysis on COPD training set with the ones obtained by applying the WGCNA method on the same dataset. The analysis led to the identification of two most significant network modules but, unfortunately, the list of driver genes belonging to these modules was not statistically enriched in pathways known to be related to COPD. On the contrary, the switch genes module identified by SWIM analysis on the same dataset was statistically enriched in COPD-related pathways, like B cell receptor signaling pathway. These findings demonstrated that switch genes identified by SWIM have more biologically meaningful than the driver genes identified by WGCNA. We then compared these results with the ones reported in a recently published paper, where the authors applied the WGCNA method on the COPD dataset we used as test set 7 . We found that the driver genes identified in this paper were not the same as the driver genes we identified by exploiting WGCNA analysis on the COPD training set, but they share common pathways related to inflammation, including NF-κB signaling pathway. This is not surprising since this pathway is common to many lung diseases with an inflammatory component, like ARDS. However, what is actually unexpected is that these two lists of driver genes form two modules in the human interactome that are statistically significantly separated. This is in stark contrast with the results of SWIM analysis, which showed that the two lists of switch genes from the two datasets were in close proximity and overlapped in the human interactome. These finding demonstrated that the WGCNA analysis is less specific than SWIM analysis since it found network modules that distinguish between two datasets of the same disease as they would be associated to different diseases.
In order to demonstrate the disease specificity of switch genes, we compared the results from SWIM analysis obtained on COPD training set with the ones obtained on ARDS, another lung disease with an inflammatory component. Interestingly, we found that ARDS switch genes were different than COPD switch genes, but the major pathways affected in the two lists were similar and include NF-κB and toll-like receptor signaling pathways, regulation of immune and inflammatory response, emphasizing that different diseases often have common underlying mechanisms and share intermediate endophenotypes 56 . We then checked if the list of switch genes from the two different lung diseases were close in the human interactome. We found that their proximity was not statistically significant and their separation was positive, meaning that the two lists of switch genes form two modules that are topologically separated. This suggests that different diseases can share similar endophenotypes, but the network molecular determinants responsible for them are disease-specific. This is in full accord with the fundamental principles of network medicine, where disease proteins are assumed not to be randomly scattered, but agglomerate in specific regions of the molecular interactome, suggesting the existence of specific disease network modules for each disease. In sum, the SWIM analysis of the additional dataset of an inflammatory lung syndrome clearly showed the specificity of our approach able to find modules that distinguish between COPD and ARDS. (2020) 10:3361 | https://doi.org/10.1038/s41598-020-60228-7 www.nature.com/scientificreports www.nature.com/scientificreports/ Limitations and future directions. In this study, we have collected a number of clues, ranging from global to local properties, from purely computational to more biological ones, aiming to draw a sketch of putative underlying mechanisms that could lead to a large-scale transition towards the occurrence of COPD. It is worth noting that the computational approach used in this study is based on correlations that are just "associations" and do not imply necessarily "causal" relationships. Nevertheless, adding further computational and biological information allowed us to zoom-in from global properties (i.e., power laws, fight club hubs, switch genes, etc.) to a small pool of genes that could give the promise for a better understanding of the molecular mechanisms underlying the onset of COPD.
Moreover, SWIM constructs hard-thresholded networks (or binary networks) in order to remove meaningless relationships and thus focus on significant associations between highly correlated nodes. The hard-thresholding approach creates binary networks where sub-threshold inter-node correlations are suppressed (edge values set to 0), and supra-threshold correlations are compressed (edge values set to 1). This approach could in principle lead to a loss of information since small differences in the chosen threshold, or in correlation strength, can result in edges being present or absent in the network. However, this limitation is partially overcome by using stricter thresholds, thus maximizing the contribution from the strongest correlations and emphasizing the network characteristics of nodes falling in the extreme (positive and negative) tails of the correlation distribution.
An efficient solution to the hard-thresholding problem is to build soft-thresholded networks where thresholding is replaced with a continuous mapping of correlation values into edge weights, which has the effect of suppressing rather than removing weaker connections. In the future we hope to improve the SWIM functionalities by proposing different types of "soft" adjacency functions, like a sigmoid or logistic function.
Other limitations of our analysis include the relatively small size of the gene expression datasets and the heterogeneous nature of lung tissue samples. Increasing samples size and using data from specific cell types in future studies will likely improve the identification of COPD molecular determinants.

Conclusions
Our findings demonstrate that switch genes play an active role in inflammatory responses and regulating the immune environment in COPD. Modulating the function of switch genes may be an important mechanism to dampen the hypoxia-promoting inflammatory response and may lead to an improved understanding of COPD pathogenesis.
The majority of the genes highlighted through the SWIM methodology would not have been identified using a traditional GWAS approach. This observation demonstrates how SWIM can aid the identification and the prioritization of novel diagnostic markers or therapeutic candidate genes involved in the etiology of COPD.

Materials and Methods
Datasets. GSE47460 dataset. The first dataset analyzed for the present study is available through the GEO public repository at accession number GSE47460 published on May 30, 2013 17,18 . This dataset includes microarray gene expression profiling obtained from total RNA extracted from whole lung homogenates from subjects undergoing thoracic surgery for clinical indications. These subjects were diagnosed as being controls or having interstitial lung disease (ILD) or chronic obstructive pulmonary disease (COPD). All samples are from the Lung Tissue Research Consortium (LTRC) and derived from two array platforms with a total of 582 samples: 255 have ILD, 219 have COPD, 108 are controls.
In order to compare COPD cases versus control, the gene expression data GSE47460 was analyzed as follows: ILD samples were removed from the two array platforms and then each array-type was Robust Multi-array Average (RMA)-normalized separately 57 . Then, to "stitch together" the data from the two arrays, we first matched genes based on probe-ids (the arrays were quite similar and had many overlapping probes). However, some genes were never measured by the same probe (e.g., IREB2). Therefore, we next matched any remaining genes based on shared gene-id. After creating a single-merged dataset with both array types together, we treated the array-types as "batches" and ran Combat function from R/Bioconductor package SVA to correct for array-specific effects. The probe-sets were mapped to official gene symbols by using BioMart -Ensembl tool (https://www.ensembl.org/). GSE76925 dataset. The second dataset analyzed for the present study is available through the GEO public repository at accession number GSE76925 published on Mar 29, 2017 7 . This dataset collects microarray gene expression profiling of lung or airway tissues from subjects with chronic obstructive pulmonary disease (COPD) by using HumanHT-12 BeadChips (Illumina, San Diego, CA). A total of 111 COPD cases and 40 control smokers with normal lung function were collected; all subjects were ex-smokers. The probe-sets were mapped to official gene symbols by using the platform GPL10558 (Illumina HumanHT-12 V4.0 expression beadchip) available from GEO repository. Multiple probe measurements of a given gene were collapsed into a single gene measurement by considering the mean.
GSE76293 dataset. The acute respiratory distress syndrome (ARDS) dataset is available through the GEO public repository at accession number GSE76293 published on Apr 11, 2016 33 . This dataset collects microarray gene expression profiling of 12 lung samples from ARDS patients and 12 samples from paired (i.e., age and gender-matched) healthy volunteers (HVTs). The probe-sets were mapped to official gene symbols by using the platform GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array) available from GEO repository. Multiple probe measurements of a given gene were collapsed into a single gene measurement by considering the mean.
Human protein-protein interactome. The human protein-protein interactome was downloaded from the Supplementary Data of 31 . The authors of 31 merged 15 commonly used databases with several types of experimental evidences (e.g., binary PPIs from three-dimensional protein structures; literature-curated PPIs identified Differentially expressed genes. To compute the differentially expressed genes (DEGs), we used R statistical software (v 3.4.4) and the package limma (Fig. 10). For each dataset, we fitted a linear regression model (Table 3) to the expression values of each gene (EXP) in order to detect the association with the variable of interest representing the case/control condition (COPD). Microarray batch effects were addressed by using age, sex, and smoking status (i.e., current, ever, never) for GSE47460 dataset and age, sex, race and pack-years of smoking for GSE76925 dataset as clinical phenotypes. For both datasets, two surrogate variables (obtained via the R/ Bioconductor package SVA) were added as further covariates in the linear models ( Table 3). The linear models were fitted by using least squares regression. Then, an empirical Bayes shrinkage method was used by the package limma to obtain a moderated t-test statistic and its p-value. Adjustment for multiple testing were controlled for false discovery rate (FDR) method 19 .
SWiM software. In order to identify switch genes associated with the transition between control smokers and COPD cases, we run SWIM (Fig. 10), a software for gene co-expression network mining developed in MATLAB with a user-friendly Graphical User Interphase (GUI) and freely downloadable 11 .
SWIM builds a correlation network of differentially expressed genes. Generally, a network corresponds to an adjacency matrix = A a [ ] i j , that encodes the connection strength between each pair of nodes. In unweighted and undirected networks, a i j , is equal to 1 if nodes i and j are connected and 0 otherwise. In particular, in unweighted and undirected gene correlation networks, a i j , is equal to 1 if the expression profiles for nodes (i.e., genes) i and j are significantly associated across samples. In order to select significant associations, SWIM uses the absolute value of the Pearson correlation coefficient as similarity index. In other words, a i j , is equal to 1 if the absolute value of the Pearson correlation coefficient between the expression profiles of nodes i and j is greater than a selected significance threshold. For the COPD correlation network, we set the correlation threshold equal to 0.57, which corresponded to the 98 th percentile of the entire correlation distribution. This choice for the correlation threshold stems from two selection criteria (Supplementary Fig. 1). The first criterion is motivated by the observation that most biological networks display a scale-free distribution of node degree. Therefore, the network obtained based on the selected correlation threshold should approximate this topology. Scale-free networks are extremely heterogeneous, and their topology is dominated by a few highly connected nodes (hubs), which link the rest of the less connected nodes. Indeed, the defining property of scale-free networks is that the probability that a node is connected with k other nodes (i.e., the degree distribution P k ( ) of a network) decays as a power law ~α − P k k ( ) . Many biological networks have been shown to be scale-free networks [60][61][62] . For evaluating whether the COPD gene expression correlation network exhibits a scale-free topology, we calculated the square of the correlation between log(P(k)) and log(k), i.e. the index R-squared, as a function of the Pearson correlation ( Supplementary Fig. 1a). Since it is biologically implausible that a network contains more hub genes than non-hub genes, we multiply R-squared with −1 if the slope α of the regression line between P k log( ( )) and k log( ) is positive www.nature.com/scientificreports www.nature.com/scientificreports/ and thus we obtain a signed version of this index. If the R-squared approaches 1, then there is a straight-line relationship between P k log( ( )) and k log( ) and a scale-free topology is reached. These considerations motivated us to choose a correlation threshold that can lead to a net work satisfying scale-free topology at least approximately, e.g. signed R-squared >0.8 12 . The second criterion relies on choosing a threshold that should reflect an appropriate balance between the number of edges and the number of connected components of the network ( Supplementary  Fig. 1b).
Next, SWIM searches for specific topological properties of the correlation network using the date/party/ fight-club hub classification system, based on the Average Pearson Correlation Coefficients (APCCs) between the expression profiles of each hub (i.e., node with degree greater than 5 61 ) and its nearest neighbors. Given a node i and its n i first nearest neighbors, the APCC value is: where ρ x x ( , ) i j is the Pearson correlation between the expression profiles of node i and its j-th nearest neighbor. The authors in 11 , defined: date hubs as hubs with APCC < 0.5 (i.e., low co-expression with their partners); party hubs as hubs with APCC ≥ 0.5 (i.e., high co-expression with their partners); and fight-club hubs as hubs with negative APCC values (i.e., inversely correlated with their partners). In the COPD network, SWIM found 92 fight-club hubs, 489 date hubs, and 795 party hubs.
SWIM then identifies communities in the network by means of the k-means clustering algorithm, employing Sum of Squared Errors (SSE) values to determine the appropriate number of clusters, and assigns a role to each node by using the Guimera-Amaral approach 61 , based on the inter and intra-clusters interactions of each node quantified by the computation of two statistics: the within-module degree z g and the clusterphobic coefficient K π . The two parameters are defined as:σ where k i in is the number of edges of node i to other nodes in its module C i , k i is the total degree (i.e., number of edges emanating from a node) of node i, k C i and C i σ are the average and standard deviation of the total degree distribution of the nodes in the module C i . According to K π and z g values, the plane is divided into seven regions (R1-R7), each defining a specific node role. High z g values correspond to nodes that are hubs within their module (local hubs), while high values of π K identify nodes that interact mainly outside their community, i.e., having much more external than internal links. SWIM colored each node in the plane identified by z g and π K according to its APCC value, thus defining a heat cartography map.
Finally, SWIM extracts a select set of genes, named switch genes, as a special subclass of fight-club hubs falling in the R4 region and thus satisfying the following topological and expression features: (i) not being a hub in their own cluster (z g < 2.5); (ii) having many links outside their own cluster (K π > 0.8); (iii) having a negative average weight of their incident links (APCC <0).
Immune response signatures. Immune cell-related genes were obtained from 25 , where the authors iden- Functional enrichment analysis. The associations between selected genes and functional annotations such as KEGG pathways 63 and GO terms 64 were obtained by using Enrichr 65 web tool. P-values were adjusted with the Benjamini-Hochberg method and a threshold equal to 0.05 was set to identify functional annotations significantly enriched amongst the selected gene lists.

Data availability
Data supporting the findings of this study are available within the article and its supplementary information files.