Introduction

Chronic obstructive pulmonary disease (COPD) is the third leading cause of death in the United States1. In 2015, the economic burden of COPD in the United States through 2020 was estimated to be ~$90 billion2, while in the European Union a recent 2017 release estimated the economic burden to be ~ €48.4 billion3. COPD occurs predominately in smokers; however, only 10–50% of smokers develop COPD4. Current therapy is focused on treating symptoms such as chronic cough and excessive sputum production, and preventing exacerbations. There are no therapies that reduce the progression of disease or mortality. Diagnosis and treatment are complicated because COPD may manifest as multiple phenotypes including airway obstruction, emphysema (destruction of alveoli), and frequent exacerbations. Identifying biomarkers and pathways that distinguish COPD phenotypes can potentially lead to more specific treatments.

In addition to clinical challenges, there remain significant gaps in understanding the pathways and mechanisms involved in the pathogenesis of COPD phenotypes. Moreover, there are inherent challenges in clinical omics-based research that make studying COPD difficult, including (1) complexity of patient stratification across multiple phenotypes5, (2) obtaining suitable numbers of study subjects across phenotypes, (3) lack of validated informatics tools and strategies for large clinical datasets6,7, and (4) limitations within the omics technological platforms. The latter includes challenges in integrating clinical metadata with multiple datasets from functional genomics, proteomics, and/or metabolomics sources; i.e. a true “systems approach”.

Some progress has been made in the fields of functional genomics and metabolomics; however, no discovery-based approach has yet resulted in validated clinical biomarkers. For example, genetic studies previously identified alpha-1 antitrypsin in 1964 as a cause of COPD8; however alpha-1 antitrypsin deficiency accounts for only 1% of COPD cases9. Most recently, proteomic approaches have identified new biomarkers such as plasma sRAGE for the presence and progression of emphysema10,11. Metabolomics approaches have identified potential markers of disease severity or therapeutic candidates such as purines12, sphingolipids13, glycerophospholipids14, and amino acids which differentiate patients with or without emphysema and/or cachexia15. While these omic-centric studies have added to the knowledge base, only limited information is obtained. A systems- or pathways-based approach could further pinpoint potential mechanisms by integrating datasets and thereby increasing statistical power. This approach has been used in plant and bacterial studies and has recently been applied to human disease. For example, integrating functional genomics and metabolomics identified markers associated with poor outcome in human neuroendocrine cancers16, determined predictors of asthma control17, and determined novel therapeutic targets in lung cancer18.

While single ‘omic studies have helped to address the knowledge gap, to date no study has integrated transcriptomics and metabolomics from a COPD cohort to determine if specific pathways are associated with individual COPD outcomes. This could be a useful strategy since functional genomics measures the association between genes, transcripts and clinical phenotype19 and metabolomics provides a readout of the physiological state of the human body20. Therefore, we used metabolomics and transcriptomics to identify markers associated with six COPD health outcomes, then mapped the significant results to pathways for each outcome. The overall goal is to identify dysregulated pathways that distinguish outcomes.

The novelty and strength of our study lies in the uniqueness of the highly phenotyped and matched samples across several stages of COPD severity. Using transcriptomics and metabolomics datasets provides information at a molecular level that is not available when only a single profiling technology is used. Determining how the molecular basis of disease phenotypes vary will improve treatment options and clinical outcomes since unique therapies can be adapted to each patient based on their individual phenotype.

Results

Cohort description

Clinical characteristics are provided in Table 1. Exacerbation frequency is the total number of exacerbations (0–6) in the year prior to enrollment. Exacerbation severity is the total number of severe exacerbations (0–3) in the year prior to enrollment. Age was uniform across Global Initiative for Chronic Obstructive Lung Disease (GOLD) Stages. Gender was balanced across males (n = 79) and females (n = 70). Clinical phenotypes were associated with GOLD stage, smoking pack-years, asthma, and current smoking status.

Table 1 Characteristics of the COPD cohort.

Approximately 10–15% of the current and former smoker population falls into the Preserved Ratio, Intact Spirometry (PRISm)21 category. Studies in COPDGene have found that these subjects are not as healthy as control subjects21 and it is not clear where these patients fit in the spectrum of COPD. We therefore included PRISm subjects in the transcriptomics analyses since patients within that category presented with outcomes such as exacerbations or emphysema. None of the PRISm subjects overlapped with the metabolomics dataset and were therefore not included in the metabolomics analysis.

Individual outcome associations

We evaluated 12,531 transcripts and 2,999 metabolites for associations with COPD outcomes (Fig. 1). The highest number of significant transcript differences was found in FEV1% predicted (1,652) while no transcript differences were associated with exacerbation severity. FEV1/FVC had the most significant metabolites (269) while emphysema and BDR had none. There was minimal overlap in the significant transcripts (Fig. 2A) and metabolites (Fig. 2B) associated with each outcome.

Figure 1
figure 1

Flow chart summarizing the omics methods and analysis. Peripheral blood mononuclear cells (PBMC) (n = 136 human subjects) and matched plasma (n = 131) were prepared and analyzed using functional genomics and metabolomics, respectively. Metabolomics samples were analyzed randomly in triplicate. Regression model fitting in R resulted in a list of statistically significant transcript probes and metabolites associated with each outcome. These significant transcript probes and metabolites were mapped to pathways using IMPaLA. Pathway p-values for metabolites and transcripts were combined by a meta-analysis using Fisher’s method to obtain a single p-value and FDR for each pathway. In the outcome tables, ‘# sig’ refers to the number of statistically significant transcripts probes and metabolites. BMI: body mass index, FEV1: forced expiratory volume in 1 second, FVC: forced vital capacity.

Figure 2
figure 2

Venn diagrams. (A) Edwards’ Venn diagram showing the overlap of significant transcript probes (p ≤ 0.015, FDR ≤ 0.1) across COPD outcomes. (B) Edwards’ Venn diagram showing the overlap of statistically significant metabolites (p ≤ 0.05, FDR ≤ 0.15) across COPD outcomes. (C) Classic Venn diagram showing the overlap of statistically significant pathways (p ≤ 0.05, FDR ≤ 0.2, ≥3 hits). The overlapping pathways are shown in red arrows. The unique pathways are listed in Table 4. %Emphy: % emphysema, Exac Freq: exacerbation frequency, Exac Sever: exacerbation severity, BDR: bronchodilator response, FEV1: Forced expiratory volume in 1 second, FVC: Forced vital capacity, FEV1%Pred: FEV1% predicted.

The top gene transcripts for each outcome based on smallest p-value and FDR are listed in Table 2. ACTG1, a cytoplasmic protein found in all cell types22, was associated with seven transcripts, all with increased expression with increasing exacerbation frequency. For emphysema, the top gene transcripts were all associated with a decrease with worsening emphysema. These genes are associated with apoptosis (BCL10, SUMO2), cancer (CCDC186, RECQL), regulation of dietary iron absorption (CYBRD1), lipid binding (OSBPL8), protein transport (TMCO3), protein binding (TOPORS), and the mitochondria (SLC25A43, TFAM)22. The gene interactions among these significant associations were investigated, with FEV1/FVC having the most interactions (Supplementary Document S1).

Table 2 Significant associations from transcriptomics analysis.

For the metabolomics analysis, only significant compounds that were confidently identified are shown in Table 3. A large number of amino acids were associated with exacerbation frequency while both amino acids and carbohydrates associate with exacerbation severity. Predominantly glycerophospholipids are associated with FEV1/FVC, and several lipid classes associate with FEV1% predicted. Carnitine(C14:2) was increased with increasing exacerbation frequency, acetylcarnitine* was decreased with increasing exacerbation severity, and octanoylcarnitine** was decreased with COPD for FEV1/FVC. The metabolite interactions for the significant associations are shown in Supplementary Document S1. Interactions were present for exacerbations and absent for lung function.

Table 3 Significant compounds in the metabolomics analysis.

Specific metabolites were associated with disease progression based on CT scan and FEV1

Clinical data from quantitative computerized tomography (CT) scans and lung function measurements from each patient was used to quantify disease progression. This clinical data was incorporated with metabolomics to identify small molecule changes associated with disease severity. There were 90 significant compounds in the emphysema progression analysis compared to 142 significant compounds in FEV1 progression analysis. Three compounds overlapped between the two comparisons. These were PC(36:4)* (% emphysema p = 0.0016, FEV1% predicted = 0.034), PE-Cer(d16:1/18:0) (% emphysema p = 0.0272, FEV1% predicted = 0.019), and an unannotated compound at mass 771.5351 Da (% emphysema, p = 0.0215, FEV1% predicted = 0.0489). PC(36:4)* and PE-Cer(d16:1/18:0) are negatively correlated with emphysema progression and positively correlated with FEV1 progression, while the unannotated compound was negatively correlated with both emphysema and FEV1 progression.

Pathway associations

Fifteen pathways were significantly associated with FEV1/FVC, 22 pathways with FEV1% predicted, 7 with exacerbation frequency, and 15 with exacerbation severity (Table 4). The transcript expressions for the significant pathways for BDR were all increased (p ≤ 0.0013, FDR ≤ 0.0954) post-BDR treatment; for phagosome these were ATP6V0B, ATP6V1F, FCGR2A, CYBB, HLA-DMA, RAC1, SCARB1, TUBA1B, TUBA1C, and for lysosome these were ATP6V0B, NPC2, NAGA, GAA, GUSB, HEXA, PSAP.

Table 4 List of significant pathways.

Three pathways were unique to FEV1/FVC (Fig. 2C), with fat digestion and absorption being the most significant. Nine pathways were unique to FEV1% predicted, with T cell receptor signaling as the most significant. There was one unique pathway associated with exacerbation frequency (antigen processing and presentation). There were eight unique pathways related to exacerbation severity, with arginine and proline metabolism as the most significant. Sphingolipid metabolism was significant in exacerbation severity with ganglioside GD3 (d18:0/23:0), lactosylceramide (d18:1/14:0), and the glycosphingolipid Fucα1–4GlcNAcβ1-3Galβ1-3GlcNAcβ1-3Galβ1-4Glcβ-Cer(d18:1/22:0) positively associated (FDR < 0.0001) with disease severity. Sphingolipid metabolism was also significant in the lung function outcomes (FDR < 0.0001), with two glycosphingolipids positively associated with FEV1% predicted (lung size), and three glycosphingolipids and four ceramides positively associated with FEV1/FVC (airflow obstruction). Emphysema was related to two unique pathways (oxidative phosphorylation and mRNA surveillance pathway). There were no unique pathways associated with BDR. Individual pathway entities are available in Supplementary Document S2.

Transcriptome and metabolome pathways

Oxidative phosphorylation was uniquely associated with emphysema and therefore examined more closely. All gene transcripts were decreased (Fig. 3A) and are located in complex I, II, and IV. Likewise, antigen processing and presentation (Fig. 3B) was uniquely associated with exacerbation frequency. This pathway comprises two sub-pathways (MHC I and MHC II), however only MHC I was significantly perturbed with an overall decrease. Eight unique metabolome pathways were associated with exacerbation frequency, two of which are amino acid-related metabolic pathways; arginine and proline metabolism was decreased (Fig. 4A) and glycine, serine and threonine metabolism was increased (Fig. 4B).

Figure 3
figure 3

Transcriptomics pathways. (A) Oxidative phosphorylation in emphysema. (B) Antigen processing and presentation for exacerbation frequency. Blue stars indicate: associated with a decrease with worsening outcome. Purple stars represent both an increase or decrease in expression if multiple transcripts are mapped to a single gene. Pathway image is modified from KEGG85,86,87.

Figure 4
figure 4

Metabolomics pathways. (A) Arginine and proline metabolism in exacerbation severity. (B) Glycine, serine and threonine metabolism in exacerbation severity. Red boxes indicate associated with an increase with worsening outcome and blue boxes indicate associated with a decrease with worsening outcome.

Energy and cellular degradation pathways

Pathways associated with exacerbation severity and exacerbation frequency differed to some extent; therefore, HumanCyc23 was used to determine what was responsible for the differences. Exacerbation severity comprised eight degradation pathways and three energy pathways, while exacerbation frequency comprised three degradation pathways and no perturbed energy pathways (Supplementary Document S3).

Integrated transcriptome/metabolome analysis

Three pathways were examined to determine association between gene transcripts and metabolites. Glycerophospholipid metabolism was based on FEV1% predicted (Fig. 5A). As disease severity increases from GOLD 0-GOLD 4, eight gene transcripts (CDS2, CHKB, CRLS1, DGKA, DGKE, GNPAT, LPIN1, PLA2G6) were decreased (p ≤ 0.015, FDR ≤ 0.1) and five gene transcripts (CEPT1, ETNK1, GPCPD1, LPCAT2, PISD) were increased (p ≤ 0.015, FDR ≤ 0.1). Lipids that decreased with disease severity were LysoPE(16:0)*, DG(36:2)*, PC(36:2)*, PE(P-16:0), PI(41:1) (FDR ≤ 0.05), and LysoPC(20:0)* (p = 0.0018, FDR = 0.15). Conversely, PG(29:2), PS(39:7), and TG(42:0) increased (FDR ≤ 0.05) with disease severity.

Figure 5
figure 5

Integrated transcriptomics and metabolomics pathway diagrams. (A) Glycerophospholipid metabolism in FEV1% predicted. (B) Fat digestion and absorption in FEV1/FVC modified from KEGG85,86,87. FA: fatty acid, BA: bile acid, PA: phosphatidic acid, PL: phospholipids, MAG: monoglycerides, DAG: diglycerides, TAG: triglycerides, CE: cholesterol ester, CL: cholesterol. Red font and red boxes indicate associated with an increase with worsening disease. Blue font and blue boxes indicate associated with a decrease with worsening disease. Black font or uncolored boxes indicate no statistical significance.

Fat digestion and absorption (Fig. 5B) was uniquely significant in FEV1/FVC with an associated increase with disease of triglycerides, cholestrol esters, phospholipids, and bile acids, and a decrease in monoglycerides, diglycerides, and docosahexaenoic acid (DHA) with disease. The SCARB1 gene located in the brush border-membrane, and ABCA1 located in the basolateral membrane, were associated with an increase in expression in COPD subjects compared to healthy controls. The protein encoded by SCARB1 is a plasma membrane receptor for high density lipoprotein cholesterol (HDL)24. Lastly, glycerolipid metabolism, also uniquely significant in FEV1/FVC revealed an increased association with phosphatidic acid and the AGPAT9 gene that encodes it (Supplementary Document S4).

Pathway relationships

To determine whether one or multiple pathways were driving the outcome perturbations, pathways were plotted based on KEGG25. An integrated pathway map (Fig. 6) demonstrates that MAPK and PI3K-AKT signaling pathways, although not statistically significant, were each connected to eight pathways. In comparison, lysosome and T cell receptor signaling pathway each had six connections while purine metabolism and glycine, serine and threonine metabolism each had five connections.

Figure 6
figure 6

Overview of outcome perturbations and pathway relationships. The connections between the pathways were mapped using KEGG. White boxes indicate no statistical significance, while colored boxes indicate significance based on outcome as described in the Legend.

Discussion

Overlaying the significant transcripts and compounds using Venn diagrams allowed us to view the differences amongst the various outcomes in a discernable manner. For example, while a larger overlap between FEV1/FVC and FEV1% predicted might be expected, we observed minimal overlap. This is however not surprising because although these two outcomes are related, especially in a COPD cohort, their measures are different. FEV1/FVC ratio is a measure of obstruction and is required in the COPD diagnosis, while FEV1% is a measure of lung size. Therefore, a patient can have small lungs (FEV1%) and still have no obstruction and occasionally is the reverse (for mild COPD). Also, not surprising is the minimal overlap across all the tested outcomes since COPD is a heterogeneous disease. This approach was important in allowing us to distinguish phenotypes and outcomes.

Overall, a metabolomic signature differentiated the COPD outcomes, which was supported in the transcriptomics analysis for corresponding pathways. Both omics datasets demonstrated that specific COPD phenotypes and outcomes are associated with different transcriptome and metabolome pathways. This supports the premise that distinct COPD phenotypes have distinct mechanisms. There were also differences in the types of omics associations. For example, emphysema appears to have a larger number of gene associations compared to exacerbation severity, which has more metabolic associations. Therefore, we chose to focus the discussion on the most significant and/or unique pathways for each outcome, as a means of narrowing down the large amount of data generated from this study.

We first interpreted exacerbations as a group and observed that for exacerbation frequency, there are no significant differences in the energy pathways and very few associations in the degradation pathways. Conversely, for exacerbation severity, the energy pathways are more dysregulated and accompanied by an increase in the number of degradation pathways, particularly carbohydrate degradation, nucleoside degradation, fatty acid degradation, and amino acid degradation. This is supported by the energy association in degradation pathways; carbohydrate breakdown provides a source of cellular energy for other syntheses such as protein transport, nucleosides act as signaling molecules or provide chemical energy to cells, and fatty acids store energy in the cell26. We then observed that sphingolipid metabolism was significant in exacerbation severity. In our previous studies, we observed that trihexosylceramides, a subclass of sphingolipids, were positively associated with severe exacerbations13.

Interestingly we also observed for exacerbations that while antigen processing and presentation was unique to exacerbation frequency, only the MHC I pathway (leading to CD8 T cell activation) and not the MHC II pathway (leading to CD4 T cell activation) is affected. The activation of CD4 T cells is beneficial to sustaining memory of CD8 T cells following acute infection27. Since COPD exacerbations are associated with viral28,29 and bacterial30,31 infections, the perturbation of CD8 but not CD4 in our study suggests that a crucial connection is not being made, and this may predispose patients to frequent exacerbations as their immune system is weakened. An alternative explanation is that CD4 T cells were activated and CD8 T cells are now destroying the viral or bacterial pathogen in the more frequent exacerbations. Interestingly, an outcome-pathway shift is observed where the perturbed genes within the MHC I pathway for exacerbation frequency end at the T cell receptor signaling pathway which is uniquely perturbed in FEV1% predicted.

We then examined the emphysema phenotype. Oxidative phosphorylation was unique to and significant in the emphysema outcome. Cigarette smoking (CS) is a major cause of emphysematous COPD. CS is known to produce reactive oxygen species (ROS). Excessive production of ROS can lead to oxidative damage, leading to cell apoptosis32. In a recent study, exposure of human lung fibroblasts to various doses of nicotine and e-cigarette condensate inhibited myofibroblast differentiation and inhibited oxidative phosphorylation complex III33. Therefore, in spite of adjusting for smoking status, our results are not unexpected as our cohort is comprised of current and former smokers.

Next, we focused on the BDR outcome. Unlike the other outcomes, BDR was associated with two cytoplasmic pathways, lysosome and phagosome. Lysosomes break down proteins, nucleic acids, carbohydrates and lipids26. Phagosomes fuse with lysosomes allowing the ingested material to be digested; some of the material is recycled to the plasma membrane26. The perturbed transcripts in these pathways showed increased expression and may serve as markers of airflow reversibility.

We then examined FEV1% predicted outcome. Glycerophospholipid metabolism, although significant in many of the other outcomes, was highly significant in this lung function outcome. These lipids play a role in respiratory infections34 and in asthma and COPD35 since they constitute lung surfactant36,37,38. Surfactant, a complex combination of proteins and ~90% lipids39, is secreted by the airway epithelial cells into the airspace to reduce surface tension and provides a barrier to pathogens. In the FEV1% predicted outcome, the lipids classes LysoPE and PG, decreased and increased respectively, and have been previously suggested as markers of lung remodeling40. Their associated genes PLA2G6 and LPCAT2 were also decreased and increased respectively and could serve as targets in future studies to repair disrupted surfactant composition.

Sphingolipids were also highly associated with this outcome and FEV1/FVC. This is congruent with our recent publications that discuss in depth the significance of sphingolipids to COPD pathogenesis, including signaling related to structural cell fate, innate immune responses, and lymphocyte trafficking41,42. Other pathways uniquely and significantly associated with the FEV1/FVC was involved in fat digestion and absorption, typicaly reflecting perturbations between the intestine and blood. SCARB1 may be driving the cholesterol (CL) and cholesterol ester (CE) transport and regulation while the phospholipids (PL) may be driving ABCA1. Since bile acids and triglycerides were increased with disease, and the omega-3 fatty acid DHA was decreased with disease, this pathway suggests a diet or nutrition component associated with disease. This is supported by a recent publication on a cohort of 34,739 women where long-term consumption of fruits was inversely assocated with COPD incidence43. Diet has been suggested to play a role as a risk factor for many chronic diseases including COPD44,45, and suggestions have been made on the role of nutritional supplement therapy in the treatment of COPD46.

Finally, we attempted to deconstruct the results into a pathway map where relations could be observed among the outcomes and pathways. Based on the overview pathway map, MAPK and PI3K-AKT signaling pathways appeared to be the unifying links among many of the perturbed pathways. In a recent study exposing human bronchial epithelial cells to urban particulate matter, airway inflammation was triggered via activation of MAPK and NF-κB signaling pathways47. Although MAPK signaling was not significant in our study, NF-κB signaling was significant in FEV1% predicted. MAPK and PI3K-AKT signaling pathways may therefore be the driving force for downstream events. Researchers have already begun exploring these targets48,49,50,51.

A limitation of our study is the use of plasma and PBMCs as these may not directly translate to the lung environment. However, our previous studies, both published and unpublished, have observed a high degree of congruency for compounds in lung lavage fluid, lung tissue and plasma in both human and animal studies52,53. Also non-fasting samples may affect results as food intake affects the plasma metabolite profile54, We therefore used stringent statistics to filter out many of the diet-associated compounds. Lastly, patients in this study were taking medications, which is unavoidable with this type of cohort. However, we did our best to filter those medications from the final dataset prior to analysis. Future work would incorporate lavage fluid and increase sample numbers to gain more power for discriminating metabolite-transcript-phenotype differences. Targeted quantitative analysis of pathway-specific molecules could provide increased power and more quantitative results. Although these results need to be validated in a larger cohort, they point to therapeutic targets to be tested in animal experiments, and the most significant species from each outcome could serve as non-invasive markers of disease outcomes.

Despite these limitations, we were able to determine unique pathways that can be leveraged for treatment targets. For example, oxidative phosphorylation was decreased with worsening emphysema. It is perturbed due to cigarette smoke exposure33; cigarette smoking is a major cause of emphysema and produces reactive oxygen species, leading to oxidative damage. Interventions such as rotenone, antimycin A, or oligomycin can be used to inhibit oxidative phosphorylation33,55,56, while other small molecule interventions can be used to activate the pathway. Autophagy was highly significant for FEV1/FVC and can be inhibited using, 3-methyladenine57,58. Another pathway that can be leveraged is antigen processing and presentation. This pathway, decreased in exacerbation frequency and is dysregulated in a rat model of COPD59. Therefore, activating these pathways using either small molecules or gene interventions in animal models could be beneficial in developing phenotype-specific COPD treatments to reduce or halt lung damage or viral load. These pathway interventions in mouse models have the potential to lead to phenotype-specific treatment targets. Other follow up studies could include targeted analysis in COPD or smoking cohorts such as COSYCONET60, Birmingham61, SPIROMICS62, or KORA63,64.

Conclusion

Our pilot study has shown that using a pathway enrichment approach to analyze transcriptomics and metabolomics data from matched samples, identified perturbed biological pathways associated with COPD outcomes and identified pathways that distinguish between outcomes. In addition, our results show that the various phenotypes and outcomes have distinct pathways associated with each one, in spite of COPD being a heterogenous disease. Investigators may choose to select the strongest pathways (those with low p-values) for follow up analyses using either mouse intervention studies via knockouts, mouse treatment studies, or using targeted analysis in other COPD or smoking cohorts. The long term clinical applicability of this work would be in the use of blood-based markers where a simple blood panel could identify patient outcomes and indicate patient-targeted treatment. This work could help in the management of COPD.

Methods

Ethics statement

All methods were performed in accordance with the relevant guidelines and regulations. Human subjects were from the COPDGene cohort65 which is a National Institute of Health–sponsored multicenter study of the genetic epidemiology of COPD. COPDGene was approved by the institutional review board at each participating center; all subjects were enrolled from January 2008 to April 2011 and provided written informed consent. The current analysis was approved by the National Jewish Health Institutional Review Board.

Study population

The COPDGene study enrolled 10,192 non-Hispanic White and Black individuals, aged 45–80 years old with at least a 10 pack-year history of smoking, who had not had an exacerbation of COPD for at least the previous 30 days. Additional information on the COPDGene study and the collection of clinical data has been previously described65. There were ten Preserved Ratio, Intact Spirometry (PRISm)21 subjects included in the transcriptomics analysis since these subjects presented with exacerbations or emphysema but are considered an unclassified. Patients who were pregnant or with co-morbidities such as a history of other lung disease (except asthma), active cancer under treatment, and other cardiac hospitalizations, were excluded from the study65. Non-fasting blood was collected from a random sampling of subjects (n = 149) from a single clinical center (National Jewish Health) that represented the entire cohort and used for omics analysis. Peripheral blood mononuclear cells (PBMC) from 136 subjects were used for functional genomics and fresh frozen plasma from 131 subjects were used for metabolomics, with an overlap of 118 subjects for both omics technologies. The cohort characteristics are shown in Table 1.

Clinical data collection and definitions of COPD outcomes

The clinical data included forced expiratory volume in 1 second/forced vital capacity (FEV1/FVC), FEV1 percent (%) predicted, percent (%) emphysema on CT scan10, exacerbation frequency, exacerbation severity, and bronchodilator response (BDR). COPD was defined as post bronchodilator ratio of FEV1/FVC < 0.70. Current or former smokers without spirometry evidence of airflow obstruction (FEV1/FVC ≥ 0.70) were classified as controls66. Subjects with without airflow obstruction but FEV1 < 80% of predicted where classified as (PRISm)21. Emphysema was measured using quantitative high resolution CT as previously described67 and quantified as the percent of lung attenuation lung voxels (LAA) ≤−950 HU on the inspiratory images for the whole lung. Emphysema was further categorized as none (LAA ≤ 5%), mild (5% < LAA ≤ 10%), moderate (10% < LAA ≤ 20%), or severe (20% < LAA). Exacerbations (flare ups) of COPD are characterized by acutely worse cough, sputum, and dyspnea. Only moderate exacerbations (treated by corticosteroids or antibiotics) or severe exacerbations (causing an admission to the hospital) were evaluated. Patients had no exacerbations for at least 30 days prior to sample collection. BDR determined the reversibility of airway obstruction by measuring FEV1 before and after the administration of bronchodilator medication. A significant response was defined as an increase in FEV1 of 12% and 200 mL increase68.

Sample collection

Blood was drawn into a BD Vacutainer Cell Preparation Tube for peripheral blood mononuclear cells (PBMC) and plasma as previously described65. PBMCs were used for gene expression profiling using the Affymetrix Human Genome U133 plus 2.0 Gene Array, (GEO accession number GSE42057)69. Plasma was used for untargeted liquid-chromatography mass spectrometry (LC-MS)-based metabolomics.

Chemicals, standards and reagents

All solvents were LC-MS grade. Water and isopropyl alcohol were purchased from Honeywell Burdick & Jackson (Muskegon, Michigan); chloroform, acetonitrile, methanol, acetic acid, low retention microcentrifuge tubes, serological pipettes were purchased from Fisher Scientific (Fair Lawn, New Jersey); plastic pipette tips were purchased from USA Scientific (Orlando, Florida); methyl tert-butyl ether was purchased from J.T. Baker (Central City, Pennsylvania); internal standards were purchased from Avanti Polar lipids Inc. and Sigma Aldrich (St. Louis, MO); pyrex glass culture tubes were purchased from Corning Incorporated (Corning, New York).

Sample preparation

Blood samples for functional genomics analysis were prepared as previously described69. Briefly, blood was collected from subjects, processed within 30 minutes and peripheral blood mononuclear cells (PBMC) were isolated from the supernatant for RNA isolation as previously described69. Fresh frozen plasma was isolated using P100 tubes as previously described70 and stored at −80 °C until sample preparation for untargeted liquid-chromatography mass spectrometry (LC-MS)-based metabolomics. For LC-MS, 100 µL of each sample underwent protein precipitation using methanol, followed by liquid-liquid extraction using methyl-tert butyl ether as previously described71,72 to obtain an aqueous fraction and a lipid fraction. All samples were prepared randomly to avoid batch effects from phenotype, gender, or GOLD stage.

Liquid chromatography mass spectrometry

The samples from the lipid fraction were analyzed randomly and each sample vial was run randomly in triplicate using an Agilent 1290 series pump with an Agilent Zorbax Rapid Resolution HD (RRHD) SB-C18, 1.8 micron (2.1 × 100 mm) analytical column and an Agilent Zorbax SB-C18, 1.8 micron (2.1 × 5 mm) guard column. The autosampler tray temperature was set at 4 °C, column temperature was set at 60 °C, and the sample injection volume was 4 µL. The flow rate was 0.7 mL/min with the following mobile phases: mobile phase A was water with 0.1% formic acid, and mobile phase B was 60:36:4 isopropyl alcohol:acetonitrile:water with 0.1% formic acid. Gradient elution was as follows: 0–0.5 minutes 30–70% B, 0.5–7.42 minutes 70–100% B, 7.42–9.9 minutes 100% B, 9.9–10.0 minutes 100–30% B, followed by column re-equilibration. The lipid fraction mass spectrometry conditions were as follows: Agilent 6210 Time-of-Flight mass spectrometer (TOF-MS) in positive ionization mode with dual electrospray (ESI) source, mass range 60–1600 m/z, scan rate 2.03, gas temperature 300 °C, gas flow 12.0 L/min, nebulizer 30 psi, skimmer 60 V, capillary voltage 4000 V, fragmentor 120 V, reference masses 121.050873 and 922.009798 (Agilent reference mix).

The samples from the aqueous fraction were analyzed randomly with each sample vial run randomly in triplicate using an Agilent 1200 series pump using a Phenomenex Kinetex HILIC, 2.6 µm, 100 Å (2.1 × 50 mm) analytical column and an Agilent Zorbax Eclipse Plus-C8 5 µm (2.1 × 12.5 mm) narrow bore guard column. The autosampler tray temperature was set at 4 °C, column temperature was set at 20 °C, and the sample injection volume was 1 µL. The flow rate of 0.6 mL/min with the following mobile phases: mobile phase A was 50% ACN with pH 5.8 ammonium acetate, and mobile phase B was 90% ACN with pH 5.8 ammonium acetate. Gradient elution was as follows: 0.2 minutes 100% B, 0.2–2.1 minutes 100–90% B, 2.1–8.6 minutes 90–50% B, 8.6–8.7 minutes 50–0% B, 8.7–14.7 minutes 0% B, 14.7–14.8 minutes 0–100% B, followed by column re-equilibration. The aqueous fraction mass spectrometry conditions were as follows: Agilent 6520 Quadrupole Time-of-Flight mass spectrometer (Q-TOF-MS) in positive ionization mode with ESI source, mass range 50–1700 m/z, scan rate 2.21, gas temperature 300 °C, gas flow 10.0 L/min, nebulizer 30 psi, skimmer 60 V, capillary voltage 4000 V, fragmentor 120 V, reference masses 121.050873 and 922.009798 (Agilent reference mix).

Tandem mass spectrometry (MSMS)

The HILIC and C18 chromatographic methods discussed above were replicated for LC-MS/MS analysis using 10, 20, and 40 eV collision energies on a 6520 Q-TOF-MS (Agilent) with a 500 ms/spectra acquisition time, 4 m/z isolation width, and 1 minute delta retention time.

Metabolomics quality control

Samples were randomly selected from 2000 subjects from the COPDGene cohort comprising both healthy subjects and COPD patients to create aliquots of pooled QC plasma samples. At least 20 of the 131 subjects in this study were included in the pooled sample. Samples were prepared as described above and pooled for use as quality control (QC) samples to monitor instrument reproducibility across multiple days. These pooled QC samples were injected after every five samples, followed by a solvent blank. Total ion chromatograms of all samples were evaluated for retention time reproducibility and intensity overlap. Instrument QC samples were analyzed to ensure that peak areas of spiked internal standards in the plasma samples were reproducible with coefficient of variations ≤10%.

Metabolomics spectral peaks extraction

A final list of compounds was obtained by collapsing features (e.g. ions, adducts) into compounds as follows: Spectral peaks were extracted using the following parameters in MassHunter software B.07 (Agilent Technologies): Find by Molecular Feature algorithm, single charge, proton, sodium, potassium, ammonium adducts in positive ionization mode. Data were imported into Mass Profiler Professional software 14.5 (MPP, Agilent Technologies) for mass (15 ppm) and retention time alignment (0.2 minutes). Data from sample preparation blanks and instrument blanks were background subtracted to eliminate noise from contaminants. Because LC-MS data can result in missing values, data was further processed using the ‘Find by Formula’ algorithm parameters (+H, +Na, +K, +NH4 adducts for positive ionization mode, charge states limited to 2, and absolute height >3000 counts). The ‘Find by Formula’ algorithm merged multiple features such as ions, adducts and dimers into a single compound. The final data set was then re-imported into MPP for metabolite annotation.

Metabolite annotation and identification

For the first round of identification, ID Browser within the Mass Profiler Professional software v14.5 was used to annotate metabolites with putative identifications. This software utilizes an in-house database comprising data from METabolite LINk (METLIN), Human Metabolome Database (HMDB), Kyoto Encyclopedia of Genes and Genomes (KEGG) and Lipid Maps and applies isotope ratios, accurate mass, chemical formulas, and scores to provide preliminary identifications. A database score ≥70 out of a possible 100 was considered acceptable for annotation confidence. Chosen elements for molecular formula generation were: C, H, N, O, S, and P. An error window of ≤10 ppm was used with a neutral mass range up to 2500 Da and positive ions selected as H, Na, K, and NH4. The database identifications were limited to the top 10 best matches based on score, and charge state was limited to a maximum of 2. There were 1,633 out of 2,999 compounds that were putatively matched to a name in a database.

For improved confidence, only the statistically significant compounds were fragmented as described in the Tandem MS section above, and searched using either an in-house mass, retention time, and MSMS library built from purchased standards, or their MSMS fragments were matched to reference standards from the NIST14 and NIST17 MSMS spectral library73. For compounds where their fragments were not present in a reference spectral library, MetFrag74,75 was used for in silico fragmentation. A list of the confirmed and putative IDs is available in Supplementary Documents S5 and S6.

Reporting level of confidence

Within the manuscript and in Table 3, the symbol * indicates that the compound name was confirmed by accurate mass (<10 ppm) and matching MSMS fragments to the NIST spectral library. The symbol **indicates that the compound name was confirmed by accurate mass (<10 ppm), retention time, and MSMS match to purchased standards. These are MSI 1. All other compound names are MSI 2 and are based on accurate mass (<10 ppm) and isotopic abundance and distribution.

Data processing

Metabolomics

Processing of metabolomics data was performed using the R package MSPrep76. Only compounds identified in at least 2 of 3 sample vial injection triplicates were summarized to obtain one measurement per compound per subject, all others were filtered out. A maximum coefficient of variation of 0.5 was set as a threshold for utilizing the mean of the triplicates or the median of the triplicates. This threshold was selected to account for when the extraction software may have missed a peak, thus resulting in one replicate having low to no signal and therefore causing a larger CV across replicates. The data was then filtered for compounds that were found in at least 80 percent of subjects. Missing data was then imputed using Bayesian Principal Component Analysis (BPCA), which uses linear combinations of principal axis vectors to estimate the missing value and is not sensitive to the level of missing data77. The data was then normalized using the Cross-Contribution Compensating Multiple Standard Normalization (CRMN) method to remove unwanted batch variation78. Following filtering, imputation, and normalization the numbers of metabolites decreased from 1,813 to 662 and 6,183 to 2,337 compounds in the aqueous and lipid fractions respectively, for a total of 2,999 compounds. Many of the compounds that were removed from further analysis were attributed to instrument noise, software extraction artefacts, and single hits from individual subjects.

Functional Genomics

Transcript expression was measured using Affymetrix Human Genome U133 plus 2.0 Gene Array (GEO accession number GSE42057; Affymetrix, Santa Clara, CA). Quality control, batch correction, normalization, filtering, and log transformation was performed as previously described69. Following filtering, transcripts were reduced from 54,675 to 12,531. Subsequent statistical analysis on the reduced transcript list is described below.

Statistical analysis

To assess the significance of transcript and metabolite levels in COPD phenotypes, a series of regression models were fit for each outcome and the form of the model depended on the distribution of the outcome. For FEV1% predicted, a linear regression was used. For FEV1/FVC and % emphysema beta regressions were used. Number of exacerbations was modeled as a negative binomial; since many subjects had no exacerbations, a zero-inflation portion was added. Severity of exacerbations used logistic regression. Since BDR was constrained between 0 (no response) and 1 (positive response), logistic regression was used for BDR. For all models, covariates were included. Covariates for FEV1% predicted were age, BMI, gender, smoking status, parents have COPD, and ATS Pack-years. Covariates for FEV1/FVC were age, gender, asthma, and smoking status. Covariates for % emphysema were age, gender, BMI, FEV1% predicted, and smoking status. Covariates for exacerbation frequency and exacerbation severity were FEV1% predicted, gastro esophageal reflux, St. George’s Respiratory Questionnaire (SGRQ) score, and gender. Covariates for BDR were age, gender, smoking status and asthma. Significant differences within the cohort, such as age and smoking, were controlled for in subsequent analyses.

Then, to assess the significance of various metabolites and transcripts, the log-transformed metabolite or transcript levels were added to the models as the independent variable, and the significance of the effect of these metabolite or transcript levels were reported. For the transcripts, p ≤ 0.015 was considered significant with FDR ≤ 0.1 adjustment. For the metabolites, p ≤ 0.05 was considered significant with FDR ≤ 0.15 adjustment. All analyses were performed in R79.

Pathway enrichment analysis

The significant transcript probes were converted to official gene symbols using DAVID Bioinformatics Database version 6.880 ID conversion tool. The significant compounds were converted to KEGG IDs using Chemical Translation Service81. The lists of statistically significant transcripts and metabolites within each COPD outcome were searched using IMPaLA82 to obtain pathways relevant to each outcome. In IMPaLA, the gene transcripts and associated values were uploaded using ‘gene symbol’, the metabolites and associated values were uploaded in the ‘KEGG’ identifier, and pathway over-representation analysis was performed. A meta-analysis using Fisher’s method provided a combined p-value and a corrected p-value for the pathways. The results were filtered as follows: KEGG pathways, p ≤ 0.05, FDR ≤ 0.2 and ≥3 pathway hits. A more lenient FDR ≤ 0.2 cut-off was used since the individual samples within each group for each outcome or phenotype were small. Figure 1 highlights the experimental procedure used to generate results from the two datasets.

Visualization

Venn diagrams were created to show the overlap of statistically significant transcript probes, compounds, and pathways for the COPD outcomes using the publicly available jvenn program83.

Cellular function relationships for each outcome were visualized using the ‘Omics Dashboard’ of HumanCyc23 (https://humancyc.org/dashboard/dashboard-intro.shtml) using ‘Analysis’. The gene symbols for the significant transcript probes and their coefficients were uploaded in tab-delimited format for expression analysis. The KEGG IDs for the significant compounds and their coefficients were uploaded in tab-delimited format for metabolomics analysis. The interactions between the top gene transcripts and top metabolites were explored using ConsensusPathDB (http://consensuspathdb.org/)84 and selecting ‘over-representation analysis’. The selected identifiers were ‘gene symbol’ and ‘ChEBI’ to determine gene interactions and pathways interactions for the gene transcripts and metabolites, respectively. For the transcriptomics data, protein, genetic, biochemical, and gene regulatory interactions were considered; for protein interactions, only binary protein interactions with high and medium confidence were allowed, while complex interactions were ignored. Intermediate nodes were allowed for all interactions. For the metabolomics data, only KEGG pathways with a minimum overlap of 2 and a p-value cut-off of 0.05 was allowed. Finally, outcome and pathway relationships were visualized using KEGG85 to plot pathways for each outcome.