The leading cause of morbidity and mortality for patients with cystic fibrosis (CF) is respiratory failure associated with chronic bacterial infections and inflammation of the airways1. Prevalence of microorganisms varies with age, with Staphylococcus aureus and Pseudomonas aeruginosa being the most common bacterial species found in the airways of children and adults, respectively2. Accurate microbiological profiling of the lower airways and, consequently, prompt antibiotic treatment, are crucial for delaying chronic infections3. This is particularly true for P. aeruginosa, where eradication therapy is initiated whenever it is detected4. Early identification of S. aureus, both methicillin-sensitive (MSSA) and methicillin-resistant (MRSA), is also important, as MRSA is associated with increased risk of death, irrespective of CF disease severity5.

In expectorating patients, sputum is cultured for detection of bacterial pathogens. Most children as well as individuals with mild CF disease do not spontaneously expectorate sputum6,7. In non-expectorating patients, bronchoalveolar lavage (BAL) may be performed to diagnose lower airway infection. However, it is an invasive procedure that requires sedation and cannot be performed frequently. Hence, oropharyngeal (OP) swab cultures are used as surrogate specimens for species identification and antibiotic sensitivity determination of lower airway infections8. The current literature is divided on the utility of OP swabs as a reliable surrogate for BAL. Most studies on children report variable positive predictive values (PPV) ranging from 0.44 to 0.83 for P. aeruginosa and 0.33 to 0.64 for S. aureus9,10,11,12,13.

The measurement of volatile molecules from respiratory specimens, including BAL fluid, sputum and breath has been proposed as a minimally-invasive approach for differentiating P. aeruginosa-positive from P. aeruginosa-negative CF patients14,15,16,17,18,19,20,21,22. Diagnostic volatile molecular ‘suites’ from adult and pediatric patient breath have shown the greatest promise for detecting P. aeruginosa, with sensitivities and specificities approaching 1.00 and 0.88, respectively, in studies ranging from 16–233 subjects17,23. For S. aureus identification, Neerincx et al. showed that volatile molecules from patient breath (age ≥ 6 years) could be used to discriminate between S. aureus-infected CF patients (n = 13) from non-infected CF patients (n = 5) with a sensitivity of 1.00 and a specificity of 0.8024.

To set the groundwork for a breath study targeting biomarker evaluation of both P. aeruginosa and S. aureus, we accessed a heterogeneous set of ex-vivo BAL fluid samples (n = 154) obtained from 13 CF centers in the United States (US), via the Cystic Fibrosis Foundation Therapeutics (CFFT) biorepository. P. aeruginosa and S. aureus, the two most prevalent pathogens associated with CF, were detected by culture in 19% and 32% of samples, respectively. Validation sets were used to establish the sensitivity, specificity, PPV, and negative predictive value (NPV) for each set of putative volatile biomarkers.

Materials and Methods

Study subjects and design

BAL fluid samples stored at the Cystic Fibrosis Foundation Therapeutics (CFFT) Biorepository were accessed after approval from the Committee for the Protection of Human Subjects (CPHS) (STUDY00028597) at Dartmouth College. The stored samples were originally collected from subjects (age range: 2 months –50 years, n = 154) with a confirmed diagnosis of CF from 13 CF centers in the US at the time of clinically-indicated bronchoscopy with BAL fluid collection. Subjects with remnant BAL fluid after clinical testing were eligible to participate. The study was approved by the Institutional Review Board at each site. Written informed consent and HIPAA Authorization were obtained from all subjects ≥ 18 years or from parents or legal guardians of subjects < 18 years. Assent was obtained from subjects between 10–17 years. Clinical data (e.g., age, gender, lung function, body-mass index (BMI), and comorbidities) at the time of bronchoscopy were entered into a secure, web-based electronic database (REDCap).

Specimen collection and processing

Bronchoscopy and BAL fluid collection were performed following each site’s standard clinical procedure with most samples collected via laryngeal mask airway or endotracheal tube, limiting upper airway contamination. Standard BAL fluid culture was performed by the local clinical microbiology laboratory in accordance with Cystic Fibrosis Foundation (CFF) guidelines25 and results recorded. Remnant BAL fluid was frozen neat in 1 mL aliquots within one hour of collection at −70 °C. Research samples collected at participating sites were batch-shipped overnight on dry ice to Children’s Hospital Colorado, United States for storage in the CFFT biorepository specimen bank. For volatile metabolomics analysis, cryovials were shipped overnight on dry ice to Dartmouth College, US and stored at −80 °C. Accompanying de-identified clinical and microbiologic data was shared electronically using encryption.

Chemical measurements and data collection

Five hundred microliters of thawed BAL fluid were transferred to sterile 10 mL glass headspace vials containing a magnetic stir bar, and sealed with a polytetrafluoroethylene/silicone screw cap. Samples were stored at 4 °C prior to analysis by GC × GC-TOFMS. A total of 154 GC × GC chromatograms were analyzed, resulting in the detection of 973 volatile molecules across all samples (see Supplementary Information for details).

Statistical analysis

The aim of the study was to investigate whether suites of volatile molecules discriminate between BAL fluid samples in each of the following comparisons: (1) P. aeruginosa-positive (Pa+) versus no cultured microorganism (NCM), (2) S. aureus-positive (Sa+) versus NCM, (3) culture-positive (Pa+/Sa+) versus NCM, (4) Pa+ versus Sa+, (5) Pa+ versus P. aeruginosa-negative (Pa−), and (6) Sa+ versus S. aureus-negative (Sa−). Clinical microbiology results obtained at the time of bronchoscopy were used as the gold standard. Samples labelled as NCM contained no reported pathogens but could be positive for respiratory flora. Descriptive statistics were calculated for age, gender, genotype, BMI, comorbidities (pancreatic insufficiency and cystic fibrosis-related diabetes (CFRD)), forced vital capacity (FVC) percent predicted, and forced expiratory volume in 1 s (FEV1) percent predicted. To assess confounding demographic variables between study groups, the Mann-Whitney U test26 with Benjamini-Hochberg (BH) correction27 was used to compare continuous variables (age, BMI, FEV1 percent predicted, FVC percent predicted) and Mantel-Haenszel chi-square test28 with BH correction was used to compare categorical variables (gender, genotype, pancreatic insufficiency, CFRD).

Preprocessing of chromatographic data decreases influence of artefacts and allows for biological variation to be visualized. Data were log-transformed and normalised using probabilistic quotient normalization29. Samples were randomly assigned to training sets for all six models. The Random Forest (RF) classification algorithm was performed on the training sets to identify suites of discriminatory volatile biomarkers30. RF creates many de-correlated decision trees from a randomly selected subset of volatile compounds and predicts the sample class assignment. Two-thirds of the samples in the training sets were randomly selected with replacement for each decision tree (with an equal number of samples selected per class) and the remaining one-third were used to calculate the performance of the RF classification model. A total of 1000 trees were built for each round of RF. Discriminatory volatile molecules were ranked based on their mean decrease in accuracy over a 100 independent rounds of RF. Volatile molecules that contributed to decrease in accuracy of >30% in at least 90/100 rounds were chosen as the most discriminatory features. A test set was used to validate the accuracy of selected discriminatory panel of volatile molecules for all models.

For visualization purposes, principal component analysis (PCA) score plot was used to demonstrate the relatedness between all samples in the data. In a PCA score plot, each single point is represented by a sample. Points that lie close to each other have similar volatile molecule profiles, while points that are distant have different properties. ROC curves were plotted to represent performance of the predictive models made by RF. The area under the ROC curve (AUROC) is used to assess predictive performance: a value close to 1 indicates high predictive power of the model, whereas an AUC close to 0.5 means that the model has no predictive power31. Confidence intervals (CIs) for sensitivity and specificity are exact Clopper-Pearson CIs. Confidence intervals for PPV and NPV, were calculated using standard logit CIs. CIs for AUROC were calculated using DeLong method. All statistical analyses were performed in R 3.3.2.

Multivariate analysis of variance (MANOVA) was performed to test the influence of confounders on the profiles of putative discriminatory volatile biomarkers. Of the 38 volatiles identified from all models, the Mann-Whitney U-test with BH correction was used to assess differences in the mean concentration between sample classes.

Putative identification for volatile molecules

In total, 553, 400, and 200 chromatographic peaks were identified in at least one chromatogram of Pa+, Sa+ and NCM, respectively. Putative chemical class and chemical identification for volatile molecules that were present in at least 80% of one study group was based on mass spectral similarity score to a compound in the NIST library of at least 850/1000 and having a molecular weight of at least 60.0 amu. Volatile molecules that passed the above criteria and have been reported in the literature were assigned putative names. Experimentally-determined retention indices (RI) that are consistent with the volatile molecules assigned putative names on the mid-polar Rxi-624Sil stationary phase, were quantified using published median RIs for non-polar and polar columns, and the following equation:

$$\frac{{R}{{I}}_{{experimental}}-{R}{{I}}_{\text{non}{-}\text{polar}}}{{R}{{I}}_{{polar}}-{R}{{I}}_{\text{non}{-}\text{polar}}}\times 100 \% =5-35 \% $$

Retention indices less than 600 (corresponding to C6) or greater than 1600 (corresponding to C16) were not extrapolated.

Pathway identification

For each putative volatile molecule assigned a name, the Kyoto Encyclopedia of Genes and Genomes (KEGG) database was searched for pathways identified in bacteria32.

Data availability

The datasets generated during and/or analysed in the current study are available from the corresponding author on reasonable request.


BAL fluid samples from subjects with cystic fibrosis are polymicrobial and heterogeneous

BAL fluid samples from subjects with CF (n = 154) were cultured for the identification of bacterial and fungal pathogens; 79% (n = 121) were identified as culture-positive and the remaining 21% (n = 33) culture-negative. The largest proportion of samples had only one culturable microorganism (n = 51), while 45, 21, 3, and 1 BAL fluid samples cultured two, three, four, and five microorganisms, respectively (Fig. 1A). Of the 66 unique microbiological profiles identified, 74% (n = 49) were specific to a single patient. Organisms cultured from at least 20% of BAL fluid samples were S. aureus (40%), followed by Aspergillus fumigatus (27%) and P. aeruginosa (25%) (Fig. 1B).

Figure 1
figure 1

Microbiological results from BAL fluid samples of patients with cystic fibrosis (n = 121; samples with no cultured microorganisms (NCM) excluded for clarity). (A) Colour matrix of the microbiological profiles observed among the samples; (red) bacteria (excluding nontuberculous mycobacteria (NTM)), (yellow) fungi, (green) NTM. (B) Bar plot of the most prevalent microorganisms in the samples (for organisms present in ≥1% samples).

S. aureus and P. aeruginosa represent the two most prevalent pathogens in the setting of CF, therefore BAL fluid samples that cultured either S. aureus (Sa+) or P. aeruginosa (Pa+) were selected for further analysis. BAL fluid samples containing no cultured microorganism (NCM), P. aeruginosa-negative (Pa−) and S. aureus-negative (Sa−) were used as comparison groups. For statistical analysis, we excluded data from samples that cultured both S. aureus and P. aeruginosa (n = 9) as well as 19 samples with incomplete clinical or demographic information. Age and BMI were significantly different in our Pa+ versus NCM and Pa+ versus Pa− model and could therefore confound our analysis (Table 1). We did not include medication in our analysis because the combination of drugs used by patients in our study groups exceeds the number of subjects in each group.

Table 1 Demographic and clinical characteristics of subjects in the study groups.

Volatile molecular profiles of study groups

In total, 973 peaks were identified in the headspace of the BAL fluid samples (Supplementary Figure S1). After removal of chromatographic artefacts and known contaminants (e.g., siloxanes, phthalates, atmospheric gasses), 805 peaks were selected for further analysis and putative chemical class identifications were assigned to 60 molecules covering a variety of chemical classes. Hydrocarbons were the most abundant chemical class in each group (Pa+ = 11, Sa+ = 9, NCM = 9), followed by ketones (Pa+ = 7, Sa+ = 6, NCM = 2) and alcohols (Pa+ = 5, Sa+ = 4, NCM = 3) (Fig. 2).

Figure 2
figure 2

Putative chemical class identifications of volatile molecules present in BAL fluid samples of Pa+, Sa+, and NCM groups.

Identifying P. aeruginosa and S. aureus from BAL fluid samples using suites of volatile biomarkers

Starting from a basis of 805 peaks, the RF classification algorithm30 was used to identify suites of discriminatory volatile molecules for all models. Each model was validated using a test set, and sensitivity, specificity, PPV, NPV and AUROC values were calculated (Table 2). First, we evaluated the diagnostic accuracy of volatile molecules in discriminating culture-positive from culture-negative samples. Using a set of six, seven, and nine discriminatory volatile molecules selected from training sets for Pa+ versus NCM, Sa+ versus NCM, and Pa+/Sa+ versus NCM, respectively, AUROC values of 0.67 (95% CI 0.39–0.89), 0.81 (95% CI 0.74–0.99), and 0.70 (95% CI 0.62–0.89) were obtained on respective test set samples (Fig. 3A–E). Our Pa+/Sa+ versus NCM model had a sensitivity of 0.79 and PPV of 0.81 for the presence of either P. aeruginosa or S. aureus. In addition, we tested models for Pa+ versus Pa−, which yielded an AUROC of 0.86 (95% CI 0.71–1.00) and NPV of 0.92 using nine volatile molecules, and models for Sa+ vs Sa−, which yielded an AUROC of 0.88 (0.79–1.00) and a PPV of 0.86 using eight volatile molecules.

Table 2 Model error rate, sensitivity, specificity, PPV, NPV, AUROC (95% CI) for volatile molecules from BAL fluid as diagnostic for Pa+ versus NCM, Sa+ versus NCM, Pa+/Sa+ versus NCM, Pa+ versus Sa+, Sa+ versus Pa+, Pa+ versus Pa− and Sa+ versus Sa−.
Figure 3
figure 3

Receiver operating characteristic (ROC) curves on test set samples of (A) Pa+ (n = 8) versus NCM (n = 15) using a panel of six volatile molecules, (B) Sa+ (n = 15) versus NCM (n = 15) using seven volatile molecules, (C) Pa+/Sa+ (n = 23) versus NCM (n = 15) using nine volatile molecules, (D) Pa+ (n = 7) versus Pa− (n = 53) using nine volatile molecules, (E) Sa+ (n = 15) versus Sa− (n = 45) using eight volatile molecules and (F) Pa+ (n  = 8) versus Sa+ (n  = 15) using 11 volatile molecules. (Dotted line indicates random classification).

To determine whether volatile molecules can differentiate between polymicrobial P. aeruginosa-positive and S. aureus-positive samples, the Pa+ versus Sa+ model was built and validated on a test group of samples using a set of 11 discriminatory volatile molecules. The model had a specificity and NPV of 0.81 and 0.72, respectively for detecting P. aeruginosa, sensitivity and PPV of 0.67 and 0.66, respectively for detecting S. aureus and an AUROC of 0.79 (95% CI 0.67–0.95) (Fig. 3F). PCA plot was used to visualize the clustering of test set samples using discriminatory volatiles for the Pa+ versus Sa+ model, with labels indicating the complete clinical microbiology profiles (Fig. 4). Samples that cultured only S. aureus (profile #8, n = 8) clustered together, while samples that cultured S. aureus with additional Gram-negative microorganisms (profiles #9, #14, n = 2), or S. aureus along with Aspergillus spp. (profiles #10, #11, #12, n = 4) clustered closer to P. aeruginosa samples.

Figure 4
figure 4

Three-dimensional principal component scores plot on test set samples of Pa+ (diamond, n = 8) versus Sa+ (plus, n = 15) with complete microbiology profiles.

In pediatric patients, S. aureus pre-colonization has been shown to be a risk factor for initial P. aeruginosa airway infection33. Hence, it is important to assess the utility of diagnostic biomarkers in S. aureus and P. aeruginosa co-infected samples. Using the panel of our 28 Pa+ biomarkers identified from all our models above, we tested whether samples that were co-infected with P. aeruginosa and S. aureus (n = 9) could be identified as Pa+. We added these samples in our test set of Pa+ versus Pa− analysis and achieved a sensitivity and specificity of 0.74 and 0.89, suggesting that our reported suite has the potential to rule in P. aeruginosa in polymicrobial samples.

Putative identifications and relative abundance of the 38 discriminatory volatile molecules from all six models were determined using mass spectrometric and chromatographic data (Table 3, Supplementary Figures S2S7). MANOVA was used to ensure that the volatile profile was not confounded by clinical and demographic characteristics of CF patients in Table 1. None of the characteristics significantly influenced the models (Supplementary Table S1). Six discriminatory volatile molecules had been previously reported and were given putative names; 2-butanone, 2-methyl-2-butanol, 1-butanol, ethyl acetate, 2-butanol and 3-methyl-2-butanone. Two out of the six volatile molecules were associated with metabolic pathways in the KEGG database; 2-butanone and 1-butanol (Table 4).

Table 3 Putative identification of 38 discriminatory volatile molecules for all six models.
Table 4 Previously reported P. aeruginosa– and S. aureus–-associated volatile molecules.


This is the first study to examine the diagnostic potential of a suite of volatile molecules in identifying the presence of two critical lung pathogens for patients with cystic fibrosis, P. aeruginosa and S. aureus. The molecules identified in the headspace of BAL fluid were similar in chemical class composition to previously reported culture and exhaled breath studies34,35,36,37,38. The range of models presented for the detection of each pathogen were validated using test sets and yield a set of putative discriminatory biomarkers for translation and evaluation in future breath studies.

In non-expectorating subjects, clinicians use OP swab culture as a surrogate for BAL fluid. Jung et al. reported a PPV and NPV of 0.83 and 0.74 respectively, for detecting P. aeruginosa using OP swab culture and a PPV and NPV of 1.00 and 0.94, respectively, using sputum culture in 38 patients13. Seidler et al. reported that in expectorating adults (n = 20), OP swab and sputum cultures both had a PPV of 1.00 for detecting P. aeruginosa and NPV of 0.50 and 0.60, respectively39. In this work, 28 putative volatile biomarkers were selected across all our models for the identification of P. aeruginosa in BAL fluid samples yielding a PPV range of 0.25–0.81 and an NPV range of 0.56–0.92. Our Pa+ versus Pa− model generated a specificity of 0.88 and an NPV of 0.92, suggesting that volatile molecules from the headspace of BAL fluid could be useful to screen patients and ‘rule in’ P. aeruginosa infections. Clinical best practice, as defined by the CF Foundation Pulmonary Guidelines, is to routinely surveil CF patients for new or recurring P. aeruginosa infections and initiate antibiotic eradication therapy at first detection, in an effort to prevent chronic infection and preserve lung function4. Ruling in P. aeruginosa is important as presence of the bacterium in the lung is significantly correlated with lower FEV1% predicted, increased serum C-reactive protein, and increased neutrophil elastase in sputum40.

Identification of S. aureus is also important, as methicillin-resistant S. aureus (MRSA) is associated with worse survival in CF patients5. Seidler et al. reported an NPV of 1.00 for detecting S. aureus in OP swab and sputum cultures versus BAL fluid and a PPV of 0.41 for OP swab and 0.57 for sputum versus BAL fluid39. Neericnx et al. demonstrated that a combination of nine volatile molecules in breath were able to discriminate between CF patients infected with S. aureus (n = 13) and uninfected CF patients (n = 5) with a sensitivity of 1.00 and a specificity of 0.8024. We identified 30 volatile molecules across all of our S. aureus models which yielded a PPV range of 0.60–0.86 and an NPV range of 0.37–0.70, respectively. Our Sa+ versus Sa− model generated a sensitivity of 0.80 and a PPV of 0.86, indicating that volatile molecules from BAL fluid samples could be used to ‘rule out’ S. aureus. The determination of S. aureus infection via volatile molecule panels is a promising avenue of research, however, the development of reliable exhaled breath diagnostics for S. aureus will require the use of paired representative samples from the lower and upper airways of subjects with CF to rule out false positives, as S. aureus is a colonizer of the upper airways of persons with CF41.

Six of the 38 discriminatory volatile molecules have been previously identified as bacterially-derived in volatile metabolomics studies (Table 4). Of these, the KEGG database identified microbial pathways associated with the metabolism of 2-butanone and 1-butanol. Of interest was the release of 2-butanone as a secondary metabolite during a reaction that also results in the formation of hydrogen cyanide (HCN). Hydrogen cyanide is an extensively studied biomarker for P. aeruginosa identification19,20,21 and more recently also shown to be produced by S. aureus42. We do not report on HCN (27.0 amu) as we did not collect data on ions with a m/z < 30. 2-butanone has also been reported as a biomarker for S. aureus detection in the breath of CF patients24, indicating the potential translation of BAL fluid volatile biomarkers to patient breath. KEGG pathways associated with 1-butanol indicate that it is produced as a by-product of the reversible oxidation of butane and butanal, a metabolic pathway that has been reported in P. butanavora43. 1-butanol has also been reported as a putative biomarker in the breath of CF patients with P. aeruginosa infection17. In addition, production of 1-butanol via the glycolytic fermentation of pyruvate has been reported for S. aureus44,45. These findings point towards possible microbial origins of 2-butanone and 1-butanol in BAL fluid samples.

Although culture remains the gold standard for clinical diagnosis, molecular-based approaches can identify additional microbial species that were not cultured, some of which may affect clinical outcomes in CF patients46,47,48. To the best of our knowledge, no volatile metabolomics study to-date has compared culture-dependent and culture-independent techniques for the purpose of organism identification. For our Pa+ vs Sa+ model, we compared the bacterial classifications based on 16 S ribosomal RNA (rRNA) sequencing (see supplementary information for details) and culture data. Our results show that samples with concordant microbiological results (average predicted probabilities; Pa+ = 0.70, Sa+ = 0.75) showed higher predicted probabilities than samples that had discordant microbiological findings (average predicted probabilities; Pa+ = 0.35, Sa+ = 0.30) (Supplementary Figure S8). Although the predicted probability of samples with concordant microbiology did not change, indicating the model was robust, the choice of reference sample for model generation in the field of volatile metabolomics is an unexplored issue. Future studies for P. aeruginosa and S. aureus volatile biomarkers should consider incorporating 16 S rRNA or quantitative polymerase chain reaction (qPCR) in order to determine sample status for predictive modeling.

CF lung infections are polymicrobial and our data shows that suites of volatile molecules from BAL fluid produced a predictive signal. We recognize that we have not fully investigated the impact of multiple pathogens on our reported volatile biomarker panel. This was beyond the scope of the study and shall be explored as a future avenue. As a first pass to address this question, we used our 28 Pa+ biomarkers to test whether the samples that were co-infected with P. aeruginosa and S. aureus could be identified as Pa+. Our sensitivity and specificity were 0.74 and 0.89 when these samples were included in the test set of Pa+ versus Pa− model, demonstrating the utility of the volatile biomarkers to rule in P. aeruginosa in polymicrobial samples. In addition, we acknowledge that using the BAL technique to obtain diagnostic samples is invasive, expensive, and thus cannot be performed frequently in a clinical setting. We do not propose the use of volatile molecule suites from BAL fluid as a rapid or non-invasive diagnostic in clinical labs. Furthermore, the PPV and NPV values for our study might not be reflective of the actual prevalence of P. aeruginosa and S. aureus in all CF populations. However, the proposed suite of biomarkers in this study can now serve as a basis for designing well-powered breath studies in the CF patient population, an avenue we are currently exploring. Furthermore, we expect that complementing the culture results with culture-independent data might have the potential to identify additional, clinically relevant microbial species, some of which may affect model generation and interpretation46,47,48. We conclude that volatile molecules from BAL fluid can provide discriminatory power for ruling in P. aeruginosa and ruling out S. aureus and plan to extend this work to exhaled breath in CF patients.