Chronic obstructive pulmonary disease (COPD) is a predominantly smoking-related lung disease, characterized by airflow obstruction, local and systemic inflammatory responses, and phenotypic heterogeneity. Inflammation can persist after smoking cessation for reasons not fully understood, and patients vary in disease development, progression and other outcomes. Bacterial colonization of the airways in COPD has long been recognized from culture-based studies, but its role in the pathogenesis or progression of COPD remains unclear.

Recent studies, using culture-independent sequencing to profile microbial communities more comprehensively, have reported alterations in airway bacterial composition mostly in more severe COPD1,2,3,4,5. Less is known about the lung microbiome in milder COPD or in smokers without evidence of airflow obstruction and, in particular, whether differences in lung microbiota associate with clinical or biological features in earlier stages of COPD development. In the National Heart, Lung and Blood Institute (NHLBI) Subpopulations and Intermediate Outcome Measures in COPD study (SPIROMICS)6, 50% of current or former smokers without airflow obstruction reported significant respiratory symptoms, coupled with activity limitation and more exacerbation-like events7. These findings indicate a need to determine whether changes in the lung microbiome may contribute to clinical outcomes in milder COPD or in those who do not meet current spirometric criteria for the diagnosis.

Few studies of COPD have examined the lung microbiome sampled by bronchoscopy8,9,10,11,12,13, an invasive approach that may provide unique insights into the potential role of lung microbiota in COPD pathogenesis. Using bronchoalveolar lavage fluid (BAL) collected from subjects in the well-characterized SPIROMICS cohort6,14, we explored the hypotheses that differences in lung bacterial composition are associated with a diagnosis of COPD and/or with clinical features that reflect COPD pathophysiology, including measures of lung function and symptom burden.


Cohort characteristics

Characteristics of the 181 subjects in this study are summarized in Table 1. Those with COPD were older and, as expected, had lower lung function and greater symptom burden [higher scores on the COPD Assessment Test (CAT) and St. George’s Respiratory Questionnaire (SGRQ)]. Subjects with COPD also displayed a greater lung function response to an inhaled bronchodilator, as measured by change in both FVC (forced vital capacity) and FEV1 (forced expiratory volume in one second); these and other lung function measurements, except for FEF25–75, did not differ significantly from measurements at the baseline SPIROMICS visit (Supplementary Table 1). Current smoking was associated with differences in CAT and SGRQ scores, but not with differences in lung function measures.

Table 1 Baseline characteristics of subjects.

Lung bacterial composition is associated with specific measures of lung function, symptom burden and functional impairment

First, to examine whether variation between subjects in their overall lung bacterial composition (β-diversity) associated with COPD status and related clinical characteristics, we performed distance-based PERMANOVA using Bray-Curtis and weighted Unifrac distance measures, using data from all 181 subjects. This analysis identified significant associations between variation in lung microbiota composition and specific measures of lung function, symptom burden, and functional impairment (PERMANOVA p < 0.05; Table 2 and Supplementary Table 2). Although pre-bronchodilator measures of FEV1 and FVC were not associated with lung microbiota variation (not shown), measures of post-bronchodilator response (BDR), in particular change in FVC, were. FEF25–75 (forced expiratory flow rate between 25 and 75% of FVC) and PEFR (peak expiratory flow rate) also were associated with lung microbiota variation, as were CAT and SGRQ scores and baseline six-minute walk distance. This analysis did not identify associations between lung microbiota variation and a diagnosis of COPD, age, sex, current smoking, or inhaled corticosteroid use, nor with study center or season in which the bronchoscopy was performed. Collectively, these initial findings suggested that differences between subjects in their overall lung bacterial composition were more strongly linked to specific measures of airways dysfunction and related symptom burden, rather than a diagnosis of COPD per se.

Table 2 Clinical features associated with differences in lung bacterial community structure (N = 181 subjects).

We next applied principal component analysis (PCA) to examine further the relationships between these associated clinical features and lung microbiota variation between subjects. This analysis demonstrated contrasting gradients of bacterial compositional difference related to BDR and CAT scores versus FEF25–75 and PEFR (Fig. 1A; R function envfit). CAT scores followed the same compositional gradient as BDR, and independently CAT score and FVC_BDR were correlated (Spearman R; rs = 0.35, p < 0.001).

Fig. 1: Variation in lung bacterial community composition is associated with specific measures of lung function and symptoms.

a Principal component analysis (PCA) of the overall variation in BAL bacterial composition and relationships to the associated clinical measures for the entire cohort (N = 181). Contrasting relationships exist between lung microbiota and measures of airflow limitation (FEF25–75 and peak expiratory flow rate, PEFR) compared to those for measures of bronchodilator response and symptoms (COPD Assessment Test, CAT). PCA was based on Hellinger transformation of OTU abundance counts. Vectors indicate direction and magnitude of the linear relationship between a clinical variable and the gradient of bacterial composition shown by PCA (R function envfit). b, c Correlations of within-sample bacterial diversity (inverse Simpson index) with FVC bronchodilator response and FEF25–75 (Spearman r = −0.26, p = 0.005 and r = 0.17, p = 0.02, respectively).

We also determined within-sample measures of bacterial diversity (inverse Simpson index). These correlated negatively with FVC_BDR (rs = −0.26, p = 0.005) but positively with FEF25–75 (rs = 0.17, p = 0.02; Fig. 1B, C), which supports the converse relationships of these variables observed by PCA. This diversity measure did not differ by current smoking status, consistent with our finding reported above that current smoking was not associated with inter-subject variation in lung bacterial community structure.
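For readers less familiar with this diversity measure, the inverse Simpson index of a sample is 1 / Σ p_i², where p_i is the proportion of reads assigned to taxon i. The following is a minimal Python sketch of that formula, not the authors' code (their analysis was done in R); the function name and the example counts are hypothetical.

```python
def inverse_simpson(counts):
    """Return the inverse Simpson index, 1 / sum(p_i^2), for one sample."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no reads")
    # Convert counts to proportions; taxa with zero reads contribute nothing.
    sum_of_squares = sum((c / total) ** 2 for c in counts)
    return 1.0 / sum_of_squares


# Hypothetical read counts for two samples with four taxa each.
even_sample = inverse_simpson([25, 25, 25, 25])      # four equally abundant taxa
dominated_sample = inverse_simpson([97, 1, 1, 1])    # one taxon dominates
```

The index equals the number of taxa when all are equally abundant and approaches 1 as a single taxon comes to dominate, so lower values indicate a less even community.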

Analysis of ever-smokers with or without mild-moderate COPD

A crucial longstanding question is why only some smokers develop COPD. Exploiting the advantage that our cohort consisted primarily of ever-smokers (current or former) without COPD or with mild-moderate COPD, we next focused on these two groups to explore the potential role of differences in the lung microbiota in the disease process. We again observed associations between variation in lung bacterial composition and measures of BDR and 6-min walk distance (Unifrac distance-based PERMANOVA; r2 = 0.02 for both measures, p = 0.019 and p = 0.007, respectively), but not with FEV1/FVC ratio, FEV1 % predicted or COPD status itself. BDR measures were greater in the mild-moderate COPD group compared to ever-smokers without COPD (Wilcoxon rank-sum test, p < 0.001 for change in both FEV1 and FVC), while CAT scores were marginally higher in the COPD group (Wilcoxon rank-sum test, p = 0.04). PCA again demonstrated contrasting gradients of lung bacterial composition associated with BDR versus other lung function measures, including PEFR and related parameters that reflect airflow obstruction (Fig. 2, blue arrows; Unifrac distance; R envfit). As before, similar directional gradients of bacterial composition were observed for FVC_BDR and CAT score, which remained correlated with each other (rs = 0.30, p < 0.001). This directionality contrasted with the gradients associated with FEV1/FVC, FEV1, FEF25–75, and PEFR. Biplot analysis indicated that members of Streptococcus, Staphylococcus, Pseudomonas, Prevotella, and other genera were key contributors to the variation in lung bacterial composition (Fig. 2, red arrows) among these ever-smokers without or with mild-moderate COPD.

Fig. 2: Differential relationships between lung bacterial composition, specific lung function measures, and symptoms are maintained among ever-smokers without or with mild-moderate COPD.

Principal coordinate analysis (PCoA) of lung bacterial community variation (weighted Unifrac distance) among ever-smokers without (gray circles) or with (gray Xs) mild-moderate COPD and relationships to the clinical variables shown (blue arrows; R envfit). An overlaid biplot analysis shows the microbiota members (red arrows) driving the observed variation in lung bacterial community structure amongst these ever-smokers.

We next examined each principal component axis to dissect further which particular bacteria and clinical measurements together contributed to the variation in bacterial community structure amongst subjects. For axis 1 (23.5% of the total variation), principal component (PC) scores were significantly correlated with two measures of FVC_BDR (% change in FVC, R = 0.21, p = 0.01; FVC volume response in milliliters, R = 0.19, p = 0.019; Fig. 3A), and with six-minute walk distance (R = –0.24, p = 0.003; Fig. 3B). Bacterial taxa contributing to PC1 scores included members of Pseudomonas (OTU0007, R = 0.33, q < 0.001; OTU0025, R = 0.22, q = 0.055), Streptococcus (OTU0005, R = 0.21, q = 0.06; OTU0016, R = 0.20, q = 0.097), Staphylococcus (OTU0012, R = 0.21, q < 0.001), and multiple Prevotella (OTU0003, OTU0004, OTU0014; R = −0.46 to −0.59; q < 0.001).

Fig. 3: Post-bronchodilator FVC response and 6-min walk distance correlate with variation in lung bacterial composition along principal component 1, which explained 23.5% of the total variation in lung bacterial community structure among ever-smokers.

Each dot represents an individual BAL sample and is colored according to its PC score. OTUs contributing significantly (q-value < 0.10) to PC1 are shown in the accompanying table, ranked by Pearson’s correlation coefficient. The relative abundances of Prevotella, Pseudomonas, and Streptococcus are primary drivers of PC1 variation. a Post-bronchodilator FVC response measured by change in percent and volume in mL. b Six-minute walk distance (meters).

Axis 2, which explained an additional 11.7% of lung bacterial community variation, was associated primarily with measures of PEFR (R = 0.20, p = 0.014). Five taxa were significantly correlated with PC2 scores (q-value <0.10). All were negative correlations, the strongest being with three Streptococcus members (OTU0005, R = −0.73, q < 0.001; OTU0016, R = −0.33, q = 0.002; OTU0019, R = −0.25, q = 0.049). Streptococcus relative abundance also displayed correlative relationships with PEFR, CAT score, as well as FEV1 and FEV1/FVC. OTU0005 had the strongest negative relationship with PC2, associated with lower PEFR, and trended weakly with higher CAT score (Fig. 4) and lower FEV1 (R = –0.14; p = 0.095). When the combined relative abundance of these Streptococci (OTUs 0005, 0016 and 0019) was considered, a weak negative trend with FEV1/FVC also was noted (R = −0.16, p = 0.081). Altogether these findings suggest that a collection of lung microbiota members contribute to the dysbiosis associated with airways dysfunction and symptoms observed in these ever-smokers.

Fig. 4: Streptococcus, PEFR and CAT score in ever-smokers with or without COPD.

The relative abundance of Streptococcus OTU0005, either alone (top panels) or together with other Streptococcus OTUs (bottom panels), correlates negatively with peak expiratory flow rate (PEFR) and positively with COPD Assessment Test (CAT) score.

To complement these analyses, we examined if individual bacterial taxa were differentially abundant between the two groups of ever-smokers and/or associated with lung function measures within each group. Three specific taxa were differentially abundant by DESeq analysis between the ever-smokers without COPD and those with mild-moderate COPD: Streptococcus (OTU0005; log2FC = 2.05, q < 0.001), an unclassified Lactobacillales (OTU0042; log2FC = 1.47, q = 0.035), and Veillonella (OTU0024; log2FC = −1.41, q = 0.079; Fig. 5 and Supplementary Table 3). Among ever-smokers without COPD, the abundances of Staphylococcus (OTU0012), Prevotella (OTU0014), Streptococcus (OTU0016, OTU0019), and Veillonella (OTU0024) were negatively associated with lung function by FEV1/FVC ratio, FEV1 % predicted and/or FEF25–75. Of these taxa, Streptococcus OTU0016 and OTU0019 also negatively associated with lung function in the mild-moderate COPD group (FEV1/FVC ratio), along with Streptococcus OTU0005 (FEV1) and Gemella (OTU0044; FEV1/FVC ratio, FEV1 and FEF25–75). Taxa associated with greater bronchodilator response in either group included three Streptococcus taxa (OTU0005, OTU0016, and OTU0019) and a Staphylococcus (OTU0012). We also explored differential abundance relationships with leukocyte populations in the BAL (Supplementary Table 4), which were determined by flow cytometry as previously described15. Significant relationships were observed predominantly with the percentages of BAL neutrophils, wherein only Streptococcus OTU0019 and an unclassified member of the Pasteurellaceae family (OTU0006) were positively associated with BAL % neutrophils in the mild-moderate COPD group.

Fig. 5: Heatmap summarizing results of DESeq analysis showing lung bacterial taxa significantly associated (q-value < 0.10) with measures of lung function, symptoms (CAT score), and percentage of BAL neutrophils.

Further information is provided in Supplementary Table 3.

Lung bacterial burden and cultivation experiments

Finally, we examined relationships between lung bacterial burden (log-16S rRNA copy numbers as proxy) and lung bacterial composition in BAL samples from all subjects in this study. Bacterial burden positively correlated with phylogenetic diversity of samples (Faith_PD; rs = 0.30, p < 0.001). Microbiota contributing to this relationship included members of Veillonella (rs = 0.48), Prevotella (rs = 0.28-0.37) and Streptococcus (rs = 0.37) (all padj < 0.001). In contrast, the relative abundances of two Pseudomonas members and a Staphylococcus negatively correlated with bacterial burden (rs = −0.21 to −0.26, padj < 0.001). We noted slightly higher bacterial burden in current smokers compared to subjects not currently smoking (median log-16S copies 4.6 ± 0.6 vs. 4.4 ± 0.7; Wilcoxon rank-sum p = 0.045). Bacterial burden and within-sample diversity measures did not differ between the groups.
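The rank correlations (rs) and adjusted p-values (padj) reported above can be illustrated with a short Python sketch. This is not the authors' code, which used R; the function names and the toy values for bacterial burden and relative abundance are hypothetical, and the sketch assumes neither variable is constant.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, index in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical data: log 16S copies and one taxon's relative abundance.
burden = [3.9, 4.2, 4.4, 4.6, 5.0, 5.3]
abundance = [0.01, 0.03, 0.02, 0.05, 0.06, 0.09]
rs = spearman(burden, abundance)
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Working on ranks makes the correlation insensitive to the skewed distributions typical of relative abundance data, and the adjustment controls the false discovery rate across the many taxa tested.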

Because of biological interest in several of the identified bacteria in this study (e.g. Pseudomonas, Streptococcus and Staphylococcus taxa), further efforts to identify the particular species were pursued. For Pseudomonas (OTU0007) which was associated with community variation along PC1, the representative sequence demonstrated little similarity to the P. aeruginosa represented in the mock community DNA. BLAST analysis of the representative sequence for OTU0007 also did not match P. aeruginosa, but rather a range of other species including P. fluorescens complex.

We also attempted to isolate by culture and identify the most prevalent Staphylococcus and Streptococcus species in our dataset (e.g. OTUs 0005, 0012, 0019). From unprocessed frozen (−80 °C) bronchial wash fluid, single colony isolates of Gram-positive cocci were sub-cultured and sequenced using primers for the full-length 16S rRNA gene, plus the rnpB locus (RNA subunit of the bacterial endoribonuclease P complex), as the latter can better differentiate Streptococcus species16. BLAST analysis of these sequences indicated these cultured isolates to be Staphylococcus hominis and Streptococcus salivarius (>97% similarity), which by CLUSTAL alignment were identified as OTU0012 and OTU0019, respectively.


In this sub-cohort from SPIROMICS, enriched in ever-smokers without a clinical diagnosis of COPD or with mild-moderate COPD, we observed previously unrecognized relationships between the compositional structure of lung microbiota, the relative abundances of specific bacterial members, and pathophysiologic attributes of COPD, in particular measures reflective of airways dysfunction. Variation in overall lung bacterial community structure did not associate with a categorical diagnosis of COPD, and only two bacterial taxa were significantly positively associated with COPD status (Streptococcus and Lactobacillales). Instead, we observed more associations between lung bacterial composition and measures of lung function, which provide more granular readouts of airway physiology. Given the milder disease in this cohort, the relationships between these indicators of airways dysfunction and lung bacterial composition hint at interactions in the lung environment that may contribute to airway disease in COPD. To identify specific bacteria that may be involved, we performed taxon-level differential abundance analyses. Although this entailed many comparisons with the lung function measures captured in SPIROMICS, results adjusted for false discovery identified several bacteria negatively associated with FEV1/FVC, FEV1 or FEF25–75 that have previously been implicated in chronic inflammatory airway diseases (e.g. Streptococcus, Staphylococcus, Prevotella, Gemella)17,18. We also noted relationships between lung bacterial composition and patient-centered measures of respiratory symptom burden (CAT score), suggesting possible clinical consequences related to lung dysbiosis in earlier stages of COPD. Together these results, from one of the largest bronchoscopy-based multicenter investigations focused on understanding the pathogenesis of COPD14, suggest that the lung microbiota present in those at risk for COPD may play a role in the pathogenesis of airways dysfunction.

An important strength of our study is the bronchoscopy-based and multicenter nature of our investigation, performed at clinical sites representing the U.S. geographic spectrum. Our study also contrasts with many other studies of the airway microbiome in COPD, where more often sputum has been analyzed to infer insights into the lung microbiome and in patients with more severe COPD or frequent exacerbation history19,20,21,22. These clinical features were neither the focus of this study nor characteristic of this cohort. We also analyzed cellular BAL, which has been shown to capture greater representation of the lung bacterial community and to differ in lung bacterial composition compared to acellular samples23. It is noteworthy that our data did not reveal differences in lung bacterial composition attributable to geographic locale, season in which bronchoscopy was performed, or current smoking status. An earlier multicenter study of the lung microbiome also did not find an association between current smoking and lung bacterial community structure24. We did observe slightly higher lung bacterial burden among current smokers, and it has been reported that BAL from smokers may promote the growth of specific bacterial species25. Our study was unique also in our attempt to culture primary isolates of organisms of interest identified in our analyses (e.g. Staphylococcus, Streptococcus). This led to the identification of two species whose sequences aligned to specific OTUs that may play a role in airways dysfunction. Although Staphylococcus and Streptococcus are known inhabitants of the upper airways, the fact that these species were viably recovered from BAL samples and not identified as contaminants in our analysis suggests they have the potential to interact actively in the lung environment.

Our study has a number of limitations. First, it is a descriptive analysis of the lung microbiota in COPD. Our focus on earlier stage disease, however, renders our study somewhat unique. Second, we did not confirm previously reported relationships between inhaled corticosteroid use and lung bacterial community structure in COPD2. This may be due to very few subjects identified as taking an inhaled steroid around the time of bronchoscopy, potentially limiting this evaluation. Similarly, robust examination of lung microbiota relationships to prospective exacerbation events in SPIROMICS subjects, while of interest, was not possible given the limited number of events in the timeframe of available data. Third, due to logistical challenges we could not directly characterize functional features of the microbiota, nor did we examine the identity of non-bacterial members of the lung microbiome. These topics are of interest to advance understanding of interactions between microbial kingdoms and with host immunity. We had limited paired immunological data available for this analysis but were able to explore relationships to BAL leukocyte numbers, which revealed a significant relationship between Streptococcus and BAL neutrophils. Future studies incorporating such data from the same biological compartment could be insightful, potentially supplementing the modest strength of the associations we observed with individual bacterial taxa. Indeed, efforts are underway in SPIROMICS to examine these aspects including possible functional interactions between the lung microbiota and host. Lastly, the current study was solely a cross-sectional analysis. While logistically challenging, longitudinal evaluations underway in the current phase of SPIROMICS will permit deeper evaluation of lung microbiota-immune interactions over time and how they relate to COPD development or progression.

The relationships we observed between the lung microbiota and measures of lung function, in particular BDR, FEF25–75 and PEFR, will also require further study. As with analogous findings in idiopathic pulmonary fibrosis26,27, a different lung disease linked in part to smoking, longitudinal studies of COPD could provide stronger evidence of the directionality of these associations and whether particular implicated microbiota precede rather than result from loss of lung function. Molecular mechanisms for the microbial associations with BDR are at present unclear, but this consistent finding in our analyses regardless of the inclusion of never-smokers argues for potential relevance to COPD pathophysiology. BDR measures can be variable over time, but in our subjects were not significantly different between their baseline visit and closest annual study visit before bronchoscopy. We observed similar associations with BDR when we used the baseline measures (not shown). Prior studies have reported relationships between airway dysbiosis and the presence of wheeze or airway hyper-reactivity in asthmatic subjects28,29, including involvement of Streptococcus and Staphylococcus species29. A recent analysis from COPDGene30, a more severe disease cohort, found that the presence of BDR was associated with radiographic evidence of small airway disease and more exacerbations.

Overall, the relationships observed lead us to speculate that airflow obstruction results in or could also reflect a mutually reinforcing relationship within microenvironmental conditions that exert selective pressure on the compositional make-up and functional behavior of microbiota in the niche. We previously reported that total mucin concentrations in sputum inversely associated with FEF25–75 in SPIROMICS subjects31, which may reflect one aspect of changes in the lung environment that could shape the microbiome. COPD is also associated with loss of several host defense factors that may facilitate bacterial invasion of airway epithelium32,33,34. Over time, evolving interactions between host and microbiota may provoke further immune responses and contribute to perpetuation of airway inflammation. Synergistic interactions between microbiota, as has been shown for Streptococcus and Veillonella species35,36,37,38, may also contribute. This ecological perspective of host-microbial interactions aligns with the “vicious cycle hypothesis” of inflammation and ‘infection’ in COPD39 and is important to maintain in considering how such interactions may affect airway disease pathogenesis in COPD. Findings from this study suggest additional avenues for further translational and mechanistic investigations into lung microbiota-host interactions in early stage COPD.


Subject characterization and research bronchoscopy

SPIROMICS is a multicenter observational study (NCT01969344) of never-smokers, smokers without airflow obstruction or with mild, moderate or severe COPD. All SPIROMICS participants (n = 2981) underwent detailed clinical characterization and collection of biological and radiographic data, as previously described6,40. The institutional review boards of all SPIROMICS sites approved the main study protocol, and all participants provided written informed consent. Pulmonary function testing was performed according to the 2005 ATS/ERS guidelines. Ever-smokers were defined as current or former smokers with a smoking history of ≥20 pack-years; former smokers were those who had been free from use of any tobacco products for 6 months preceding the bronchoscopy. A subset of subjects (n = 215; inclusive of never-smokers, smokers with normal lung function, mild-moderate COPD, and severe COPD) additionally participated in a bronchoscopy sub-study14, which was approved by the institutional review boards of the participating SPIROMICS sites and for which subjects provided separate informed consent. The institutions that enrolled subjects for the research bronchoscopy included Columbia University/Weill Cornell Medical College, National Jewish Health (Denver, Colorado), University of Alabama at Birmingham, University of California at Los Angeles, University of California at San Francisco, University of Michigan, University of Utah, and Wake Forest University School of Medicine. Bronchoscopy was performed a median of 62 days after the preceding annual study visit (Supplementary Fig. 1). Most of the clinical measures used in our analyses were obtained at that visit and did not differ significantly from earlier measurements (Supplementary Table 1). Samples of BAL allocated for microbiome analysis (10 mL into RNALater) were obtained from 181 subjects, along with biological control specimens (e.g. oral wash/tongue scraping, instrument samples such as bronchoscope suction channel flushes with sterile saline, negative controls of sterile saline or lidocaine). Remaining BAL was saved for other purposes in SPIROMICS including a 5-mL aliquot for potential microbial cultivation experiments. Samples were stored long-term at −80 °C until processing for this study.

Sample processing, microbiota sequencing, and raw data processing

Five mL of BAL, oral rinse/tongue scraping, and bronchoscope flushes were centrifuged at high-speed (13,000 rpm) to pellet cellular material. Total genomic DNA was extracted from cell pellets using a commercial kit (DNeasy Blood and Tissue kit, Qiagen) with the following modifications: after re-suspension in ATL buffer the samples were transferred to Lysing Matrix E tubes (MP Biomedicals) and subjected to bead beating for 30 s before proceeding. The manufacturer’s protocol was modified to use 40 μL proteinase K instead of the recommended 20 μL, and samples were eluted with 100 μL of buffer AE instead of the suggested 200 μL. Negative extraction controls were generated by performing the extraction protocol without addition of sample but with sterile PBS.

The Microbial Systems Molecular Biology Laboratory (Microbiome) Core at the University of Michigan performed 16S rRNA gene (V4 region) sequencing using barcoded dual-index primers41 on an Illumina MiSeq (Illumina, San Diego, CA). Given the low biomass of samples, we employed a touchdown PCR amplification strategy42. A mock reference community (ZymoBIOMICS; Zymo Research; Irvine, CA) was included for quality control. Library preparation/sequencing controls (including elution buffer, sterile water, PBS, empty wells) also were sequenced, several of them in replicate. Normalized pooled libraries were sequenced on the Illumina MiSeq platform using the 500 cycle MiSeq V2 Reagent kit. For internal testing purposes, BAL samples were sequenced in triplicate (batch 2 only), while all other samples were sequenced once. The distribution of subjects by SPIROMICS group assignment did not differ between the two sequencing batches (Fisher exact test p > 0.05).

16S rRNA gene sequence reads were processed using mothur (v1.40.5)43, aligned to the SILVA reference alignment (release 102)44 and classified using the RDP 16S rRNA reference training set (version 16)45. The range of read counts in samples was 577 to 53,155, with the average being 21,444. Operational taxonomic units (OTUs) were defined using 97% similarity threshold.

Results of negative control samples (raw data prior to decontamination analysis using R package decontam46) are provided as Supplementary Table 5 and Supplementary Fig. 2. The prevalence method in decontam was used. The two batches of sample runs were decontaminated separately based on the negative controls collected/generated in parallel for the samples in a given batch. Initially, there were a total of 5172 OTUs and 986 OTUs were removed after decontam analysis. Eight of these OTUs were in common between the two decontamination runs (Otu0056, Otu0300, Otu0078, Otu1812, Otu0111, Otu0154, Otu0047, and Otu0204) and OTU 0011 was also removed because it appeared to be a contaminant. This brought the total of OTUs to 4193 prior to filtering by relative abundance. To compare further the two batches of samples, it was necessary to choose a single replicate sample from each subject in batch 2. For each patient, the geometric mean of normalized OTU counts was calculated to create a representative community to compare samples against. Then the Bray–Curtis distance between each individual sample and the mean sample was calculated and the sample with the lowest distance to the representative sample kept.
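The replicate-selection step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: "normalized" is taken to mean relative abundance, the geometric mean is computed literally (so an OTU absent from any replicate gets a mean of zero, since the source does not say how zeros were handled), and all function names and counts are hypothetical.

```python
def relative_abundance(counts):
    """Normalize one replicate's OTU counts to proportions."""
    total = sum(counts)
    return [c / total for c in counts]


def geometric_mean_profile(profiles):
    """Per-OTU geometric mean across replicate profiles."""
    n = len(profiles)
    mean_profile = []
    for otu_values in zip(*profiles):
        product = 1.0
        for value in otu_values:
            product *= value
        mean_profile.append(product ** (1.0 / n))
    return mean_profile


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator > 0 else 0.0


def pick_representative(replicate_counts):
    """Return the index of the replicate closest to the mean profile."""
    profiles = [relative_abundance(c) for c in replicate_counts]
    centre = geometric_mean_profile(profiles)
    distances = [bray_curtis(p, centre) for p in profiles]
    best = min(range(len(profiles)), key=lambda i: distances[i])
    return best, distances


# Hypothetical triplicate counts for one subject (three OTUs).
replicates = [
    [50, 30, 20],
    [52, 28, 20],
    [20, 30, 50],
]
best_index, distances = pick_representative(replicates)
```

Choosing the replicate nearest the mean profile keeps a real observed sample for downstream analysis rather than a synthetic average, while discounting any outlying replicate.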

The count table, taxonomy table and phylogenetic tree file (picante)47 were imported into R for analysis using phyloseq (v1.26.0)48. For alpha-diversity calculations, reads were subsampled to the lowest count (577). All other analyses were performed with non-subsampled data, using relative abundance. The OTUs were filtered by average relative abundance >0.1% (final total of 125 OTUs analyzed). Principal component analysis of Hellinger-transformed abundance data from 84 paired BAL and oral rinse/tongue scrape samples demonstrated distinct segregation (clustering) of the two sample types (Supplementary Fig. 3; PERMANOVA p < 0.05).
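The Hellinger transformation used before PCA replaces each count with the square root of its within-sample relative abundance. The sketch below shows the transformation in plain Python; it is illustrative only (the study used R), and the function names and toy count table are hypothetical.

```python
import math


def hellinger_transform(count_table):
    """Square root of each OTU's relative abundance within its sample.

    count_table is a list of samples, each a list of OTU counts.
    """
    transformed = []
    for row in count_table:
        total = sum(row)
        if total == 0:
            raise ValueError("sample has no reads")
        transformed.append([math.sqrt(c / total) for c in row])
    return transformed


def euclidean(u, v):
    """Euclidean distance between two transformed samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical count table: three samples, three OTUs.
counts = [
    [10, 0, 0],
    [0, 5, 0],
    [4, 4, 8],
]
hellinger = hellinger_transform(counts)
distance_01 = euclidean(hellinger[0], hellinger[1])
```

Euclidean distance on the transformed rows is the Hellinger distance, which is why ordinary PCA can be applied afterwards. The square root also damps the influence of highly abundant OTUs, and samples of different sequencing depth become comparable because each row is scaled by its own total.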

Statistical/data analysis

Clinical data from the most recent annual SPIROMICS visit preceding bronchoscopy were used wherever possible (SPIROMICS Core dataset CORE5_20180910). For exploratory analyses, we also included information available only from the baseline SPIROMICS visit. Non-parametric tests and Fisher’s exact tests were used where appropriate. Spearman correlation (rs) was used to compare continuous measures including alpha-diversity indices, with Benjamini–Hochberg correction for multiple comparisons49. To identify variables associated with bacterial community differences between subjects (i.e. beta-diversity), univariate distance-based PERMANOVA50,51 based on Bray–Curtis or weighted Unifrac distances was performed (vegan v2.5-427), with 1000 permutations of the input matrices for determination of significance. The Unifrac distance incorporates phylogenetic information to capture the relatedness of bacterial communities between samples. Taxon-level differential abundance was determined by DESeq2 which utilizes a negative binomial generalized linear model with Wald’s test52.
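To make the distance-based PERMANOVA concrete, the following Python sketch computes a pseudo-F statistic from a Bray-Curtis distance matrix and derives a p-value from label permutations. It is a simplified illustration, not the vegan implementation used in the study: it handles only a single categorical grouping (the study also tested continuous clinical variables, which requires a regression formulation), and the function names, sample counts and group labels are hypothetical.

```python
import random


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator > 0 else 0.0


def distance_matrix(samples):
    """Full symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    n = len(samples)
    return [[bray_curtis(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]


def pseudo_f(dist, labels):
    """Pseudo-F ratio of among-group to within-group variation."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for group in groups:
        members = [i for i, lab in enumerate(labels) if lab == group]
        squared = sum(dist[i][j] ** 2
                      for k, i in enumerate(members) for j in members[k + 1:])
        ss_within += squared / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(dist, labels, permutations=1000, seed=0):
    """Return the observed pseudo-F and its permutation p-value."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    # Add-one convention: the observed labelling counts as one permutation.
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical OTU counts for two groups of four samples each.
samples = [
    [60, 30, 10], [55, 35, 10], [65, 25, 10], [58, 32, 10],
    [10, 30, 60], [12, 33, 55], [8, 27, 65], [10, 35, 55],
]
labels = ["A"] * 4 + ["B"] * 4
f_stat, p_value = permanova(distance_matrix(samples), labels)
```

Because significance comes from shuffling labels rather than from a distributional assumption, the test suits compositional microbiome data. The smallest attainable p-value is set by the number of permutations.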

Assessment of bacterial burden and culture-based studies

Copy numbers of the 16S rRNA gene were assessed by quantitative (real-time) PCR using a standard curve comprising a 10-fold dilution series from 10^8 to 10^2 copies per microliter. TaqMan reagents (ABI) were used to amplify 3 µL of DNA; nuclease-free water was used in place of DNA for a no-template control. All reactions were performed in triplicate. Primers and probe were as follows: forward primer (TGG AGC ATG TGG TTT AAT TCG A), reverse primer (TGC GGG ACT TAA CCC AAC), probe (FAM-CAC GAG CTG ACG ACA CCC ATG CA-BHQ). Amplification was performed on the ABI Step One Plus thermocycler and data analyzed using Step One software version 2.1. To further identify primary species of interest (e.g. Streptococcus, Staphylococcus), aliquots of raw frozen BAL were cultured onto blood agar and in aerobic broth (sBHI medium) at 37 °C overnight, followed by sub-culture of individual colonies, which by Gram stain revealed Gram-positive cocci in chains and clusters. Fresh liquid cultures were grown from single colonies and DNA isolated. Sequencing of the 16S ribosomal RNA gene and alignment in BLAST indicated two isolates in the genus Streptococcus and one isolate of Staphylococcus hominis. CLUSTAL was used to align the latter sequence with the representative sequences obtained from the BAL community and identify Staphylococcus hominis (OTU0012). Due to high sequence similarity in the 16S rRNA gene for the genus Streptococcus, sequencing of the rnpB locus was performed16 to determine species identity.
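The standard-curve quantification described at the start of this section amounts to a linear fit of threshold cycle (Ct) against log10 copy number, followed by interpolation of unknown samples. The Python sketch below illustrates that calculation; it is not the Step One software's procedure, and the Ct values, function names and the unknown sample are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Standards: 10-fold dilutions from 10^8 to 10^2 copies per microliter.
log10_copies = [8, 7, 6, 5, 4, 3, 2]
# Hypothetical mean Ct values measured for those standards.
ct_standards = [12.1, 15.4, 18.8, 22.1, 25.5, 28.8, 32.2]

slope, intercept = fit_line(log10_copies, ct_standards)

# Amplification efficiency implied by the slope (1.0 means perfect doubling).
efficiency = 10 ** (-1.0 / slope) - 1.0


def copies_from_ct(ct):
    """Interpolate copies per microliter for a sample's Ct value."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical BAL sample Ct; burden is reported on the log10 scale.
sample_copies = copies_from_ct(22.1)
```

A slope near −3.3 cycles per 10-fold dilution corresponds to roughly 100% amplification efficiency. Ct values outside the range spanned by the standards would be extrapolations and should be treated with caution.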

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.