Lung microbiota across age and disease stage in cystic fibrosis

Understanding the significance of bacterial species that colonize and persist in cystic fibrosis (CF) airways requires a detailed examination of bacterial community structure across a broad range of age and disease stage. We used 16S ribosomal RNA sequencing to characterize the lung microbiota in 269 CF patients spanning a 60 year age range, including 76 pediatric samples from patients of age 4–17, and a broad cross-section of disease status to identify features of bacterial community structure and their relationship to disease stage and age. The CF lung microbiota shows significant inter-individual variability in community structure, composition and diversity. The core microbiota consists of five genera - Streptococcus, Prevotella, Rothia, Veillonella and Actinomyces. CF-associated pathogens such as Pseudomonas, Burkholderia, Stenotrophomonas and Achromobacter are less prevalent than core genera, but have a strong tendency to dominate the bacterial community when present. Community diversity and lung function are greatest in patients less than 10 years of age and lower in older age groups, plateauing at approximately age 25. Lower community diversity correlates with worse lung function in a multivariate regression model. Infection by Pseudomonas correlates with age-associated trends in community diversity and lung function.

their relative abundance, dominance over other taxa, and temporal stability are potentially important determinants of ecological and clinical change in lung disease 10,15 . While studies of the CF microbiome to date have largely focused on adult patients, we expand this base of knowledge by including 76 pediatric patients less than 18 years old (28% of the total sample). This early time period is particularly interesting from the perspective of CF microbiology since community composition is more dynamic in pediatric patients than in adults 16,17 . Here, we report a cross-sectional survey of the bacterial community in sputum samples from 269 CF patients spanning a 60-year age range and a wide range of lung function.

Results
Clinical and microbiological characteristics of patients. During the study period, 76 pediatric (<18 years old) and 193 adult CF patients provided at least one expectorated sputum sample for analysis. Complete clinical data was available for 73 pediatric and 191 adult patients. The characteristics of these patients are reported in Table 1. Pancreatic insufficiency was more common in pediatric patients (P < 0.001). A larger proportion of adult patients had intermediate (FEV 1 < 70% predicted) or advanced (FEV 1 < 40% predicted) disease (68.6% vs 44.3% of pediatric patients, P = 0.001) and the median FEV 1 was lower in the adult cohort, reflecting the progression of disease with increasing age (P < 0.001). Adult patients were more likely than pediatric patients to be infected by Bcc or Aspergillus species by sputum culture (P = 0.037, 0.011 respectively), with a trend towards increased infection with P. aeruginosa (P = 0.085). Individual sample and genus characteristics. After quality and abundance filtering, 87 genera were identified across all samples. The size of the core microbiota of adult and pediatric samples, arbitrarily defined in our study as genera with a relative abundance of ≥1% of sequence in ≥50% of samples, were similar in both groups (Fig. 1A, Table 2). Streptococcus, Prevotella, Rothia, Veillonella and Actinomyces were core genera in both groups, while Neisseria, Haemophilus and Gemella were unique core genera in pediatric samples and Pseudomonas was unique to the adult core microbiota. In both groups, the genera with the greatest prevalence also had the greatest mean relative abundance (Fig. 1B). A small subset of genera comprised most of the bacterial sequences detected in any sample ( Fig. 1C and D, Table 3). The median number of genera required to account for half of the sequences in each sample was two in both groups, while a median of 20 and 13 genera accounted for 99% of the sequence in pediatric and adult patients, respectively (Table 3).
A dominant genus (the most abundant genus with at least twice the abundance of the second most abundant genus) was present in 45% of pediatric samples and 57% of adult samples. Streptococcus, Haemophilus, Pseudomonas, Staphylococcus and Achromobacter were dominant in multiple pediatric samples in descending order of frequency, while Neisseria and Stenotrophomonas were each dominant in one pediatric sample. Pseudomonas, Burkholderia, Streptococcus, Haemophilus and Staphylococcus were the dominant genus in descending order of frequency in adults ( Table 2). The median relative abundance of the most abundant genus in each sample was 0.38 in pediatric samples (IQR 0.29-0.50) and 0.46 in adults (IQR 0.32-0.71), the median relative abundance of dominant genera was 0.64 (range 0.26-0.99).
Rothia, Prevotella, Gemella, Actinomyces and Veillonella were prevalent in both pediatric and adult samples, but were seldom or never the dominant genus. Pseudomonas was more prevalent in adults but tended to be dominant in both pediatric and adult samples in which it was present (Table 2). Streptococcus, Burkholderia, Achromobacter, Stenotrophomonas and Haemophilus were commonly the dominant genus in samples in which they were present. Bordetella accounted for 95% of the sequence in one adult specimen from a patient identified by culture as being chronically infected by Bordetella avium, but was not present in any other samples. In addition to having a high dominance proportion (the  Within sample (alpha) diversity. Advanced disease stage was associated with lower diversity in adult patients (P = 0.004, Fig. 2A). This difference was not statistically significant in pediatric patients; however, only 6 pediatric patients had advanced disease. Baseline samples were more diverse than treatment samples for adults (P = 0.002, Fig. 2B). There was a stepwise, statistically significant decrease in diversity associated with increased number of antibiotics prescribed in both cohorts (Fig. 2C). Streptococcus dominance was associated with higher community diversity in both datasets (Fig. 2D).

Between sample (beta) diversity.
We assessed inter-sample variability in community structure (beta diversity) using unsupervised principal coordinates analysis (PCoA) of Bray-Curtis dissimilarity (Fig. 3). The relative abundance of Pseudomonas, Burkholderia, Streptococcus and Haemophilus largely drove ordination (Fig. 3). Samples did not cluster by clinical status at the time of sample acquisition (baseline, exacerbation, treatment, recovery or other). Notably, Bray-Curtis ordination was influenced by disease stage differently across the age groups, with clustering based on disease stage becoming more apparent after age 30 (Fig. 4). This analysis did not reveal any grouping of bacterial community structures by gender, CFTR genotype, pancreatic status, FEV 1 , FVC or BMI.
Age group and disease status associated bacterial community changes. Patients were stratified into age groups and compared based on lung function, community diversity and dominant genera (Fig. 5). As expected, FEV 1 was lower in older age groups, with the lowest median FEV 1 in patients aged 30-34 years (Fig. 5A). Bacterial density (log 10 ng/μ L of 16S rRNA gene by quantitative PCR) was greatest at ages 20-24 (Fig. 5B). Diversity was also lower in older age groups, and differences in community diversity paralleled lung function across age groups (diversity nadir 30-34 years, Fig. 5C). In addition, the lower lung function and diversity observed in older age groups coincided with a higher proportion of samples dominated by Pseudomonas or Burkholderia and a decreased proportion of Streptococcus dominance and samples without a dominant genus (Fig. 5D)   Predictors of FEV 1 by multivariate linear regression. Given the observed correlation of bacterial diversity, density and dominant taxa for lung function across age groups, we performed multivariate linear regression to ascertain which features of the lung microbiota independently predict lung function while controlling for the influence of host factors. Factors with known significance and potential confounders present in our dataset were entered into the multivariate model prior to sequentially testing diversity (SDI), bacterial density (log 10 16s rRNA gene by quantitative PCR) and dominant taxa based on the strategy outlined in the methods section. The final model included age, BMI z-score, pancreatic insufficiency, SDI and Pseudomonas dominance (adjusted R 2 = 0.315, df = 5, F = 22.9, P < 0.0005, Table 4). Increasing age, pancreatic insufficiency, increasing number of antibiotics and Pseudomonas dominance negatively correlated with lung function, while increasing BMI and diversity positively correlated with lung function in our model.

Discussion
We describe the bacterial community structure of sputum samples from 269 cystic fibrosis patients across a broad range of age, disease stage, clinical status and treatment using a culture-independent method. This is the largest cohort of CF patients spanning both adult and pediatric populations for whom sputum has been analysed by 16S rRNA gene sequencing yet published, and includes a large number of patients younger than 18 years of age. We observed several clear features of bacterial community structure in CF sputum and their relationship with disease states. Firstly, the airways of CF patients share only a small number of common taxa, but otherwise demonstrate a high degree of inter-individual variability in community structure, composition and diversity, confirming the results of smaller studies [5][6][7] . In pediatric and adult CF patients, the shared core microbiome was small, and consisted largely of genera not considered to be traditional CF pathogens, namely Streptococcus, Rothia, Veillonella and Actinomyces. Pseudomonas, a prototypic CF pathogen, was a member of only the adult core microbiota. A small number of taxa accounted for the bulk of bacterial sequences present in any sample, and approximately half of all patients had a bacterial community dominated by a single genus.
Secondly, the observed taxa in the CF lung vary in their relative abundance and dominance of the bacterial community. Only six genera had a dominance proportion >0.10 and a maximum relative abundance >0.70: Pseudomonas, Burkholderia, Stenotrophomonas, Achromobacter, Haemophilus and Bordetella. Of the remaining genera, only Staphylococcus, Streptococcus and Bifidobacterium were observed at a relative abundance of at least 0.50 in one or more samples.
In addition, whereas within sample diversity correlated with several patient, treatment and microbial factors, between patient diversity was largely driven by the age-associated relative abundance of Pseudomonas and Burkholderia. Age, disease stage, treatment and dominant genus were found to significantly correlate with differences in alpha diversity, confirming the observations of previous published reports of smaller datasets 8,[11][12][13][16][17][18][19][20] . The largest differences between patient communities noted in our study were the greater prevalence and relative abundance of Pseudomonas and Burkholderia in older age groups. As patients with CF age, their risk of acquisition of these organisms increases 17,21 , and we observed a stepwise progression in the prevalence and relative abundance of these two important genera with increasing age. A notable increase in Pseudomonas and Burkholderia dominance was observed in patients aged 25 and older, which coincided with the lowest lung function and sample diversity observed in any age group. Beyond this age, there are few differences in overall community structure, diversity or lung function in surviving adult patients, and it is possible that the establishment of these CF associated pathogens is largely complete by age 20-25. These findings corroborate the findings of smaller cross-sectional studies of CF microbiota over a large age range, including the 'plateau' of community and functional change at approximately 25 years of age 17,22 .  This study also demonstrated that sample diversity was positively correlated with lung function after controlling for host factors. Multivariate regression of clinical and bacterial features for prediction of lung function revealed that age, BMI z-score, pancreatic insufficiency, diversity and Pseudomonas dominance independently predict lung function. The observation that diversity positively correlates with lung function agrees with other studies in CF and non-CF bronchiectasis 8,9,23 . Our and previously published analyses do not address causality, however, and it remains uncertain whether the relationship between disease stage and diversity is causal.
This study has important caveats. The prevalence of infection with Pseudomonas in this study was lower than published registry data for both pediatric and adult patients 24 . Similarly, none of the included specimens were culture-positive for Mycobacterium, despite recently reported prevalence rates of nearly 10% 25 . However, prevalence of all species by culture in our study was determined for the study sample only, not for historical samples from each patient, and thus is unsurprisingly lower than published reports which generally define infection as culture positivity within a time range, rather than at a single timepoint.
Adequacy of sampling in CF microbiome studies is difficult to conclusively ascertain and is a challenge in similar studies. Culture-independent analysis by 16S rRNA gene sequencing is not uniformly sensitive to the detection of all bacterial taxa, such as Staphylococcus, in the absence of specific sample processing procedures 26 . Our study was not designed to systematically compare within-sample relationships between culture and 16S rRNA sequence data. As such, our analysis of Staphylococcal relative abundance should be interpreted cautiously.
Since our cohort did not include paired specimens from the same patient, this analysis was not designed to identify differences in bacterial community characteristics occurring at times of different disease activity (such as during an exacerbation). Given the highly variable bacterial community structure in CF which we and others [5][6][7] have observed, between-patient differences are likely to obscure important within-patient trends occurring during exacerbations.
Furthermore, our analysis was not designed to assess the impact that rare or very low abundance taxa (< 1% relative abundance) may have on the biology of the CF airway, but rather to assess trends in community structure over a broad range of ages and disease stage. Very low abundance and rare taxa may be critical determinants of pathological processes in the CF lung. Our observations highlight the heterogeneity between patients and confirm the need for longitudinal analysis in individual patients and groups of patients to identify factors associated with disease exacerbations, disease progression and response to treatment. Our method resolved community composition to the genus level only. Significant within-genus or intra-species variability in community composition may have important effects on community behavior and disease pathogenesis and has the potential to explain cross-sectional variability in disease status that is not apparent at the genus level.
Despite these caveats, our data agree with prior reports of CF microbiota, and extend these observations to a broader age range of patients, particularly those less than 18 years of age, a group that is not well represented in previously published reports. Our principal findings agree with and extend those of smaller studies of mostly adult patients showing high inter-individual variability in community structure 5-7 , a common core microbiota 9 , age associated changes in community composition 16,17 , and disease stage associated decreases in diversity 8,[11][12][13][16][17][18][19][20] .
Longitudinal studies identifying microbial and host risk factors for infection with patho-adapted taxa such as Pseudomonas and Burkholderia, and mechanistic studies elucidating the interactions of these taxa with other colonizing organisms, as well as exploration of possible causal relationships between community diversity and disease manifestations would be illuminating.

Methods
Study enrollment, sample and data collection. This study received approval from the research ethics boards of both St. Michael's Hospital and the Hospital for Sick Children in Toronto, Canada. Patients provided informed consent and were enrolled prospectively at The Hospital for Sick Children and St. Michael's Hospital between 2010 and 2013. Height, weight, forced expiratory volume in one second (FEV 1 ), forced vital capacity (FVC), age, sex, disease stage, antimicrobial therapy, clinical status, cystic fibrosis transmembrane conductance regulator (CFTR) mutation, pancreatic status, and visit outcome were recorded for the day of sample collection. Expectorated sputa were collected for all patients -no oropharyngeal samples were used. Pediatric samples were split into aliquots, one for culture and one for nucleic acid extraction, while adults provided two samples in parallel at each clinic visit. Samples for sequencing were treated with sputolysin and frozen for subsequent processing. Quantitation of 16S rRNA gene copy number was performed by quantitative PCR based on published protocols 27 .
Si-Seq 16S rRNA gene analysis and informatics. Sequencing was performed at the Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto as previously described using an Illumina GAIIx 28 . An in-house Java script was used to prepare the sequence for downstream analysis using MACQIIME 29 . Chimeric reads were removed using reference-based chimera detection with USEARCH 30 . Operational taxonomic units (OTUs) were picked against a SI-Seq structured reference set using a similarity of 0.87 (corresponding to the genus level with the SI-Seq method), with premature termination turned off. Taxonomy was assigned using an RDP structured dataset using Scientific RepoRts | 5:10241 | DOi: 10.1038/srep10241 UCLUST 30 with --max_accepts set to 0 and similarity set to 0.97, (empirically determined to produce the highest consistency identification at the genus level using Pseudomonas aeruginosa controls) 28 . Unassigned OTUs were identified with BLASTN 31 . OTUs representing less than 0.005% of overall sequence were removed. At this stage, the median number of sequences/sample was 67,812 (IQR 42,835-87,773). For calculation of Shannon diversity index (SDI), samples were rarefied to 1500 sequences to account for the effects of sequencing depth on diversity, with a median Good's coverage of 0.972 (IQR 0.955-0.979) at that sampling depth. Diversity and principal coordinates analyses were performed using MACQIIME with default settings.
Statistical analyses. Statistical analyses were performed using SPSS version 22 (IBM). Means were compared using T-tests or ANOVA with Bonferroni correction as needed. Chi-squared tests were used to test proportions. A multivariate linear regression model predicting FEV 1 (% predicted) was built by entering all variables, then using backwards removal to identify variables in the final model.

Definitions.
Disease stage, pancreatic insufficiency, FEV 1 , and FVC were defined as previously reported 32,33 . BMI z score was calculated using the method described in Flegal et al. 34 using age-specific parameter values from the Centers for Disease Control (www.cdc.gov/growthcharts/percentile_data_ files.htm). For adults over 20 years old, the age values for BMI z score were calculated based on an age of 20 35 . Antibiotic number equaled the number of antibacterials of any type being taken on the day of the visit (including inhaled antibiotics). Clinical status was defined as 'baseline' , 'exacerbation' , 'treatment' or 'recovery' using the criteria of Zhao et al. 11 . Samples not meeting these criteria were categorized as 'other' and included patients receiving new prescriptions for steroids, antifungals or antibiotics for an indication other than a pulmonary exacerbation (e.g. an extra-pulmonary infection). 'Dominant genus' was defined as any genus whose relative abundance was at least twice that of the next most abundant genus 10 or if no such genus was present, 'no dominant genus' .
All methods were performed in accordance with the approved guidelines.