Profiling of lung microbiota discloses differences in adenocarcinoma and squamous cell carcinoma

The lung is a complex ecosystem of host cells and microbes often disrupted in pathological conditions. Although bacteria have been hypothesized as agents of carcinogenesis, little is known about microbiota profile of the most prevalent cancer subtypes: adenocarcinoma (ADC) and squamous cell carcinoma (SCC). To characterize lung cancer (LC) microbiota a first a screening was performed through a pooled sequencing approach of 16S ribosomal RNA gene (V3-V6) using a total of 103 bronchoalveaolar lavage fluid samples. Then, identified taxa were used to inspect 1009 cases from The Cancer Genome Atlas and to annotate tumor unmapped RNAseq reads. Microbial diversity was analyzed per cancer subtype, history of cigarette smoking and airflow obstruction, among other clinical data. We show that LC microbiota is enriched in Proteobacteria and more diverse in SCC than ADC, particularly in males and heavier smokers. High frequencies of Proteobacteria were found to discriminate a major cluster, further subdivided into well-defined communities’ associated with either ADC or SCC. Here, a SCC subcluster differing from other cases by a worse survival was correlated with several Enterobacteriaceae. Overall, this study provides first evidence for a correlation between lung microbiota and cancer subtype and for its influence on patient life expectancy.


Materials and Methods
Sample collection. BALF was collected from subjects undergoing bronchoscopy for evaluation of lung disease at three hospitals in Portugal: Centro Hospitalar São João (CHSJ), in Porto; Centro Hospitalar Baixo Vouga (CHBV), in Aveiro; and Hospital Pulido Valente -Centro Hospitalar Lisboa Norte (CHLN), in Lisbon. Informed consent was obtained for all participants and sample collection for Human Research was approved by hospital ethical committees: Comissão de Ética para a Saúde (CES) -CHSJ, Comissão de Ética -CHBV and Comissão de Ética para a Saúde (CES) -CHLN. The study was conducted in accordance with ethical guidelines and regulations for Human research and with Helsinki Declaration. Sample collection was targeted toward affected lung segments and done by bronchoscope wedging into subsegmental lung regions. In this study, we used only bronchoscope working channel washes, which were done twice with a minimum volume of 15 mL (0.9% saline solution). Samples were then stratified in LC (N = 49) or controls (N = 54) based in positive or negative cytology results (Supplementary Table 1). However, in a follow-up analysis carried out up to 2 years after BALF collection, two cases were found to be false positives and another four initially classified as negative, over time progressed to LC 26 . The pooled sample strategy prevented the reallocation of these cases to controls and vice-versa.
Lung microbiota 16S rRNA screening. DNA extraction from BALF (200 µL) was performed using DNA Mini kit (Qiagen) according to manufacture instructions for capturing bacterial DNA in body fluids. Two 16S rRNA fragments spanning hypervariable regions V3-V4 and V4-V6 were amplified using universal primers (Supplementary Table 2). Pooled samples containing PCR products (~200 ng/sample) were generated for LC (N = 49) and controls (N = 54) and processed as previously described 27 . Briefly, two libraries were constructed according to Ion Xpress ™ Plus Fragment Library Kit protocol (Life Technologies) and ran in an Ion PGM ™ System -316 ™ chip (Life Technologies). The tools USEARCH, UCHIME, QIIME and Greengenes were used in the analysis of operational taxonomic units (≥97% nucleotide sequence identity cut-off) as previously described 27 . TCGA dataset. Raw RNAseq reads from tumors and clinical data files corresponding to 515 ADC and 501 SCC cases from TCGA, were downloaded from Genomic Data Commons (GDC) Data Portal (https://gdc.cancer. gov/). To perform a quantitative analysis of lung microbiota we used reads not aligning with human reference sequence (unmapped reads) as input for QmihR 28 . This pipeline combines Trimmomatic, Bowtie2, and RSEM for a probabilistic inference of bacterial taxa abundances 28 . To instruct bacterial sequence surveys, we first defined a microbiota reference panel based in previous evidence of lung colonization in healthy and diseased patients. Exactly, we considered a total of 567 bacterial taxa according to the data available in: 1) the Human Microbiome Project (HMP) -airways; 2) specialized literature; and 3) our own 16S rRNA study; for which whole genome sequences could be collected from https://www.ncbi.nlm.nih.gov/genome/microbes/. Upon quality control, relative abundances of 112 genera were obtained for 509 ADC and 500 SCC cases. These samples were then stratified according to several variables including ancestry (European or African), gender, age at diagnosis (≤65 or >65), anatomic positioning (Upper or Lower lung), localization in lung parenchyma (Peripheral and Central Lung) and pathological tumor stage (Stages I, II and III + IV). In addition, post-bronchodilator forced expiratory volume in 1 second (FEV1) and forced vital capacity (FVC) ratio were used to determine the presence (FEV1/FVC < 0.70) or absence of airflow obstruction 3 . Smoking history in pack per years (PPY) was considered using a first 20 PPY subdivision (data not shown). Given that most cases largely surpassed this value, a naïve 45 PPY split was used instead based on its proximity to average values (all cases 48; ADC 42 and SCC 53 PPY). Several patient follow-up variables were also considered in this study to evaluate clinical significance of collected data. These included vital (dead or alive) and cancer (tumor free or with tumor) status, days to death and primary therapy outcome. Statistical analysis. Statistical analysis of microbiota diversity was performed in R studio (https://www. rstudio.com/; version 1.1.383) using phyloseq 29 . Alpha diversity was evaluated through inverse Simpson and Shannon indexes. Beta diversity, which integrates phylogenetic relationships of bacteria was calculated by weighted UniFrac. Distances matrixes were used in Principal coordinates analysis (PCoA) and in hierarchical clustering (complete linkage) of TCGA samples. The linear discriminant analysis (LDA) effect size (LEfSe) algorithm 30 was used to detect taxa with differential abundances between TCGA cases. Survival analyses and log rank tests for pairwise comparisons of different case sets were carried out through the Cohort Comparison tool available at GDC Data Portal.
Differentiation of ADC and SCC subtypes. The availability of clinical parameters for TCGA cases allowed an in-depth analysis of possible factors affecting lung microbiota. Aside from some variability in genera abundance per cancer subtype (Fig. 1b), we found that SCC tends to show higher diversity than ADC as indicated by inverse Simpson (Fig. 2a). This difference seems to be correlated with European ancestry, male gender, heavy smoking (PPY >45) and older ages at the time of diagnosis (>65 years) (Fig. 2b). Yet, we could not detect any effect on microbiota diversity when considering tumor localization, upper or lower lung and central or peripheral parenchyma (Fig. 2b). (a) Inverse Simpson and Shannon indexes for LC cases grouped by histological subtype. (b) Inverse Simpson index of LC subtypes grouped according to different clinical variables available at TCGA database (ancestry, gender, age at diagnosis, smoking history, lung region and lung parenchyma). Welch's t-test was used to access statistical significance of pairwise comparisons (*P-value < 0.05; **P-value < 0.01, ***P-value < 0.001). ADC: adenocarcinoma. SCC: Squamous cell carcinoma. (2019) 9:12838 | https://doi.org/10.1038/s41598-019-49195-w www.nature.com/scientificreports www.nature.com/scientificreports/ We also used TCGA dataset to test the hypothesis of a loss of microbiota diversity along with disease progression. Nevertheless, no differences were observed across pairwise comparisons of cancer stages (I, II and III + IV) or COPD presence and absence. Stratification by cancer subtype did not alter the results ( Supplementary Fig. 1).
To evaluate the similarity of microbiota profiles weighted UniFrac distances were calculated. As plotted in PCoA (Fig. 3), LC communities are quite variable across samples and overlap between ADC and SCC. Nonetheless, some individual profiles appear to cluster and to be correlated with either ADC (upper right quadrant) or SCC (lower right quadrant). No other variable was found to aggregate groups of cases (results not shown).
Then, to gain a better insight into LC microbiota profiles we performed a hierarchical clustering of TCGA samples, followed by a graphic display of their lung communities at two taxonomic ranks -phylum and genus ( Fig. 4a-d). In the phylum analysis some heterogeneity among cases could be already witnessed, as indicated by upper tree clusters and subclusters. Higher abundances of Proteobacteria were connected with a first cluster (p_C1) and a second one (p_C2) could be divided into three major subclusters. Basically, these diverged in the relative proportions of common phyla: p_C2s1 was dominated by Actinobacteria; p_C2s2 had balanced frequencies of Proteobacteria, Actinobacteria, Firmicutes and Bacteriodetes and p_C2s3 was Firmicutes enriched. The analysis at the genus level depicted a larger complexity of lung microbiota, where cases often clustered into small groups showing long terminal branches. Notably, inside p_C1 two clusters contrasted with the remaining tree by their shorter terminal branches (p_C1/g_C1 and p_C1/g_C2; Fig. 4a). Whereas p_C1/g_C1 could be characterized by a community composed by prevalent genera such as Sphingomonas, Brevundimonas, Acinetobacter and Methylobacterium; p_C1/g_C2 could be defined by Enterobacter, Morganella, Kluyvera and Capnocytophaga. Interestingly, p_C1/g_C1 contained only ADC cases (N = 32), all of them located in the upper right quadrant of PCoA plot ( Supplementary Fig. 2). In contrast, p_C1/g_C2 included essentially SCC cases (89 in 94), this turn corresponding to the plot lower right quadrant ( Supplementary Fig. 2). In the p_C2s2 a single cluster emerged as less heterogeneous (p_C2s2/g_C1), in this instance, this could be correlated with high frequencies of Propionibacterium and mostly linked to ADC cases (32 in 42).
In the LEfSE analysis of ADC and SCC cases a total of 37 genera were detected to display contrasting correlations between LC subtypes (Fig. 5a). Precisely, for ADC the genera with higher LDA scores (>3.5) and extreme P-values (P < 5 × 10 −8 ) were Acinetobacter, Propionibacterium, Phenylobacterium, Brevundimonas and Staphylococcus. On the other hand, for SCC the genera fitting such requirements were Enterobacter, Serratia, Kluyvera, Morganella, Achromobacter, Capnocytophaga and Klebsiella (Supplementary Table 7). Interestingly, most of these bacteria could be correlated with previously identified clusters -p_C1/g_C1, p_C1/g_C2 and p_ C2s2/g_C1. A similar approach was used to address a possible contribution of bacteria into COPD, as a common co-morbidity to both ADC and SCC (Fig. 5b). Among the 12 taxa identified Achromobacter was the one most strongly correlated with airflow obstruction (LDA scores >3.5 and P-values ≤0.010; Supplementary Table 7).

Bacterial communities as prognostic biomarkers.
To investigate if our findings could have clinical potential, especially in a better stratification of LC cases, we compared the survival curves of previously identified clusters. In a first step, no significant differences were detected between p_C1 (Proteobacteria dominated) and p_C2s2 (intermediate abundances of common phyla), not even when separated by ADC and SCC. Yet, in the global comparison, and among SCC cases p_C1 cluster appears to be associated with a slower decay of survival rates (P = 0.076 and P = 0.089, respectively; Fig. 6). Several analyses were performed also in ADC, for p_C1/g_C1 (Acinetobacter/Brevundimonas community), p_C2s2/g_C1 (Propionibacterium community) and other cases, but all failed to reach compelling results possibly due to their low sample sizes. On the other hand, among SCC the p_C1/g_C2 cluster (Enterobacter community) was found to departure from the remaining p_C1 cases with a worse survival (P = 0.011), closer to the one observed in p_C2s2 cluster. Still, the strongest divergence in SCC survival rates was observed for non-p_C1/g_C2 (Proteobacteria dominated without Enterobacter community) and p_C2s2 (P = 0.006; Fig. 6). Interestingly, p_C1/g_C2 was the cluster associated with the highest mortality rate during follow-up (approximately 5000 days' maximum for SCC and 7500 days for ADC), and the one correlated with an increased number of deaths in tumor free patients (Table 1). Conversely, non-p_C1/g_C2 was shown   www.nature.com/scientificreports www.nature.com/scientificreports/ to display a reduced mortality even when the disease progressed negatively after primary therapy and a period of complete remission ( Table 1). The small size of p_C1/g_C1 and p_C2s2/g_C1 clusters prevented in ADC an accurate evaluation of the impact of these well-defined communities in patient outcome (Table 1).

Discussion
In this study, we performed a characterization of LC microbiota using two distinct datasets and methodological approaches: a pooled sequencing of 16S rRNA in BALF samples from Portuguese cases and controls; and a surveying of bacterial RNAseq reads made available through TCGA, for which tumor sections of hundreds of patients were collected. The main advantage of the first approach was to provide a preliminary and raw overview of lung microbiota at very low cost. However, this pooled approach has several limitations starting by its inability to address inter-individual variability and to accurately pinpoint bacterial communities to each individual. Another weakness is related to sample heterogeneity, which contains several cases lacking a complete histological classification and controls that include manly subjects with other pathologies. At last, surveying 16S rRNA can be considered also a shortcoming, once it is expected to introduce some ascertainment bias in taxa identification. This caveat is attributed, on one hand, to the differential annealing affinities of universal primers used in 16S rRNA amplification, and on the other, to the distinct resolving power of covered hypervariable regions 27,31 . Conversely, in the second approach, we could benefit from a larger cohort of ADC and SCC cases, for which detected RNAseq reads are more likely to represent an accurate composition of lung microbiota. The unique disadvantage of this strategy is that in order to maximize efficiency in mapping bacterial reads, we provided a database of reference genomes 28 . This was built using taxa identified in our 16S rRNA survey, combined with HMP data and published elsewhere [8][9][10][11][12][13][14][15][16][17][18][19][20][21][22] .
Overall, Proteobacteria emerged as the predominant LC phylum, a trend captured also in a large sample of cancer patients for which non-malignant tissue sections were collected 17 . Until now, increased frequencies of Proteobacteria were mostly correlated with asthma, COPD exacerbations and advanced COPD stages 8,9,32,33 . However, given current findings a Proteobacteria enrichment could also be a feature of cancerous lungs. In-depth surveys uncovered distinctive scores of Proteobacteria defining two major clusters: a first one truly dominated by Proteobacteria (p_C1); and another one displaying intermediate frequencies of Proteobacteria, Actinobacteria, Firmicutes and Bacteroidetes (p_C2s2). Altogether, these results are suggestive of substructure in lung microbiota that as far as we could investigate is not correlated with cancer subtype, or any other evaluated clinical variable.
Most impressively, p_C1/g_C2 cluster could be related with an overall poor survival if comparing SCC cases within p_C1 group. Furthermore, p_C1/g_C2 was characterized by several Enterobacteriaceae (Enterobacter, Morganella, Serratia, Klebesiela and Kluyvera), a taxon with recognized pathogenic potential causing airway www.nature.com/scientificreports www.nature.com/scientificreports/ infections in COPD 34 , colonizing bronchi of LC patients 35 and underlying nosocomial infections with resistance to antimicrobial molecules 36 . This cluster comprised also Achromobacter, another multidrug resistant microbe previously found in airway infections of cystic fibrosis patients and among subjects with solid malignancies 37 .
Moreover, as gram-negative bacteria, Enterobacteriaceae and Achromobacter synthetize lipopolysaccharides capable of stimulating host inflammatory responses. In this respect, Enterobacteriaceae overgrowth has been described as a key event of gut dysbiosis in obesity, Crohn's disease and colorectal cancer 37,38 . In asthma and cystic fibrosis there is also a growing body of evidence for a negative effect of Enterobacteriaceae 39,40 . Therefore, a contribution of p_C1/g_C2 community to an enhanced pro-inflammatory cancer microenvironment seems like a plausible hypothesis, once it is reported to foster tumorigenesis and promote lung cells malignant transformation 4,41 . This rational is supported by reports for a worse LC prognosis when patient bronchi are colonized by Enterobacteriaceae 35 .
Nonetheless, p_C1/g_C2 cluster was also linked with an increased mortality in absence of any tumor, which suggests an increased risk of this group to other non-cancer complications. Indeed, several studies already reported diverse pulmonary infections, septicemia and enhanced death rates after cancer resection and chemotherapy in LC subjects carrying potential pathogenic microorganisms in their airways 35,42,43 . Although we were unable to rigorously address the impact of p_C1/g_C2 community in the health status of a group of individuals probably debilitated by advanced age, co-morbidities (e.g. COPD, cardiovascular disease, diabetes, etc.) and inclusively cancer treatment, our findings advocate for a differentiated medical intervention in these patients, namely in the selection of antimicrobial therapies.
Still, the overall survival of p_C1/g_C2 does not differ from p_C2s2 cluster, which advances Actinobacteria, Firmicutes and/or Bacteriodetes as additional risk factors in SCC possibly through similar mechanisms of cancer progression. On the contrary, non-p_C1/g_C2 group appears to somehow tolerate new tumor events, which leads   www.nature.com/scientificreports www.nature.com/scientificreports/ to a decoupling of its survival curve from the rapid decline of p_C1/g_C2, and from the continuous decrease of p_C2s2.
Less remarkable findings were obtained for Brevundimonas/Acinetobacter (p_C1/g_C1) community, which did not diverge from other cases concerning ADC outcomes. However, if compared with SCC p_C1/g_C2 and p_C2s2 groups those disclosed a trend for extended survival or lifetime prognosis.
We also provide additional pieces of information for a burden of microbiota in pulmonary disease. In the TCGA cohort, cases showing COPD co-morbidity were linked to Achromobacter, thus highlighting a negative effect of this taxon in SCC and in airflow obstruction. However, COPD was not associated to any identified bacterial community in particular, nor to augmented prevalence of Moraxella and Haemophilus genera as frequently observed among these patients 8,12,22 . Noticeably, only 27% of cases were presented with lung function tests, and for those with COPD (11%) their majority was classified as mild or moderate cases (FEV > 50%; GOLD 1-2 stages). These features could explain the similar alpha diversity scores obtained for patients with or without COPD, once previous reports for a microbiota loss were centered in advanced cases 22 (for opposite results 19,40 ). More striking is the lack of differentiation across LC stages if taking into account former reports for a loss of diversity between tumor and non-tumor samples 17,18 . Nonetheless, there is no proof so far for a gradual decline of microbiota with cancer progression. In contrary, previous works uncovered higher alpha diversity values in advanced cases (IIIB and IV) than in earlier disease stages 18 . In our study, SCC cases were in average more diverse than ADC, a result that can be related to a heavier smoking load of these patients since cigarette and air pollutants were found to positively affect airway microbiota richness 17,23 . However, other unknown factors must play a role in LC bacterial colonization to explain the significant differences observed between SCC and ADC in heavy smokers. Moreover and consistently with a recent study carried out in mild to moderate COPD 12 , we also reported similar diversity levels across distinct lung regions (bronchial and peripheral lung), contradicting earlier findings for a microbiota differentiation in disparate lung anatomical regions 10 .
To our knowledge this work represents the largest scrutiny of LC microbiota. Shortly, we uncovered a predominance of Proteobacteria among cancerous lungs a feature shared with other airway disorders. However, Proteobacteria abundance is not universal and rather dictates a microbiota substructure independently of LC subtype, COPD co-morbidity, smoking history, age or lung region. In SCC, we found evidence for a differential effect of bacterial communities in patient survival, particularly when stratified into an Enterobacteriaceae cluster. Given that this taxon has documented complications in pulmonary illnesses, we proposed a contribution of this cluster to an inflammatory cancer microenvironment, as well as, for other post-operative and/or therapeutic non-cancer complications. Finally, we believe that the discovery of such well-defined communities may shed light into bacteria as promising LC biomarkers for patient stratification and as future prognostic tools or even as therapeutic targets.

Data Availability
The data used in this study is included in this manuscript and in supplementary material files. Additional files used to generate data analysis are available from the corresponding author on reasonable request.