Refinement of microbiota analysis of specimens from patients with respiratory infections using next-generation sequencing

Next-generation sequencing (NGS) technologies have been applied in bacterial flora analysis. However, there is no standardized protocol, and the optimal clustering threshold for estimating bacterial species in respiratory infection specimens is unknown. This study was conducted to investigate the optimal threshold for clustering 16S ribosomal RNA gene sequences into operational taxonomic units (OTUs) by comparing the results of NGS technology with those of the Sanger method, which has a higher accuracy of sequence per single read than NGS technology. This study included 45 patients with pneumonia with aspiration risks and 35 patients with lung abscess. Compared to Sanger method, the concordance rates of NGS technology (clustered at 100%, 99%, and 97% homology) with the predominant phylotype were 78.8%, 71.3%, and 65.0%, respectively. With respect to the specimens dominated by the Streptococcus mitis group, containing several important causative agents of pneumonia, Bray Curtis dissimilarity revealed that the OTUs obtained at 100% clustering threshold (versus those obtained at 99% and 97% thresholds; medians of 0.35, 0.69, and 0.71, respectively) were more similar to those obtained by the Sanger method, with statistical significance (p < 0.05). Clustering with 100% sequence identity is necessary when analyzing the microbiota of respiratory infections using NGS technology.

However, several problems have been highlighted regarding ASV analysis [21][22][23] , such as the possibility that species of the genus Streptococcus represented by Streptococcus pneumoniae, which are important in respiratory infections, are underestimated 21 . Therefore, further validation of the method is required before use in the assessment of pneumonia with aspiration risks in which Streptococcus species are frequently detected. OTU analysis has been utilized in various fields of research 14,24,25 , and verification of the accuracy of this method is required.
A disadvantage of performing clone library analysis using the Sanger method is the reduced analysis capacity per operation compared to NGS technology; however, the Sanger method can read comparatively long sequences (700-1000 bp) with high accuracy (99.999%) 26 relative to NGS and can estimate the species level for some bacteria 27 . The combination of the clone library and Sanger methods can be used to estimate predominant bacterial species without the use of culture-based methods in clinical specimens. We have reported the usefulness and significance of this approach in evaluating respiratory infections [28][29][30][31][32][33] .
The genus Streptococcus, including Streptococcus pneumoniae or Streptococcus intermedius, is a major causative agent of bacterial pneumonia in elderly people 30,34 and in patients with lung abscess 31 . Streptococcus pneumoniae is generally the major causative agent in respiratory infections, but species such as, oral streptococci, Streptococcus pseudopneumoniae 35,36 , and S. intermedius 37,38 are important in estimating antibiotic resistance and the clinical course of the disease. The nucleotide sequences of 16S rRNA regions of these organisms, including S. pneumoniae and Streptococcus mitis, are very similar 27 , and thus correct nucleotide sequence identification is crucial; however, it is unclear whether these bacteria can be distinguished by targeted NGS analysis.
Respiratory infections exhibit diverse bacterial flora even in each sample directly obtained from the lung, which are difficult to characterize using commercially available bacterial cultures. Therefore, we retrospectively evaluated bronchoalveolar lavage fluid (BALF) samples from patients with pneumonia with aspiration risks and from patients with lung abscess, in which the genus Streptococcus plays an important role, to compare the results of the Sanger method combined with the clone library method and NGS methods. We also investigated the optimal threshold for clustering 16S rRNA gene sequences into OTUs by NGS technology by comparing the results of NGS technology with those of the Sanger method.

Methods
Study population. This was a retrospective study targeting 177 patients with community-onset pneumonia (community-acquired pneumonia and healthcare-associated pneumonia) and 61 patients with lung abscess on whom a bronchoscopy examination was performed at the University of Occupational and Environmental Health, Japan, and affiliated community hospitals between April 2010 and March 2016. Most patients were included in previous studies 30,31 . Patients less than 65 years of age and those who did not meet the definition of pneumonia with aspiration risks and lung abscess were excluded 30,31 . Two cases were also excluded during quality verification of the BALF samples. Finally, 45 of 177 patients with pneumonia with aspiration risks and 35 of 61 patients with lung abscess were included in this study (Fig. 1). This study was approved by the Ethics Committee of Medical Research, University of Occupational and Environmental Health, Japan (UOEHCRB18-016). All experiments were performed in accordance with relevant guidelines and regulations. The need for written informed consent was waived by the Ethics Committee of Medical Research, University of Occupational and Environmental Health, Japan because of the retrospective study design.
Definitions. Community-onset pneumonia was defined according to the guidelines of the Infectious Diseases Society of America/American Thoracic Society 39 . Aspiration risks were defined according to the criteria of   40 as in our previous reports 30,31 and included neurologic dysphagia, disruption of the gastroesophageal junction, or anatomical abnormalities of the upper aerodigestive tract. Lung abscess was diagnosed when the following three criteria were fulfilled: (a) the presence of new areas of infiltrations with a cavity or low-density area within the infiltrates on chest radiographs and/or computed tomography; (b) new clinical findings including at least two of the following: fever, sputum production, coughing, chest pain, and leukocytosis (peripheral white blood cell count ≥ 10,000/μL), and (c) exclusion of other causes 31 .
Sample collection. Fiberoptic bronchoscopy was performed as previously described 29 . Briefly, BALF samples were obtained using 40 mL of sterile saline from the affected lesions of patients with pneumonia or lung abscess.
Cell lysis efficiency analysis and DNA extraction. DNA samples were extracted from the BALF samples by adding sodium dodecyl sulfate (final concentration: 3.0%) and glass beads followed by vigorous shaking, as previously described 29 .
Microbiota analysis using the Sanger method. The partial 16S rRNA gene fragments (approximately 550 bp) were amplified by polymerase chain reaction (PCR) with universal primers (E341F and E907R), as previously described 29 . The amplified products were cloned into Escherichia coli using a TOPO TA cloning kit (Invitrogen, Carlsbad, CA, USA). Nucleic acid sequences of 96 randomly selected clones in each clone library were determined using a 3130xl Genetic Analyzer (Applied Biosystems, Foster City, CA, USA). Highly accurate sequences selected by Phred quality values were trimmed from the primer and vector regions, and only sequences with good quality were used for analysis 29 . NGS library preparation and Illumina MiSeq sequencing. Extracted DNA samples were purified using Agencourt AMPure XP (Beckman Coulter, Brea, CA, USA). Two-step PCRs were performed on the purified DNA samples to obtain sequence libraries. The first PCR was an amplification step performed using a 16S (V3-V4) Metagenomic Library Construction Kit for NGS (Takara Bio, Shiga, Japan) with primer pairs, 341F (5′-TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG CCT ACGGGNGGC WGC AG-3′) and 806R (5′-GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GGG ACTACHVGGG TWT CTAAT-3′), corresponding to the V3-V4 region of the 16S rRNA gene. Thermal cycling was performed in a TaKaRa PCR Thermal Cycler Dice Gradient under the following conditions: initial denaturation at 94 °C for 1 min, followed by 28 cycles of denaturation at 98 °C for 10 s, annealing at 50 °C for 15 s, and extension at 68 °C for 15 s. PCR amplicons were purified using AMPure XP magnetic purification beads.
The second PCR was performed to add the index sequences for Illumina sequencing using a Nextera XT Index kit v2 (Illumina, San Diego, CA, USA) and Tks Gflex DNA polymerase S (Takara Bio). Thermal cycling was performed in a TaKaRa PCR Thermal Cycler Dice Gradient (Takara Bio) with the following conditions: initial denaturation at 94 °C for 1 min, followed by 8 cycles of denaturation at 98 °C for 10 s, annealing at 60 °C for 15 s, and extension at 68 °C for 15 s. PCR amplicons were evaluated for quality verification of the sequence library using an Agilent 4200 TapeStation (Agilent Technologies, Santa Clara, CA, USA).
After quality verification, mixed samples were prepared by pooling approximately equal amounts of each amplified DNA and sequenced using MiSeq Reagent Kit V3 (250 × 2 cycles) and a MiSeq sequencer (Illumina), according to the manufacturer's instructions. The obtained nucleotide sequences were subjected to assembly and clustering processes using the CD-HIT-OTU algorithm 41 . First, low-quality reads (reads containing > 10 bases with an error rate ≥ 10%, more than 4 mismatches in the overlap region, or non-overlapping paired reads) were removed and assembled. The assembled sequences with 100% homology were clustered, and chimera sequences were removed. In addition, clusters with a low frequency (39 contigs or less) of occurrence estimated using simulations from the maximum cluster size were removed. The nucleotide sequences obtained were clustered based on homologies of 100%, 99%, and 97%. A representative sequence was selected; those of OTUs showed the longest lengths.
Homology searching. In the Sanger method, the resulting high-accuracy reads were directly homology searched, whereas for NGS technology, homology searching was performed for representative sequences of each OTU. The determined sequences were compared with an in-house database containing the 16S rRNA gene sequences of 5,878 type strains using the basic local alignment search tool (BLAST; NCBI, Bethesda, MD, USA) algorithm 42 . Bacterial type strains that were top hits with ≥ 97% homology identified using BLAST were classified as presumptive species 43 , and those with < 97% homology were classified as unidentified organisms [29][30][31]44 . The 16S rRNA gene sequences of type strains were obtained from the DNA Data Bank of Japan (http:// www. ddbj. nig. ac. jp/) and the Ribosomal Database Project (http:// rdp. cme. msu. edu/).
Interpretation of the results. The bacterial phylotype exhibiting the highest proportion of the total microbiota in each BALF sample was defined as the "predominant phylotype. " In addition, phylotypes that occupied less than 5% of each microbiota were classified as "Others, " as previously described 29 .
Bray Curtis dissimilarity. The Bray Curtis dissimilarity was calculated for each specimen to assess the degree of difference between the results of the Sanger method and NGS technology. This analysis is a standardized index that has a value of 0 when the species composition is similar between two microbiota and a maximum Comparison of the results of microbiota analysis between the Sanger method and NGS. There were 6,530 total clones, and the mean nucleotide sequence length using the Sanger method was 544.3 bases (per read). NGS generated 7,637,916 pairs of reads, and the total number of contigs after filtering and assembling was 4,103,775 with a mean nucleotide sequence length of 459.5 (see Supplementary Table S1 online). OTUs were created with 100%, 99%, and 97% clustering thresholds for subsequent analysis. The total numbers of OTUs clustered at 100%, 99%, and 97% identity were 3,663, 493, and 255, respectively. The rarefaction curves of the number of OTUs identified per sample almost plateaued with ~ 5000 sequence reads (see Supplementary Fig. S1 online).

Discussion
In this study, we investigated the optimal threshold for clustering 16S rRNA gene sequences into OTUs using NGS technology (MiSeq sequencer; Illumina) by comparing the results with those obtained using the Sanger method. We demonstrated the difficulty in distinguishing the bacterial phylotype of the S. mitis group, which is the most common bacteria in respiratory infections, when the clustering threshold was set to low (< 99% OTUs). When analyzing the microbiota of respiratory infections by NGS technology, particularly in bacterial pneumonia, clustering at 100% sequence identity is necessary. Generally, S. pneumoniae is the main causative agent of community-onset pneumonia 47,48 , but we previously reported that oral streptococci may also play an important role in elderly people with pneumonia with aspiration www.nature.com/scientificreports/ risks 30,31 . We detected S. pseudopneumoniae by NGS technology as the predominant phylotype in all nine cases in which S. pneumoniae phylotypes were detected by the Sanger method (Figs. 2 and 3). It was not possible to distinguish between S. pneumoniae and S. pseudopneumoniae because these two bacterial phylotypes were identical in the NGS amplification region (V3-V4) used in this study (see Supplementary Fig. S2 online). Evaluating longer regions, such as the V3-V5 region, using the MiSeq sequencer has been reported to reduce the accuracy of the second half of the read 49 . The Sanger method revealed that S. pseudopneumoniae dominated in one case (1.3%). Thus, the Sanger method is a better method for discriminating between these two species when examined in respiratory infections in the clinical setting. NGS analysis with 99% and 97% homology revealed S. pseudopneumoniae as the predominant phylotype in 21 cases (26.3%) and 22 cases (27.5%), respectively (Figs. 2 and 3). In addition, the nine cases in which S. oralis was detected as the predominant phylotype by the Sanger method showed a concordance rate of 88.9% (8/9) with the results of NGS when the clustering threshold was set to 100%. However, at clustering thresholds under 100%, S. pseudopneumoniae was identified in all nine cases. Streptococcus oralis was identified as a minor population when the clustering threshold was above 99%. These findings indicate that the phylotype distinction depends on the clustering threshold; therefore, it is necessary to set the clustering threshold at 100% to estimate bacterial phylotypes more accurately.
Clustering of 16S rRNA sequences into OTUs removes minor artifactual sequence variants due to PCR amplification and sequencing errors when collapsing sequences into groups 16 . In this study, when the clustering threshold was set to 100%, the number of bacterial species defined as unidentified organisms increased. It is not possible to evaluate whether the bacterial species included in the unidentified organism group are an actual minor population or artifact. Using a high threshold, such as 100%, will result in the exclusion of samples because in several instances the number of reads falls considerably below the threshold in downstream processing; however, this can be compensated by re-sequencing the samples.
When the clustering threshold was set to 97%, some bacterial phylotypes, such as Corynebacterium spp. differed in the results of the Sanger method and NGS technology at the species level. Generally, the genus Corynebacterium is often regarded as an oral contaminant. In recent years, pneumonia due to Corynebacterium striatum has been reported 50,51 , and we also reported the importance of the genus Corynebacterium in pneumonia 29,44 , particularly when hospital-acquired 52 . The genus Corynebacterium has been reported to have different drug sensitivities depending on the bacterial species; therefore, species-level identification is desirable when Corynebacterium is suspected as the causative agent of pneumonia 53 . When the clustering threshold was set to 97%, Corynebacterium simulans and C. striatum were classified into the same OTUs. Comparison of the nucleotide sequences of the type strain (C. simulans, accession No. AJ012837; C. striatum, accession No. X84442) in the V3-V4 amplification region for these two strains showed mismatches of three base pairs and homology www.nature.com/scientificreports/ of more than 99%. Therefore, when the clustering threshold was set to 97%, C. simulans and C. striatum were classified into the same OTUs by NGS technology. There were several limitations to this study. First, this study was retrospective and included only patients with sufficient specimens available for analysis. Second, there was a difference in the analysis primer region between NGS technology and the Sanger method. Third, bacterial species with abundances of 5% or less were treated as "Others" as previously reported 29 , but there is no clear definition of the makeup of this group. In this study, we did not examine the minor population, but pathogenic bacteria, such as Mycobacterium tuberculosis may be included; thus, it is necessary to examine the minor population in further studies. Fourth, accurate identification of bacterial species is difficult via analysis of the 16S rRNA gene sequence alone. However, predicting the bacterial species is highly useful in deciding treatment strategies such as antibiotic selection. Therefore, it is important to utilize methods based on 16S rRNA gene sequencing that can help evaluate microbiome information as accurately as possible.
Microbiome analysis using NGS technology will continue to progress through technological innovation. When analyzing the microbiota of respiratory infections using NGS technology, it is necessary to cluster with a 100% sequence identity. Alternatively, it is necessary to use an analysis method that can maintain high accuracy for sequencing analysis, such as that for ASVs. If the analysis is performed using a low threshold on sequence clustering, important pathogens may not be identified, and/or incorrect phylotype information may be obtained. www.nature.com/scientificreports/ Figure 6. Similarity of the Sanger method and NGS technology using Bray Curtis dissimilarity in the eight pneumonia cases in which Streptococcus mitis group (other than pneumococcus) was detected predominantly by the Sanger method. Bray Curtis dissimilarity revealed that OTUs obtained at 100% clustering threshold were more similar to those obtained by the Sanger method than did those obtained at 99% and 97% thresholds (p < 0.05). P-values were determined by Kruskal-Wallis test and a post hoc Dunn's test with Bonferroni adjustment for multiple comparisons. NGS (100%) = NGS results obtained with a clustering threshold of 100%; NGS (99%) = NGS results obtained with a clustering threshold of 99%; NGS (97%) = NGS results obtained with a clustering threshold of 97%.

Data availability
The sequence read data were submitted to a public database (DNA Data Bank of Japan Sequence Read Archive under the BioProject identifier PRJDB11117).