Targeted next-generation sequencing of the 16S-23S rRNA region for culture-independent bacterial identification - increased discrimination of closely related species

The aim of this study was to develop an easy-to-use culture-free diagnostic method based on next generation sequencing (NGS) of PCR amplification products encompassing whole 16S-23S rRNA region to improve the resolution of bacterial species identification. To determine the resolution of the new method 67 isolates were subjected to four identification methods: Sanger sequencing of the 16S rRNA gene; NGS of the 16S-23S rRNA region using MiSeq (Illumina) sequencer; Microflex MS (Bruker) and VITEK MS (bioMérieux). To evaluate the performance of this new method when applied directly on clinical samples, we conducted a proof of principle study with 60 urine samples from patients suspected of urinary tract infections (UTIs), 23 BacT/ALERT (bioMérieux) positive blood culture bottles and 21 clinical orthopedic samples. The resolution power of NGS of the 16S-23S rRNA region was superior to other tested identification methods. Furthermore, the new method correctly identified pathogens established as the cause of UTIs and blood stream infections with conventional culture. NGS of the 16S-23S rRNA region also showed increased detection of bacterial microorganisms in clinical samples from orthopedic patients. Therefore, we conclude that our method has the potential to increase diagnostic yield for detection of bacterial pathogenic species compared to current methods.


UR2
Proteus mirabilis 10  Next generation sequencing (NGS) allows culture free detection of a theoretically unlimited number of pathogens and thus provides insight in the full microbiome. Since the introduction of benchtop sequencers, NGS is likely to become a diagnostic tool within the next few years in microbiological laboratories 3  analysis of large datasets requires a combination of bioinformatics skills and computational resources that is nowadays mostly absent in diagnostic (medical) microbiological laboratories. Furthermore, metagenomics approaches are time consuming as the turnaround time is approximately 4-5 days. To fill the gap between the conventional methods (culture and PCR) and metagenomics, a culture free approach using targeted NGS will be an excellent approach to detect and identify bacterial species as, compared to metagenomics, it is less complicated, and cheaper and therefore more likely to get implemented in diagnostic laboratories within a short timeframe.
The aim of the present study was to develop a rapid and easy-to use culture free diagnostic method based on NGS of PCR amplification products encompassing the whole 16S-23S rRNA region to improve the resolution of bacterial species identification. Moreover, the new method was compared with three commonly used identification methods. Finally, the feasibility of the new identification method to detect and identify bacterial species in clinical specimens was evaluated.

Results
Comparison of identification potential of the tested methods. The results of bacterial identification obtained by the four tested methods are shown in Supplementary Table S1. The rates of accurate identification to the species level using NGS of the 16S-23S rRNA region, Sanger sequencing of the 16S rRNA gene, Microflex MS and Vitek MS methods were 92.5%, 56.7%, 73.1% and 64.2%, respectively. The rates of accurate identification to the genus level using NGS of the 16S-23S rRNA region, Sanger sequencing of the 16S rRNA gene, Microflex MS and Vitek MS methods were 100%, 94.0%, 88.1% and 83.6%, respectively. At the species and genus level, assessment of the statistical significance of the differences in accuracy of the four methods for assigning species and genus showed statistically significant differences between NGS of the 16S-23S rRNA region versus all other tested methods for all bacterial isolates (P < 0.05). The only exception was observed between the two sequencing methods at the genus level (P = 0.2482). Moreover, the Sanger sequencing of the 16S rRNA gene was significantly more discriminative at the genus level than Vitek MS (P < 0.05).

Urine samples. Sixty fresh urine samples from patients suspected for urinary tract infections (UTIs) were
subjected to conventional culture-based methods and culture-independent NGS of the 16S-23S rRNA region. Bacterial culture reported the growth of predominant one or two microorganisms at ≥10 4 CFU/ml as possible cause of the suspected UTI. Growth of <10 4 CFU/ml or growth without predominant microorganism was considered inconclusive for UTI. NGS of the 16S-23S rRNA region identified from 1 to 47 different bacterial species in each PCR-positive sample (Table 1). There was an observed association between the increasing number of bacterial genera/species identified using NGS of 16S-23S rRNA region and decreasing CFU value, median: ≥10 5 CFU -2 species/genera; 10 4 CFU -5 species/genera; 10 3 CFU -5 species/genera; 10 2 CFU -20 species/genera. Twenty-nine samples were reported as clinically significant. Among them NGS of 16S-23S rRNA region identified only one bacterial species in 11 samples (UR34, UR35, UR37-UR44, UR50). In the majority of these cases (n = 10) pathogenic species identified by NGS of 16S-23S rRNA region were also identified as the cause of infection by conventional culture methods. The only exception was sample UR44, for which NGS of the 16S-23S rRNA identified solely K. pneumoniae, but culture also identified Proteus mirabilis with 10 3 CFU/ml. In 12 samples, in which 2 or more species were identified by NGS of the 16S-23S rRNA region (UR3-UR5, UR7-UR9, UR33, UR46, UR48, UR49, UR53, UR55), contigs representing the culture identified pathogenic species consisted of the highest number of reads (53.7-99.1% of all reads). For 3 samples (UR2, UR9, UR59) NGS of the 16S-23S rRNA region showed improved identification compared to conventional culture. In sample UR2 conventional identification showed only the presence of P. mirabilis, however the NGS-based method also revealed the presence of the genetically related species; Proteus vulgaris. In the sample UR9, Klebsiella oxytoca was identified by NGS of 16S-23S rRNA region additionally to E. coli. Colonies produced by these two microorganisms can be difficult to distinguish due to similar morphology on agar plates. In the sample UR59 possible misidentification by conventional culture method could occur. In this sample K. pneumoniae was identified by Vitek MS and Klebsiella variicola by NGS of 16S-23S rRNA region. It was shown very recently, since K. variicola is closely related to K. pneumoniae, it is difficult to distinguish between these two species by commonly used methods, including Vitek MS 9 . In the remaining 4 positive samples (UR1, UR36, UR52 and UR54) bacterial species that were identified by conventional culture methods as the cause of infection were also found by the NGS of 16S-23S rRNA method but were not predominant. Usually, only contigs containing commensal organisms consisted of higher number of reads. The urine samples classified by culture as "no clinical significance" (n = 31) contained numerous bacterial species, which probably represented the commensal flora. PCR-negative samples (n = 13) were only those samples with a growth density of 10 2 CFU/ml.

Blood stream infections.
Among the 23 positive blood culture samples, conventional culture methods and the NGS of the 16S-23S rRNA region approach produced the same identifications for 20 samples (  Orthopedic infections. Of the 21 clinical samples of orthopedic patients 18 were found to be culture negative and three samples were not cultured. In 5 samples (KM2, KM7, KM12, KM16 and KM20) the NGS-based method was able to detect only eukaryotic DNA and in 2 samples (KM8 and KM10) it yielded non-interpretable results most likely because of template degradation. In the remaining 14 samples the number of microorganisms detected in the orthopedic samples ranged from 1 to 3 different genera/species (Table 3). The only exception was sample KM9 in which 11 different genera/species were found. In our study, orthopedic samples had a low amount of starting material, so they were especially prone to be swamped by the contaminating DNA and result in misleading results. All negative controls contained Pseudomonas fluorescens. In a single negative control, Propionibacterium acnes was also found. Therefore, in orthopedic samples P. fluorescens (present in KM11 and KM19) and P. acnes (KM1, KM5, KM9, KM11, KM13, KM14, KM18 and KM19) were regarded as contamination introduced during sample preparation. Moreover, several bacterial genera previously reported as contamination of DNA extraction kits, PCR and other laboratory reagents were absent in all negative controls but present at orthopedic samples, including Acinetobacter, Bacillus, Chryseobacterium, Enhydrobacter, Paracoccus, Psychrobacter and Undibacterium 10 . These bacteria can also be potentially considered as contaminations of the 16-23S rRNA NGS process introduced during sample processing.

Reproducibility.
To assess the reproducibility of the method based on NGS of the 16S-23S rRNA region, 6 urine samples were used: 2 samples with bacterial count level at 10 2 CFU/ml, single samples with bacterial count levels at 10 4 and 10 3 CFU/ml, and 2 samples with bacterial count level at ≥ 10 5 CFU/ml ( Table 4). The samples were analyzed in triplicate by 3 different operators, including independent PCR amplification, PCR product purification, and NGS library preparation. Subsequently, two independent MiSeq sequencing runs were conducted. The first replicate of each sample was sequenced in the first run, while the second and third replicate of the tested samples were sequenced in parallel in the second run. When an organism identified was represented by 5% or more reads in at least one replicate of a sample, this organism was always detected in its remaining two replicates ( Table 4). The only exception was Undibacterium oligocarboniphilum, previously reported as a contaminant in DNA extraction kits, and in PCR and other laboratory reagents 10 . In the samples with bacterial count level at ≥10 5 CFU/ml, the method achieved 100% of reproducibility with respect to the bacterial composition. Moreover, in all replicates of a sample with a bacterial count level at ≥10 3 CFU/ml always the same organism was predominant and represented by 50.9-100% of reads.

Discussion
As defined in the CLSI guidelines, a species identification can be assigned when the max score is 99% or higher and if the sequence similarity between best and second best species are greater than 0.5% using DNA target sequencing 11 . However, species identification using CLSI's criteria for the 16S rRNA gene sequences was often weak because the criteria of distance scores greater than 0.5% to the next closest species was not met for most strains. In such instances, only identification to the genus level was feasible. Therefore, Park et al. 12 proposed a modified CLSI (mCLSI) method which was more practical and pragmatic for identification of species based on 16S rRNA sequences than the CLSI method. The mCLSI method assigns bacterial species when the similarity score is 99% or higher but irrespective of the similarity score differences. In our study, we applied the similarity score differences with the next closest species as ≥0.2%, which reflected at least 3 and 9 nucleotides difference by sequencing the 16S rRNA gene or the 16S-23S rRNA region, respectively. It allowed elimination of misidentification of very closely related species like S. oralis and S. mitis (Table S1) which in their 16S rRNA gene sequence differ only by 1 nucleotide. Using Sanger sequencing of the 16S rRNA gene, species was assigned with the similarity score 99.4% or higher (Table S1). Using NGS of the 16S-23S rRNA region, in a great majority of cases (n = 61), species were assigned with a similarity score above 99.2%. The exceptional 6 species with lower scores showed a similarity ranging from 96.2% to 98.8%. This could be caused by the facts that (i) a limited number of 16S-23S rRNA sequences were available in databases and (ii) the CLSI guidelines were developed for comparison of gene sequences and did not include the intergenic regions which for some genera can be endowed with higher variation. We believe that with increasing number of deposited 16S-23S rRNA sequences, it will be always possible to assign bacterial species with similarity scores of 99% or higher. We assigned bacteria to the genus level by NGS of 16S-23S rRNA region in clinical samples when the similarity score was at least 90%. This value was determined based on the results produced during comparison of the four identification methods on pure cultures (Table S1). The 16S-23S rRNA sequence alignments produced the lowest identity for second closest match within the same genus for: Actinobaculum 88% (Table S1; sample 6), Lactobacillus 89% (sample 30), Corynebacterium 91% (samples 14, 17 and 18). We applied the uniform cut-off equal to 90% for all bacteria in clinical samples. However, defining 16S-23S rRNA sequences for all or at least majority species within a genus will create interpretive criteria for defining the genus and probably will vary according to the queried microorganism.
The 16S-23S rRNA sequences can differ by length even among highly genetically related species. It is caused by size variation in 16S-23S rRNA intergenic spacer region (ISR). Staphylococcus aureus clonal complex 75 has been recently renamed as Staphylococcus argenteus and now is a novel species of the Staphylococcus genus 13 . S. argenteus showed identical or nearly identical 16S and 23S rRNA gene sequences to S. aureus but differed substantially by the ISR length. The scoring system of BLAST had been designed not to allow large gaps. The BLAST algorithm produced a set of smaller separate alignments, with the longest alignment encompassing the 23S rRNA gene (about 2, 5 kb in size). The lack of an alignment for the full-length sequences allowed distinguishing S. aureus from S. argenteus.
In our study all negative controls contained P. fluorescens. We, however, only found this microorganism in the orthopedic samples (Table 3) and not in the urine and blood samples (Tables 1 and 2). Orthopedic samples were characterized with lower starting microbial material compared to that of the urine and blood samples. This showed that the impact of contaminating sequences is greater in low biomass samples. Moreover, P. fluorescens had not been associated previously with human infections and its presence can be clearly regarded as contamination. However, in a single negative control, which was introduced to monitor the impact of contaminations on identification procedure of the orthopedic samples, P. acnes was also identified. P. acnes is a common human skin-associated organism and had been previously shown as the cause of orthopedic infections 14,15 . It will be highly important to limit the impact of such contaminations as P. acnes during samples handling as this is also a clinically significant microorganism and its contribution to an infection can be misinterpreted.
NGS of the 16S-23S rRNA region will provide enhanced information on the presence of microbial DNA within a clinically relevant time frame which is necessary for timely and accurate treatment of infections and therefore key for proper infection and antibiotic resistance control. Data of this study showed that this method correctly identified bacterial species that were identified as the cause of infection by conventional culture methods. In general, in the culture-positive samples a few additional species/genera per sample were identified by the NGS method. However, contigs representing the culture identified pathogenic bacterial species had the highest number of sequence reads. To assess the clinical relevance of the identified species/genera in samples with a low amount of bacterial DNA, a prospective clinical validation study will be carried out including samples of complex patient groups (e.g. orthopedic patients with a clinical suspicion of a prosthetic joint infection) and samples of control groups of patients without a clinical suspicion of an infection.
Currently, NGS of the 16S-23S rRNA region suffers from two major limitations. First limitation is a lack of reference database with the 16S-23S rRNA sequences and a complementary software allowing easy and reliable species identification. Second major limitation of the method is a lack of reference sequences for many bacterial species. We currently work on the development of a reference data set for assigning clinically relevant bacterial species based on the 16S-23S rRNA sequences. The quality and amount of data accumulated in the databases is particularly important for the performance of bacterial identification using sequencing analysis of the 16S-23S rRNA region. Also, this lack of reference 16S-23S rRNA sequences in the GenBank database might have potentially introduced bias in the results when comparing 16S rRNA with 16S-23S rRNA sequences.
As a proteomic tool for microbial identification, MALDI-TOF MS is superior to NGS-based methods in cost and speed. Currently, the total costs of all reagents and consumables for NGS of 16S-23S rRNA region (not including labor) amount to ~70 € per sample with turnaround time of ≤4 days. However, NGS of the 16S-23S rRNA region was culture independent and more significantly discriminative at the species and genus level than MALDI-TOF MS approaches. Moreover, NGS of 16S-23S rRNA region, unlike all other mass spectrometry methods described above, can be extremely useful for identification of rare or unknown bacteria and bacteria with unusual phenotypic profiles.
In summary, the main objective of the present study was to develop a new diagnostic method based on NGS of 16S-23S rRNA region and assess its identification potential. Its resolution power was found to be superior to the other identification approaches commonly used in routine clinical microbiology laboratories; moreover, the method was easy in use. The method correctly identified urinary tract pathogens and blood stream pathogens previously identified as the cause of UTI and blood stream infection (BSI) with conventional culture. NGS of the 16S-23S rRNA region also showed increased sensitivity in the diagnosis of bacterial microorganisms in samples collected from patients suspected of orthopedic infections. In the analyzed samples several bacterial species had been previously reported as the cause of orthopedic infections, including Haemophilus parainfluenzae 16 , Moraxella osloensis 17 and S. epidermidis 18 . Therefore, we conclude that our approach has the potential to increase diagnostic yield and will decrease time to result for detection of unexpected bacterial pathogens and bacterial species compared to current methods, thereby improving targeted antibiotic treatment. Furthermore, there is a huge potential of this method for detection of bacterial pathogens that can not be cultured at all, due to VNC state or due to antibiotics prior to collection of sample. Finally, with our method it will be possible to streamline processes in the laboratory and to implement it in several disciplines, like clinical, environmental and veterinary microbiology. However, this approach needs further validation and determination of its sensitivity. Furthermore, studies focused on the clinical relevance are necessary for determining the applicability of this NGS-based approach in routine diagnostics.

Materials and Methods
Ethics. All procedures were carried out according to guidelines and regulations of Certe concerning the use of patient materials for the validation of clinical methods, which are in compliance with the guidelines of the Federation of Dutch Medical Scientific Societies (FDMSS). The project was approved by the Certe medical staff under project submission 3305-0037. Every patient of the Certe is informed that samples taken may be used for research and publication purposes unless they indicate that they do not agree to it. Informed consent was obtained from all individuals or their guardians prior to study participation. All samples were used after performing and completing a conventional microbiological diagnosis and were coded to protect patients confidentiality.
Sequencing of the 16S-23S rRNA region and data analysis. NGS generated 25,000-50,000 or 1-2 million sequencing reads for pure culture or clinical sample, respectively, to obtain a minimum coverage of 1000 per sample. The 300-nucleotide paired-end reads were de novo assembled into contigs with the SeqMan NGen software (DNASTAR) using parameters: mer size 31 nucleotides and minimum match percentage 93%. The sizes of resulting contigs produced during analysis of pure cultures ranged from 4183 bp (Bacillus cereus) up to 4856 bp (Acinetobacter parvus) with an average size of 4423 bp (Table S1). In case of clinical samples, most often an identified bacterial species/genus was represented by a single contig of expected size, around 4.5 kb. However, in some instances only smaller contigs (ranging between 1 and 3 kb in size) represented the same bacterial species/genus present in a sample. In those cases all contigs belonging to the same organism were combined and their reads were added up. Species identification was based on alignment of contig sequences with 16S-23S rRNA sequences deposited in the GenBank database using nucleotide BLAST (Basic Local Alignment Search Tool, http://www. ncbi.nlm.nih.gov/BLAST/). When a reference 16S-23S rRNA sequence was not available in the database, a reference 16S rRNA gene sequence was used for a species identification. The contig sequences were submitted via the website and bacterial species was assigned when the similarity score was 99% or higher and the similarity score differences of the first match with the next closest species was equal to or greater than 0.2%. This reflected a 9 or more nucleotides difference for a sequence of 4423 bp (the average size of the 16S-23S rRNA amplicon). When the similarity score was between 90% and 99%, the genus could be assigned. The score below 90% was interpreted as an unidentified organism.
Statistical analysis. The four bacterial identification methods were compared using McNemar's test, a test of paired proportions. Only a single strain per species was taken (from Table S1) for calculation of the accuracy of bacterial identification at the species or genus level according to McNemar's test. When the P value was less than 0.05, we concluded that there was a significant difference between the methods. Statistical analysis was performed using the GraphPad software.