Phylogenomic analysis of Clostridioides difficile ribotype 106 strains reveals novel genetic islands and emergent phenotypes

Clostridioides difficile infection (CDI) is a major healthcare-associated diarrheal disease. Consistent with trends across the United States, C. difficile RT106 was the second-most prevalent molecular type in our surveillance in Arizona from 2015 to 2018. A representative RT106 strain displayed robust virulence and 100% lethality in the hamster model of acute CDI. We identified a unique 46 KB genomic island (GI1) in all RT106 strains sequenced to date, including those in public databases. GI1 was not found in its entirety in any other C. difficile clade, or indeed, in any other microbial genome; however, smaller segments were detected in Enterococcus faecium strains. Molecular clock analyses suggested that GI1 was horizontally acquired and sequentially assembled over time. GI1 encodes homologs of VanZ and a SrtB-anchored collagen-binding adhesin, and correspondingly, all tested RT106 strains had increased teicoplanin resistance, and a majority displayed collagen-dependent biofilm formation. Two additional genomic islands (GI2 and GI3) were also present in a subset of RT106 strains. All three islands are predicted to encode mobile genetic elements as well as virulence factors. Emergent phenotypes associated with these genetic islands may have contributed to the relatively rapid expansion of RT106 in US healthcare and community settings.


RT106 isolates are virulent in an animal model of infection.
Prior to detailed characterization of RT106 isolates, we verified the virulence of the representative strain GV599 in the Golden Syrian hamster model of acute C. difficile infection. All infected animals succumbed to disease within 6 days of spore inoculation (Supplemental Fig. S1a). Microscopy-based visualization of colonic tissue sections revealed classic C. difficile infection pathology including gross hemorrhage, epithelial erosion and inflammatory infiltrates (Supplemental Fig. S1b).
RT106 strains harbor one clade-specific novel genetic element. Whole genome sequencing was performed on all 38 RT106 strains recovered in our surveillance (Supplemental Table S2), and data were compared to 1425 publicly available C. difficile strain sequences. Based on single nucleotide polymorphism (SNP) analyses 23,24 , the strains were not clonal, and the two closest-related isolates (GV597 and GV753) were divergent by 113 SNPs. Overall, RT106 genomes were most-closely related to RT002 strains 25 .
Our 38 RT106 strains mapped closely to 33 previously sequenced RT106 strains from pediatric patients 26,27 and 23 other strains of unknown ribotype (highlighted in red in Fig. 2b). Evolutionary analysis of the 94 strains containing the entire GI1 was performed using MEGA X (Fig. 2c). We performed in silico ribotyping on the 23 strains, and 13/23 (those with currently available closed genome sequence) generated a clear RT106 PCR fragment pattern. For an additional assessment of genome relatedness, we performed in silico Multi-Locus Sequence Typing (MLST) on all 94 strains; this method differentiates organisms into Sequence Types [STs 28 ]. 92/94 strains were sequence type ST42, whereas 2/94 belonged to the closely-related sequence type ST28 29 . Taken together, all 94 strains interrogated in these analyses grouped together in a distinct RT106 clade (Fig. 2b,c) 30 .
The 46 kb genomic island 1 is unique to RT106/ST42/ST28 strains. BLASTN analysis of the 46 kb GI1 against 1425 publicly available C. difficile genome sequences at the NCBI database resulted in the identification of 265 C. difficile strains that contain either segments (> 7.7 kb, 98% identity) of or the entire genomic island. We concomitantly performed in silico MLST analysis to determine the respective sequence types, and then generated a maximum likelihood tree based on the core genome SNPs of 265 C. difficile strains harboring segments of or the entire G1 using Mega X 35 . GI1-related genes found in each strain were annotated based on gene function. Only RT106/ST42/ST28 strains harbor the complete 46 kb GI1, while other ST strains included in this analysis contain only shorter segments of the genomic island (Fig. 3). A 7.1 kb gene segment (demarcated within a black dashed box; Fig. 3) is common to all MLST sequence type strains shown. SNP analysis was performed on the 7.1 kb gene segment to generate a molecular clock of GI1 via Mega-X 36 using maximum likelihood (ML) approach (Fig. 4). The molecular clock revealed gradual and progressive acquisition of gene elements in different strains, finally leading to an intact GI1. CD105KSE6, which branches most distantly from RT106 based on the alignment of the 7.1 kb segment of GI1, contained the least number of GI1-associated genes as opposed to STs branching closer to RT106/ST42/28.
To further interrogate whether GI1 was acquired via horizontal transfer, we compared the molecular clock of the 7.1 kb GI segment ( Fig. 4) with the a molecular tree based on genes assumed to be refractory to horizontal gene transfer 37,38 . Thus, a minimum spanning tree was generated using the seven housekeeping genes utilized in MLST characterization to establish genetic relatedness of strains harboring the core 7.1 kb GI1 fragment (Supplemental Fig. S2). ST28, a sequence type that is included within the RT106 clade, is closely related to ST16, ST18 and ST46 based on sequence similarity of the core GI1 fragment (Fig. 4). However, only ST16, ST18 and ST28 are closely related based on the seven MLST gene loci; ST46 is distantly placed from ST16, ST18 and ST28 (Supplemental Fig. S2). A similar case is observed with the more predominant sequence type, ST48, within the   Fig. S2), and yet these ST strains map distantly in the core GI1-based molecular clock (Fig. 4). Since the tree topologies do not exhibit the same pattern, it is likely that GI1 is acquired laterally. Further analysis of tree topologies via likelihood ratio tests showed that the ML tree based on the 7.1 kb shared region within GI1 was significantly different from the ML tree based on core genome SNPs (P value = 6.83E−74 calculated using approximately unbiased test) and the ML tree based on the seven MLST housekeeping genes (P value = 1.17E−71). The entire GI1 is not found in any other bacteria. However, two regions (8.4 kb and 13.7 kb) within GI1 were detected in other enteric bacteria (Fig. 5). The 13.7 kb gene segment was found in Enterococcus faecium EnGen0312 UAA407 at 99% sequence identity, while the 8.4 kb gene segment occurs in the same gene order but with some sequence plasticity in Enterococcus faecium EnGen0312 UAA407, Anaerostipes hadrus BPB5-Raf3-2-5, Clostridium sporogenes YH-Raf3-2-5 and Roseburia intestinalis M50/1 strains (89.8%, 90.4%, 90.5% and 92.2% DNA sequence identity, respectively).

Phenotypic characterization of RT106 isolates.
Clade-specific properties, including those conferred by genes within GI1 could explain the emergence and spread of RT106 strains. Therefore, we assessed various virulence-associated phenotypes including antibiotic susceptibility, motility, toxin production, biofilm production and adhesion to collagen on the first 21 of the 38 RT106 strains chronologically obtained from our clinical surveillance.
RT106 strains display variable antibiotic susceptibility, with some isolates displaying multi-drug resistance. We determined the susceptibility of RT106 isolates to the antibiotics cefotaxime, vancomycin, erythromycin, clindamycin, levofloxacin, moxifloxacin, metronidazole, and tetracycline. All isolates were resistant to cefotaxime (minimum inhibitory concentration (MIC) > 32 mg/ml), but susceptible to vancomycin, metronidazole, and tetracycline (Table 1). 18/21 strains had intermediate resistance to clindamycin (MIC = 4-6 mg/ml). Three isolates (GV371, GV423, GV432) were highly resistant to erythromycin (MIC > 256 mcg/ml). Clindamycin and erythromycin belong to the macrolide-lincosamide-streptogramin B (MLS B ) group of protein synthesis inhibitors. MLS B resistance in C. difficile has been associated with the acquisition of erm genes 39 or nucleotide substitution (C → T) at position 656 within the 23S rDNA 40 . None of the RT106 strains harbor the erm genes, while only GV415 had the 23S rDNA 656C>T substitution (Supplemental Table S1); however, GV415 has low-level resistance to clindamycin (MIC = 4 mg/ml) and is susceptible to erythromycin.
All RT106 isolates, except GV597, were susceptible to the fluoroquinolone levofloxacin. GV597, GV453, GV587, and GV642 had intermediate resistance to the fluoroquinolone moxifloxacin (MIC = 4-6 mg/ml).  Table S1). The levofloxacin-resistant GV597 strain harbors an A421T mutation within the primary dimer interface of the conserved topoisomerase domain of gyrase A (Supplemental Table S1), but GyrA A421T mutation is not previously known to be associated with fluoroquinolone resistance in C. difficile. GI1 harbors a gene encoding a VanZ family protein (locus ID FE556_11215; Supplemental Table S3). VanZ family proteins were previously implicated in teicoplanin resistance 46 . The GI1-encoded vanZ gene, present in all RT106 strains, is not found within C. difficile strains 630, VPI and BI-1 (Supplemental Table S1). Consistent with this, all RT106 isolates exhibit modest increase in resistance to teicoplanin compared to reference strains (T7, BI-1, 630, VPI; Table 1); the teicoplanin CLSI and EUCAST breakpoint values for C. difficile have not been established. Cultivation of RT106 strains in sub-inhibitory concentration (MIC) of teicoplanin (0.0125 mg/mL) resulted in increased teicoplanin resistance in 7/21 strains (Supplemental Table S5).

RT106 strains display collagen-dependent biofilm formation. Biofilm formation could facilitate
intestinal colonization and persistence, and possibly contribute to recurrence 47 . RT106 strains display variable biofilm densities on an abiotic plastic surface (Fig. 6a). Since GI1 encodes a putative SrtB-anchored collagenbinding adhesin (locus ID FE556_11350; Supplemental Table S3), we tested the ability of RT106 strains to form biofilms on type I and type III collagen, the major collagen types present in the extracellular matrix of normal human intestines 48 .
Biofilm densities of the non-RT106 toxigenic C. difficile strains BI1, 630 and VPI did not increase in the presence of collagen (Fig. 6b). However, eleven RT106 strains displayed collagen-dependent increase in biofilm formation when cultured on wells coated with both type I and type III human collagen. Overall, RT106 strains, as a group, have increased likelihood of displaying collagen-dependent biofilm formation.
We also interrogated the ability of the strains to form biofilms on either human type I or type III collagen individually. Although some RT106 strains showed increased biofilm formation on either collagen type (6 to human type I collagen; 5 to human type III collagen) (Supplemental Fig. S3), the RT106 strain group did not show collagen-dependent biofilm formation when only one collagen type was used for collagen coating. Curiously, GV426, GV453, and GV457 showed synergistic increase in biofilm formation to human types I and III collagen (Fig. 6).
We also tested the ability of the 21 RT106 strains to form biofilms on rat type 1 collagen and found that ten strains formed denser biofilms on rat collagen (Supplemental Fig. S4). GV425, GV426, GV432, GV453 and GV457 consistently formed denser biofilms on human and rat type I collagen compared to uncoated wells.
All RT106 isolates tested, except GV375, GV415 and GV426, were motile (Fig. 7). We analyzed the genome of the non-motile RT106 strains for mutations in flagella-associated genes. In C. difficile 630 strain, flagella-associated genes are found in the F1 and F3 loci 50,51 . F1 and F3 loci are highly conserved in RT106; therefore, the nonmotile phenotype observed for GV375, GV415 and GV426 may possibly result from alterations in expression and/or post-translational modifications. Genome analysis of RT106 isolates revealed that all strains harbor the complete PaLoc and the gene for the TcdB1, instead of the highly toxigenic TcdB2 variant associated with select ribotypes including RT027 54,55 . We quantified secreted toxin, and observed that all RT106 strains, except GV457 and GV423, produced detectable TcdA/TcdB levels (Fig. 8). Nine RT106 isolates expressed TcdA/TcdB at levels comparable to the reference strain 630, while ten RT106 strains had similar (4/10) or higher (6/10) TcdA/TcdB levels compared to the RT027 strain BI-1.

Discussion
Consistent with broader trends in the United States, RT106 has emerged as the second leading ribotype from healthcare-associated cases in Southern Arizona [15][16][17][18][19][20][21] . Our genotypic and phenotypic characterization of multiple RT106 strains, along with the recent studies by Kociolek et al., represents an initial foray into defining key virulence properties of this clade 27,29 .
The factors contributing to the emergence and expansion of RT106 strains are presently undefined, but they appear to be distinct from those postulated for the healthcare-and US-dominant RT027 clade. First, the enhanced ability of RT027 strains to utilize trehalose, a sugar increasingly used in food products since the early 2000s, may have provided a selective advantage for this clade, although this has recently been disputed 56,57 . None of the 94 sequenced RT106 genomes harbor the Leu-1721-Ile substitution in the TreR repressor or the four-gene insertion Third, the PaLoc region of RT027 strains displays several key differences relative to the historic strain 630 (RT012) 61 . These include a point mutation in tcdC (though not in all isolates) that results in a truncated version of the anti-sigma factor TcdC, and expression of a variant of toxin B (TcdB2). TcdB2 has enhanced ability to enter host cells, is more cytotoxic, and exhibits wider tissue tropism 54,55 . In contrast to RT027 strains, the PaLoc of RT106 strains is 100% identical to 630 [62][63][64][65] ; thus, these strains encode full-length TcdC and express the TcdB1 toxin variant. Both RT027 and RT106 isolates produce variable amounts of TcdA/TcdB. Also, unlike 630 and RT106 strains, RT027 strains encode the binary toxin. Thus, toxin variations seem to be an unlikely driving force for the spread of RT106 strains. Figure 6. Clinical RT106 isolates display collagen-dependent biofilm formation. (a) 21 clinical RT106 strains (blue circles) and 3 non-RT106 toxigenic C. difficile strains (VPI, BI-1, and 630 designated as green, yellow and black circles, respectively) were cultured in uncoated or collagen-coated (combined types I and III) plastic wells for 72 h. RT106 strains displayed variable levels of biofilm on abiotic plastic wells. (b) Relative changes in biofilm densities (ΔA 570nm ) were determined by comparing A 570nm of crystal violet-stained biofilms formed on human collagen (combined types I and III) vs. on uncoated plastic wells. Filled blue circles denote P value < 0.05 determined using Student's t test to compare mean A 570nm by each strain on collagen-coated vs. uncoated wells. No difference in biofilm formation was observed when the reference C. difficile 630, BI-1 and VPI strains were cultured on wells with or without collagen. Overall, RT106 strains displayed denser biofilms on collagen-coated wells (One-sample one-tailed T-test; H alt : mean ΔA 570nm > 0; H 0 : mean ΔA 570nm = 0; P value = 0.02038). www.nature.com/scientificreports/ Detailed genome sequence analyses, however, suggest that the acquisition of novel genetic islands may be a contributor to RT106 emergence. All sequenced RT106 strains harbor a unique 46 kb genomic island (GI1) with a distinct GC content suggestive of horizontal acquisition. GI1 possesses several gene attributes that may confer competitive advantage to the RT106 clade. It harbors a vanZ allele (Locus ID FE556_11215), distinct from vanZ1 (Locus ID FE556_05915; 49% identity) present elsewhere in RT106 genome and in other C. difficile strains including the well-studied 630 66 (Supplemental Table S1). In 630, VanZ1 was previously shown to confer low level resistance to the glycopeptide antibiotic, teicoplanin, but not to vancomycin 66 . The presence of a second VanZ allele may contribute to the modest increase in teicoplanin resistance of RT106 strains. The potential selective advantage of this phenotype cannot be ruled out; while teicoplanin is not FDA-approved for use in the US, it is widely used in Europe, Asia and South America.
In addition to the strain 630 cd2831 SrtB-anchored collagen-binding adhesin homolog (99% protein identity) 67 , all RT106 strains encode a paralog within GI1 (locus ID FE556_11350; 33% protein identity with CD2831); this gene was earlier reported as an 'RT106-associated accessory gene' 29 . A subset of RT106 strains (13/94; 6 strains assayed for biofilm formation) also contains an additional paralog within GI3 (locus ID FE556_02390; 79% protein identity with GI1 locus ID FE556_11350). The robust collagen-dependent biofilm formation observed in the RT106 isolates may be due to the presence of any one or combination of these genes. Further www.nature.com/scientificreports/ investigation is required to parse the contribution of these genes to virulence. Since toxigenic C. difficile can breach the intestinal epithelium via cell junction disruption and/or epithelial cell death, thereby exposing the components of the extracellular matrix including collagen, strong vegetative cell and biofilm adhesion to collagen is a possible mechanism promoting C. difficile colonization of, and persistence in, the host. GI1 also harbors genes for anti-restriction modification (ardA; Locus ID FE556_11265, FE556_11270), multidrug resistance (mfs; Locus ID FE556_11225), methylglyoxal detoxification (gloA; Locus ID FE556_11330), and cation transport (Locus ID FE556_11135) containing a FieF domain (NCBI Conserved Domain cl30791) associated with iron-cobalt-zinc-cadmium resistance. Homologs of these genes are linked to virulence of other pathogens [68][69][70]  Fragments of GI1 were found in different C. difficile sequence types. Molecular clock analysis suggests that the complete island is a composite of sequences sequentially acquired from progenitor ST strains. The molecular clock based on the conserved GI1 segment is asynchronous with the one based on housekeeping genes ( Fig. 4 and Supplemental Fig. S2). Further, consistent with higher GC content of GI1 relative to rest of the C. difficile genome, it is likely that the progenitor ST strains acquired the DNA segments from non-clostridial organisms via horizontal gene transfer. While the complete island is yet to be found in any other microbial genome or plasmid, two gene segments (8.4 kb and 13.7 kb) were detected in other enteric bacteria (Fig. 5). For the 8.4 kb gene segment, the most closely related sequences occur in Roseburia intestinalis M50/1 strains (92.2% identity). The 13.7 kb segment displays 99% identity to sequence within a 36 kb genomic island in E. faecium. While the 8.4 kb and 13.7 kb segment in GI1 may have been derived from E. faecium, the candidate donors of the other gene segments in GI1 are presently unknown. The presence of these genetic segments in disparate enteric organisms may suggest that they confer some selective advantage within the intestinal environment.

Conclusions
Clostridioides difficile RT106 is virulent in a hamster model of infection, and all sequenced isolates within this clade harbor a unique 46 kb GI1. Consistent with the presence of genes encoding a VanZ family protein and a SrtB-anchored collagen-binding adhesin within GI1, RT106 strains had increased teicoplanin resistance and robust collagen-dependent biofilm formation, respectively. Further investigation is required to implicate GI1 genes to RT106 virulence.

Clostridioides difficile surveillance. This study, approved by the University of Arizona Institutional
Review Board, utilized to-be-discarded stool specimens from diarrheic patients at the Banner University Medical Center (BUMC) in Tucson, Arizona between August 1, 2015 and July 31, 2018. Samples were collected and www.nature.com/scientificreports/ stored at − 80 °C. From August 2015 to February 2017, tcdB-positive stool samples tested by BUMC via polymerase chain reaction (PCR) were included in the study. On March 2017, BUMC implemented the glutamate dehydrogenase (GDH) and toxin enzyme immunoassay for C. difficile testing. All GDH + samples were collected. We screened for the presence of tcdB in the GDH + /toxin− samples via PCR using the following primers: B1C (5′-GAA AAT TTT ATG AGT TTA GTT AAT AGAAA-3′) and B2N (5′-CAG ATA ATG TAG GAA GTA AGT  CTA TAG-3′) 71 . For samples received during March 2017 to July 2018, only GDH+/toxin+ or GDH + /toxin− and tcdB-PCR-positive samples were analyzed in this study.
Ribotyping of clinical C. difficile isolates. Stool samples plated on taurocholate cycloserine cefoxitin fructose agar (TCCFA) were cultured anaerobically at 37 °C. Isolated colonies were lysed with G-Biosciences Toothpick-PCR, and supernatants were used as templates for ribotyping PCR using the following primers: 16S (5′-GTG CGG CTG GAT CAC CTC CT-3′) and 23S (5′-CCC TGC ACC CTT AAT AAC TTG ACC -3′) 72,73 . Isolated colonies were also submitted to the University of Arizona Genomics Core for genomic extraction using QIA-GEN DNeasy column-based extraction kit and ribotyping PCR using the same 16S and 23S primers. PCR products were resolved via capillary electrophoresis using an AB Prism 3730 Genetic Analyzer (Applied Biosystems, Foster City, CA) and amplicon length evaluated using Marker www.nature.com/scientificreports/ using Artemis DNA Plotter. GC content (%) and relative positions of GI1, GI2, and GI3 are indicated in the map (Fig. 2a).
Maximum likelihood (ML) trees of core genomes. ML trees were constructed for two groups of genomes; (1) 94 strains identified to clade together in the composition vector tree and found to contain a complete GI1, and (2) 265 strains that contain complete and partial (> 7.7 kb and 98% identity) segments of GI1. Panseq 81 was used to determine the core SNPs. MEGA-X 36 was used to infer phylogenies by using the Maximum Likelihood method and Tamura-Nei model 82 . Trees with the highest log likelihood were shown in Figs. 2c and 3. The bootstrap consensus tree inferred from 1000 replicates was taken to represent the evolutionary history of the taxa analyzed 83 .
Molecular clock analysis. The 7.1 kb genetic region common to 265 C. difficile strains was used to deduce the possible evolutionary formation of genetic island 1 on RT106. Mega-X 36 was used to construct a timetree inferred by applying the RelTime method 84,85 to the a phylogenetic tree whose branch lengths were calculated using the Maximum Likelihood (ML) method and the Tamura-Nei substitution model 82 [88][89][90][91][92] to compare ML trees based on core genome SNPs (designated as T 1 in this analysis) and seven MLST genes (designated as T 2 ) against the ML tree based on the 7.1 kb shared region within GI1 (designated as T 0 ) with the following hypotheses: H O : T 0 and T 1 , or T 0 and T 2 , would explain the sequence diversity of the 7.1 kb shared region within GI1 equally well (T 0 = T 1 or T 0 = T 2 ).
H A : T 1 and/or T 2 does not explain the sequence diversity of the 7.1 kb shared region of GI1 (T 0 ≠ T 1 or T 0 ≠ T 2 ). All tests were performed with 10,000 resamplings using the RELL method.
Antibiotic susceptibility testing. Overnight  www.nature.com/scientificreports/ collagen type per well). Overnight cultures of C. difficile strains were diluted in BHI containing 100 mM glucose (OD 600nm = 0.1). One mL of the culture was added per well of the uncoated or collagen-coated plate and incubated anaerobically for 72 h at 37 °C. Supernatants were removed gently by tilting plates onto a collection basin. Biofilms were washed twice by gently submerging plates in glass basins of PBS. Excess PBS was removed by inverting plates onto tissue paper. Biofilms were fixed for 20-40 min at 37 °C, and then stained with 1 mL of 0.2% filter-sterilized crystal violet for 30 min. Biofilms were washed twice with PBS as described above. For quantification of biofilm growth, 1 mL of 4:1 ethanol/acetone solution was added to each sample. 100 μL aliquots were transferred to a 96-well plate, and absorbance at 570 nm (A 570nm ) was determined using BioTek Synergy automated plate reader. Relative changes in biofilm densities (ΔA 570nm ) were determined by comparing A 570nm of crystal violet-stained biofilms formed on collagen-coated vs. on uncoated plastic wells.