Genomic epidemiology of global Klebsiella pneumoniae carbapenemase (KPC)-producing Escherichia coli

The dissemination of carbapenem resistance in Escherichia coli has major implications for the management of common infections. bla KPC, encoding a transmissible carbapenemase (KPC), has historically largely been associated with Klebsiella pneumoniae, a predominant plasmid (pKpQIL), and a specific transposable element (Tn4401, ~10 kb). Here we characterize the genetic features of bla KPC emergence in global E. coli, 2008–2013, using both long- and short-read whole-genome sequencing. Amongst 43/45 successfully sequenced bla KPC-E. coli strains, we identified substantial strain diversity (n = 21 sequence types, 18% of annotated genes in the core genome); substantial plasmid diversity (≥9 replicon types); and substantial bla KPC-associated, mobile genetic element (MGE) diversity (50% not within complete Tn4401 elements). We also found evidence of inter-species, regional and international plasmid spread. In several cases bla KPC was found on high copy number, small Col-like plasmids, previously associated with horizontal transmission of resistance genes in the absence of antimicrobial selection pressures. E. coli is a common human pathogen, but also a commensal in multiple environmental and animal reservoirs, and easily transmissible. The association of bla KPC with a range of MGEs previously linked to the successful spread of widely endemic resistance mechanisms (e.g. bla TEM, bla CTX-M) suggests that it may become similarly prevalent.


Results
Global bla KPC -E. coli strains are diverse, even within the most prevalent ST, ST131, with evidence for local transmission. 45 isolates were obtained from 21 cities in 11 countries across four continents (2010-2013; previous laboratory typing results summarized in Table S1). One isolate was bla KPC -negative on whole-genome sequencing (WGS; ecol_252), potentially having lost bla KPC during storage or sub-culture in the intervening time period between when the original typing was undertaken and the subsequent DNA extraction and preparation for WGS. For one isolate the WGS data were inconsistent with the lab typing results (ecol_451), likely representing a laboratory mix-up; ecol_252 and ecol_451 were therefore excluded from subsequent analyses. The other 43 isolates were successfully sequenced (for quality metrics see Table S1). Amongst these 43 isolates, twenty-one different E. coli STs were represented (Table 1; predicted in silico from WGS), including: ST131 [n = 16], ST410 [n = 4], ST38 [n = 3], ST10, ST69 [n = 2 each] (remaining isolates singleton STs).
Of 16,053 annotated open reading frames (ORFs) identified across all KPC-E. coli isolates, only 2,950 (18.4%) were shared in all isolates ("core"), and a further 222 (1.4%) in 95-< 100% of isolates ("soft core" 27 ). At the nucleotide level there were 213,352 single nucleotide variants (SNVs) in the core genome, consistent with the previously observed species diversity 28 . Resistance gene profiles also varied markedly between strains, with some harbouring several beta-lactam, aminoglycoside, tetracycline and fluoroquinolone resistance mechanisms (e.g. ecol_224) and others containing bla KPC only (e.g. ecol_584; Fig. 1). For the 16 KPC-ST131 strains, 4,071/7,910 (51%) ORFs were core, with 6,778 SNVs across the core genome of these isolates, again consistent with previous global studies of ST131 diversity 23,24 (Figure S1). Accessory genomes were highly concordant for some (e.g. ecol_356/ecol_276/ ecol_875), but not all (e.g. ecol_AZ159/ecol_244) isolates that were closely related in their core genomes, supporting highly variable evolutionary dynamics between core and accessory genomes (Fig. 1). The geographic distribution of isolates closely related in both the core and accessory genomes supports local (e.g. ecol_AZ166, ecol_AZ167 [ST131, Beijing, China]) transmission of particular KPC-E. coli strains. The homology of genetic flanking motifs around the bla KPC genes in these closely related isolate pairs would also be consistent with this hypothesis, and less consistent with multiple acquisition events of bla KPC within the same genetic background, especially given the diversity in bla KPC flanking sequences observed across the rest of the dataset (see below).
bla KPC genes appear restricted to plasmid contexts in E. coli at present, but may exist in multiple copies on single plasmid structures or in high copy number plasmids. Thirty-four isolates (80%) contained bla KPC-2 , and nine isolates (20%) bla KPC-3 . Chromosomal integration of bla KPC has been described in other Enterobacteriaceae, Pseudomonas and Acinetobacter spp. but remains rare 5,29,30 . There was no evidence of chromosomal integration of bla KPC in either the 18 chromosomal structures reconstructed from long-read sequencing, or based on review of the annotations in the bla KPC  Estimates of bla KPC copy number per bacterial chromosome varied between <1 (ecol_879, ecol_881) and 55 (ecol_AZ152). In nine cases this estimate was ≥10 copies of bla KPC per bacterial chromosome (ecol_276, ecol_356, ecol_867, ecol_869, ecol_870, ecol_875, ecol_AZ150, ecol_AZ152, ecol_AZ159, Table S2). Six of these isolates contained bla KPC in a col-like plasmid context, in two cases the plasmid rep type was unknown, and in one case it was an IncN replicon. Plasmid copy number is associated with higher levels of antibiotic resistance if the relevant gene is located on a high-copy unit. Interestingly, high copy number plasmids are postulated to have higher chances of fixing in descendant cells, as they distribute more adequately by chance and without the requirement for partitioning systems 31 , and of being transferred in any conjugation event, either directly or indirectly 32-34 . bla KPC and non-bla KPC plasmid populations across global KPC-E. coli strains are extremely diverse. Plasmid Inc typing revealed the presence of a median of four plasmid replicon types per isolate (range: 1-6; IQR: 3-5), representing wide diversity (Table 1). However, IncN, col, IncFIA and IncI1 replicons were disproportionately over-represented in certain STs (p < 0.05; Table 1). Amongst the 18 isolates that underwent PacBio sequencing, we identified 53 closed, non-bla KPC plasmids, ranging from 1,459 bp to 289,903 bp (Table S1; at least four additional, partially complete plasmid structures were present). Of these non-bla KPC plasmids, 10 (size: 2,571-150,994 bp) had <70% similarity (defined by percent sequence identity multiplied by proportion of query length demonstrating homology) to other sequences available in GenBank, highlighting that a proportion of the "plasmidome" in KPC-E. coli remains incompletely characterized. For the other 43 plasmids, the top match in GenBank was a plasmid from E. coli in 35 cases, K. pneumoniae in 5 cases, and Citrobacter freundii, Shigella sonnei, Salmonella enterica in 1 case each (Table S3).
Twenty-two bla KPC plasmid structures were fully resolved (17 from Pacbio data only, four from Illumina data only, 1 from both PacBio and Illumina data), ranging from 14,029 bp to 287,067 bp (median = 55,590 bp; IQR: 23,499-82,765 bp). These bla KPC -containing plasmids, and six additional cases where bla KPC was identified on a replicon-containing contig, were highly diverse based on Inc typing (Table S1). IncN was the most common type (n = 8/28 type-able bla KPC structures; 29%), followed by small, col-like plasmids (n = 6/28 [col-like plasmids with single replicons only]; 21%). Other less common types were: A/C2, FII(k), U (all n = 2); and L/M, P, Q1 and R  58 . Undmbers in square brackets represent the known subset of bla KPC plasmids in each cell. Exact test compares presence/absence of each Inc type by ST. The replicon type specifically associated with bla KPC could not be evaluated in 15 isolates, due to limitations of the assemblies. a one multi-replicon plasmid also containing IncFIA. b one multi-replicon plasmid also containing IncFIA and IncR. c one multi-replicon plasmid also containing IncFIB. d one multireplicon plasmid also containing repA.
(all n = 1). Four (14%) bla KPC plasmids were multi-replicon constructs, namely: col/repA, FIB/FII, FIA/FII, and FIA/FII/R.  (Table S4)  In the first instance, genetic similarities were identified between Plasmid-9, pKPC-FCF/3SP, pKPC-FCF13/05, pCF8698, pKP1433 (representing a hybrid IncN), and bla KPC plasmids from isolates ecol_516, ecol_517, ecol_656, and ecol_736 (this study). Plasmid-9 contains duplicate Tn4401b elements in reverse orientation with four different 5 bp flanking sequences in an atypical arrangement within a group II intron 35 . The backbone structures of the other plasmids in this group are consistent with a separate acquisition event of a Tn4401b element between the pld and traG regions within an ancestral version of the Plasmid-9 structure, with the generation of a flanking TTCAG target site duplication (TSD) (labelled as Plasmid 9-like plasmid (hypothetical), Fig. 2). International spread followed by local evolution both within and across species would account for the differences between plasmids, including: (i) nucleotide level variation (observed in all plasmids); (ii) small insertion/deletion events (observed in all plasmids); (iii) larger insertion/deletion events mediated by transposable elements (e.g. pCF8698_KPC_2); and (iv) likely homologous recombination, resulting in clustered variation within a similar plasmid backbone (e.g. ecol_656/ecol_736), as well as more distinct rearrangements, including the formation of "hybrid" plasmids (e.g. pKP1433) (Fig. 2).
In Plasmid-12 (FJ223605), Tn4401b has inserted into a hybrid Tn2-Tn3-like element (with associated drug resistance genes including bla TEM-1 , bla OXA-9 , and several aminoglycoside resistance genes), albeit in the absence of target sequence duplication, possibly as the result of an intra-molecular, replicative transposition event generating mismatched target site sequences (L TSS = TATTA; R TSS = GTTCT). This complex is in turn located  Table S2), core and accessory genome components. For the accessory genome panel, blue represents annotated regions that are present, and white those that are absent. between two IS15DIV (IS15Δ)/IS26-like elements flanked by 8 bp inverted repeats, and located between the traI (891 bp from 3′ end) and pld loci (~28 Kb; Fig. 3A). The backbone components of the IncN Plasmid-12 are consistent with those seen in an NIH outbreak 5 and in a rearranged version in a University of Virginia outbreak (CAV1043; 2008) 6 . From this study, plasmids from ecol_224, ecol_881, ecol_AZ159, ecol_422, and scaffolds from ecol_AZ151, ecol_744, ecol_AZ150 all share near identical structures to Plasmid-12, with clustered nucleotide level variation present in the traJ-traI genes, consistent with a homologous recombination event affecting this region, and evidence of sporadic insertion/deletion events (Fig. 3A). However, the bla KPC-Tn4401 structures in these isolates are almost entirely degraded by the presence of other mobile genetic elements (MGEs), including Tn2/Tn3-like elements, ISKpn8/27 and Tn1721. In ecol_224, bla KPC-2 has been inserted into the IncN backbone as part of two repeat, inverted Tn3-like structures, flanked by a TTGCT TSD, and closer to traI (136 bp from 3′ end) than the aforementioned IS15DIV (IS15Δ)/IS26-like complex in Plasmid-12 (Fig. 3B). Although it is not possible to accurately trace the evolutionary history of this genomic region given the available data, the presence of shared signatures of this structure in ecol_422, ecol_744, ecol_881, ecol_AZ159, ecol_AZ150 and ecol_AZ151 suggest a common acquisition, and multiple subsequent rearrangements mediated by the presence of the large number of MGEs flanking bla KPC-2 . KPC  ). These three isolates additionally contained FIA, FIB, FII, X3 and X4 replicons, suggesting stable persistence of a clonal strain + plasmids over time, consistent with both SNV/core and accessory genome analyses (Fig. 1, Figure S1).

Col-like plasmids may represent an important vector of transmission for bla
The other two col-like plasmids effectively represent short stretches of DNA encoding different mobilization genes (mbeA/mbeC/mbeD) harnessed to Tn4401/bla KPC modules. The 5 bp sequences flanking Tn4401  Geographic origin, dates of isolation and overall alignment of plasmid/contig structures. Plasmid sequence names in red are those from this study, derived from PacBio data and closed (ecol_224, ecol_422, ecol_881, ecol_AZ159) or incomplete plasmid structures (ecol_744, ecol_AZ151, ecol_AZ150) derived from Illumina data. Aligned bars adjacent to plasmid names represent plasmid sequences: light grey denotes regions with 100% sequence homology; black represents nucleotide diversity between sequences; and thin lines represent indels. Coding sequences are represented by fat arrows below individual sequence bars and are colour-coded as per the colour key. The inset schematic describing genetic variation between sequences depicts examples of evolutionary events identified: (a) single nucleotide level change, (b) small indels (≤100 bp), (c) large indels (>100 bp), (d) recombination events. Panel 3B. Close-up of the region between traI and pld containing bla KPC-2 in study isolates only. Coding sequences are colour-coded as in Fig. 3A; sequence regions referred to in the text are annotated.
were consistent with direct, intermolecular transposition in both cases (ecol_870: TGTTT-TGTTT; ecol_867: TGTGA-TGTGA). A col/repA co-integrate plasmid was also observed in this dataset (ecol_AZ161), in which Tn4401b was inserted between colE3 signature sequences and a Tn3 element (Tn4401 TSS: AGATA-GTTCT). The formation of such co-integrate plasmid structures in E. coli has also been previously described 36 , including that of a fused col/pKpQIL-like plasmid structure (pKpQIL being historically associated with bla KPC ) 37 .
Col-like plasmids have been associated with KPC-producers in other smaller, regional studies 21, 38 . Of concern, these small vectors have been shown to be responsible for the inter-species diffusion of qnr genes mediating fluoroquinolone resistance, even in the absence of any obvious antimicrobial selection pressure 39 . The significant association of col-like plasmids with particular E. coli STs (predominantly ST131) in this study could be one explanation for the disproportionate representation of bla KPC in this lineage.
From the full set of GenBank plasmids and in vitro transposition experiments carried out by others, 30 different types of 5 bp TSS pairs have been characterized, seven in the experimental setting only 40 . The downloaded plasmids come from a range of species and time-points (2005-2014), although they may under-represent wider Tn4401 insertion site diversity as a result of sampling biases. Our data however would be consistent with significant Tn4401 mobility within E. coli following acquisition of diverse Tn4401 isoforms and/or represent multiple importation events into E. coli from other species.
This apparent diversity in independently acquired MGEs around the bla KPC gene extends the means by which bla KPC can be mobilized. Interestingly, as observed previously 43 , all the degraded Tn4401 sequences in this dataset were associated with variable stretches of flanking Tn2/3-like sequences, suggesting that the insertion of Tn4401 into a Tn2/Tn3-like context may have enabled the latter to act as a hotspot for the insertion of other MGEs 6 . A particular finding of note is the association with IS26, which has been linked to the dissemination of several other resistance genes in E. coli, including CTX-M ESBLs 24,44 ; is able to increase the expression of closely co-located resistance genes 45 ; participates in co-integrate formation and hence plasmid rearrangement 46 ; and enhances the occurrence of other IS26-mediated transfer events into plasmids harbouring IS26 46 .

Discussion
This study of KPC-E. coli obtained from two global resistance surveillance schemes has demonstrated the diversity of genetic structures associated with bla KPC at all genetic levels, including: (i) host bacterial strain; (ii) plasmid types; (iii) associated transposable MGEs, including transposons and insertion sequences; and (iv) bla KPC alleles. This has previously been observed within institutional, poly-species outbreaks, particularly for non-E. coli Enterobacteriaceae 5, 6 , as well as in a more recent study of nine KPC-E. coli from the US 47 . We have identified global and regional bla KPC spread at the strain and plasmid levels, including signatures consistent with inter-species spread of plasmids, over short timeframes. Although the geographic reach of sampling has been more substantial than any other similar study, there are some limitations in the sampling consistency of both surveillance schemes 22 (e.g. isolates from China were only submitted in 2008, 2012 and 2013).
We utilized long-read sequencing on only a subset of isolates, given resource limitations, allowing us to completely resolve chromosomal and plasmid structures in less than half the isolates. Nevertheless, despite this drawback, we have highlighted the extraordinary diversity present. This study, along with other recent analyses utilizing long-read sequencing to resolve antimicrobial resistance plasmids 5,6 , also demonstrates the difficulty in making evolutionary comparisons for MGEs, given the absence of effective phylogenetic methods/tools to characterize their genetic histories which commonly involve genetic rearrangements, and evolutionary events that are not restricted to single nucleotide mutations.
This study has demonstrated the particular association of bla KPC in E. coli with IncN plasmids, previously associated with the spread of other antimicrobial resistance elements 48 , as well as col-like plasmids, which are small, potentially highly mobile, and generally high copy-number units. The traditional association of bla KPC with Tn4401 has apparently been eroded in E. coli, with the complete Tn4401 structure absent in 50% of strains investigated. This finding is in contrast to most global descriptions of K. pneumoniae where bla KPC has been stably associated with largely intact Tn4401 isoforms for more than a decade. Instead, other shorter MGEs, such as Tn2/ Tn3-like elements and IS26, appear to be commonly involved in bla KPC dispersal in E. coli. These MGEs have been associated with the spread of multiple resistance mechanisms, such as bla TEM and bla CTX-M , and will potentially similarly contribute to the dissemination of bla KPC in E. coli. We did not undertake any functional assays investigating the dynamics of bla KPC transmission in E. coli to support this hypothesis, but this would be illuminating work for future study.
The global emergence and spread of bla KPC in E. coli has been driven by multiple mechanisms, including local and international spread of highly genetically related strains, exchange of plasmids with other Enterobacteriaceae and between E. coli lineages, transposition events within the species, and a breakdown of the traditional association of bla KPC with Tn4401. The genetic flexibility observed is impressive, especially given the timeframes and number of KPC-E. coli characterized. Tracking the spread of resistance genes given such multi-level genetic variability is complicated, even with a high-resolution typing method such as WGS. The association of E. coli, both a common pathogen and commensal in a wide range of environmental/animal reservoirs, with MGEs (col-like plasmids, IS26) that have been shown to facilitate the dissemination of other successful resistance genes even in the absence of antimicrobial selection pressures, may represent a difficult situation to control.  Table S1). Isolates had been previously characterized using partial, sequenced-based typing methods, including multi-locus sequence typing (MLST; Achtman scheme), fimH typing, PCR for beta-lactamases, strain/plasmid PFGE (Table S1) 22 .

Methods
DNA extraction and sequencing. All isolates were sequenced on the Illumina MiSeq; a subset of 18 were purposively selected for PacBio sequencing, to capture potential diversity across a range of years of isolation, geographic location, standard ST, plasmid size and resistance gene content (based on laboratory typing). DNA for sequencing was extracted from sub-cultures of bacterial stocks (frozen at −80 °C; cultured overnight on Columbia blood agar at 37 °C) using the Qiagen Genomic tip 100/G extraction kit, as per the manufacturer's instructions (Qiagen, Hilden, Germany; catalogue no: 10243).
DNA libraries for MiSeq sequencing were generated and normalized using 300 base, paired-end Nextera XT DNA library preparation kits (Illumina, San Diego, CA, USA). PacBio sequencing on the subset of strains was performed as previously described 49 ; in these cases, the same DNA extract was used for both Illumina and PacBio sequencing approaches.
Sequence data processing. Illumina (short-read data). Mapping-based approaches: Prior to reference-based mapping to the O150:H5 SE15 E. coli reference genome (Genbank accession: NC_013654), Illumina data were trimmed using cutadapt version 1.5. SE15, which is ST131, was chosen as the reference given the largest number of strains sequenced (and in the dataset) came from this ST. Repetitive regions of the reference were identified using self-self BLASTn analysis with default settings; these regions were then masked prior to mapping and base calling. Properly paired sequence reads were mapped to the reference using Stampy (v1.0.17) (Supplementary methods).
Single-nucleotide variants (SNVs) were determined across all mapped non-repetitive sites using SAMtools (version 0.1.18) mpileup. mpileup was run twice to separate high-quality base calls from low-quality base calls; variant call format (VCF) files of annotated variant sites were created using GATK (v1.4.21). Base calls derived from these two VCF files were filtered to retain only high quality calls (Supplementary methods).
Core variable sites (site called in all sequenced isolates, excluding "N" or "-" calls) derived from mapping to the SE15 reference were "padded" with invariant sites in a proportion consistent with the GC content and length of the reference genome (4.72 Mb, 51% average GC content), to generate a modified alignment of input sequences to generate phylogenies. Phylogenies were reconstructed using RaxML (Version 7.7.6) 50 , with a generalized time reversible model, four gamma categories (allowing for variable rates of mutation between sites), and bootstrapped 100 times.
De novo assemblies of Illumina data: De novo assemblies of short-read Illumina data for all isolates were generated using the A5-MiSeq pipeline (version 20140604; default settings) 51 , which includes adapter/low-quality region read trimming steps (Trimmomatic), initial contig assembly, crude scaffolding, misassembly correction and final scaffolding. We used the unscaffolded contigs file in subsequent analyses (*.contigs.fasta).
PacBio (long-read data). DNA library preparation for and sequencing on the PacBio RSII were performed in accordance with the manufacturer's instructions, using P5-C3 sequencing enzyme and chemistries respectively, and following a 7-50 kbp fragment selection step (full details in ref. 49). De novo assemblies were constructed using HGAP3 (version 2.2.0) 52 , resulting in phased chromosomal and plasmid contigs, which were manually closed by resolving and trimming overlapping repeats at the contig ends. Illumina reads for the respective isolate were then mapped to the resulting closed PacBio assemblies using bwa-MEM (version 0.7.9a-r786, default settings) 53 . Read pileups were visualized in Geneious 54 ; mismatches between the sequence derived from mapping and the reference PacBio assemblies were inspected manually to identify the correct structure. Finally, to capture small plasmids that may have been filtered out due to size-selection of DNA fragments >7 kbp prior to PacBio sequencing, unmapped Illumina reads (extracted using the SAMtools view command, with the -f 12 flag) derived from this process were de novo assembled using the A5-MiSeq pipeline 20140604 51 ; any assembled contigs were manually closed by assessing for and trimming overlapping repeats (100% match over a length of ≥100 bp; no match to any other contig in the assembly). The final consensus chromosomal and plasmid sequences derived from these processes were used in analyses and submitted to GenBank.
Automated and manual annotation of de novo assemblies. All plasmid structures and de novo assemblies were annotated using PROKKA 55 , with subsequent manual refinement of annotations for regions of interest using BLASTn 56 and the NCBI bacterial and ISFinder databases 57 . Alignments of sequence structures were visualized and modified in Geneious. bla KPC -containing contigs identified in the de novo assemblies derived from Illumina data were manually inspected for overlapping repeats as above; if these were present, these contigs were considered additional putative KPC plasmids and trimmed and closed as above.
Core/accessory genome comparisons. These were undertaken using the pangenome pipeline, ROARY 27 , by inputting the *.gff files generated from the PROKKA annotation of each of the Illumina de novo assemblies (default settings). Comparisons were made separately for all isolates and the ST131 subset. The output gene_pres-ence_absence.csv files were processed using the pheatmap function in R. Resistance genes were identified using ResistType, a command-line tool developed in-house that identifies the presence of reference loci in WGS data using both BLASTn against de novo assemblies and mapping-based approaches, and estimates copy number by comparing contig coverage at any given locus with median contig coverage [scripts, reference resistance gene database and manual available at: https://github.com/hangphan/resistType]. These features were plotted on the maximum likelihood phylogenies using the Ape package in R.
Comparisons with publicly available KPC plasmid sequences. All complete KPC RefSeq plasmid sequences available in GenBank in May 2015 were identified using the search terms "plasmid" + "KPC" + "complete sequence". The resulting list was filtered manually to exclude any additional sequences present that were not complete plasmid sequences. In total 63 plasmid sequences were included (Table S5).
For the IncN plasmid comparisons, we included the following from our dataset: (i) six cases where PacBio sequencing had fully resolved the bla KPC IncN plasmid (ecol_224, ecol_422, ecol_517, ecol_656, ecol_881, ecol_ AZ159); (ii) two cases where the IncN rep and bla KPC were co-located on the same, incomplete contig (ecol_516, ecol_736); and (iii) three cases where bla KPC was present in isolates containing an IncN rep and on contigs that showed high similarity to the IncN plasmid backbones under scrutiny (ecol_744, ecol_AZ150, ecol_AZ151).