Horizontally acquired papGII-containing pathogenicity islands underlie the emergence of invasive uropathogenic Escherichia coli lineages

Escherichia coli is the leading cause of urinary tract infection, one of the most common bacterial infections in humans. Despite this, a genomic perspective is lacking regarding the phylogenetic distribution of isolates associated with different clinical syndromes. Here, we present a large-scale phylogenomic analysis of a spatiotemporally and clinically diverse set of 907 E. coli isolates, including 722 uropathogenic E. coli (UPEC) isolates. A genome-wide association approach identifies the (P-fimbriae-encoding) papGII locus as the key feature distinguishing invasive UPEC, defined as isolates associated with severe UTI, i.e., kidney infection (pyelonephritis) or urinary-source bacteremia, from non-invasive UPEC, defined as isolates associated with asymptomatic bacteriuria or bladder infection (cystitis). Within the E. coli population, distinct invasive UPEC lineages emerged through repeated horizontal acquisition of diverse papGII-containing pathogenicity islands. Our findings elucidate the molecular determinants of severe UTI and have implications for the early detection of this pathogen.


Supplementary Note 1: Inclusion criteria for isolate collections
The main dataset of this study (385 invasive UPEC isolates, 337 non-invasive UPEC isolates, and 185 fecal isolates giving a total of 907 isolates) comprised genomic data of 14 UPEC isolate collections, 2 fecal isolate collections, and reference strains. Criteria for the inclusion of UPEC isolate collections were the availability of associated metadata on clinical syndromes and/or medical diagnosis, i.e., asymptomatic bacteriuria (ABU), cystitis, pyelonephritis, urinary-source bacteremia, or urosepsis. In addition to isolates sequenced as part of this study, public genomic data of 10 UPEC isolate collections were identified and included.
Asymptomatic bacteriuria (ABU) describes the bacterial colonization of the urinary tract without causing inflammatory responses. The diagnosis of ABU is based on one (in men) or two consecutive (in women) positive urine cultures (typically >100,000 CFU/ml) from individuals without signs or symptoms of urinary tract infection 1 . Cystitis is the inflammation of the bladder or lower urinary tract caused by bacterial infection. The diagnosis of cystitis is based on a positive urine culture (typically >1,000 CFU/ml) from a patient presenting with dysuria, urgency, frequency, pyuria, and/or suprapubic pain. Pyelonephritis is the inflammation of the kidneys or upper urinary tract caused by bacterial infection. The diagnosis of pyelonephritis is based on a positive urine culture (typically >1,000 CFU/ml) from patients presenting with flank pain and/or costovertebral angle tenderness, fever (>38.0 °C), and pyuria, with or without typical symptoms of lower urinary tract infection. Around 30% to 60% of all patients with pyelonephritis develop bacteremia 2-4 : such urinary-source bacteremia is indicated by concurrent positive urine and blood cultures with isolates of the same bacterial species.
To provide phylogenomic context to our analysis, genomic data of two fecal isolate collections (MN_fec, KTE_fec) were included. These originated from healthy volunteers, patients without acute infection, or patients with cystitis (Supplementary Table 1). Fecal isolates from patients with cystitis corresponding to the urinary clone of the same patient (based on whole-genome sequence analyses) were excluded. All fecal isolates were originally used as control strains for genetic comparisons with UPEC isolates. Because the human microbiota presumably presents a reservoir of UPEC, fecal isolates were not considered in genome-wide association studies.

Supplementary Note 2: Distribution of clinical phenotypes and papGII+ isolates among phylogroups.
Both UPEC and fecal isolates were predominantly assigned to phylogroup B2 (64% and 41%, respectively; Supplementary Table 2). A majority of the remaining UPEC isolates were part of the D group (15%), whereas most remaining fecal isolates were part of phylogroups A (22%), B1 (15%), or D (15%). Phylogroup C consisted exclusively of UPEC isolates. Clustering of fecal isolates in sublineages of phylogroups A and B1 hinted toward the existence of a truly commensal E. coli type not able to colonize the urinary tract (Fig. 1). Notably, the phylogeny of groups A and B1 presented a deep branching population structure. In contrast, most isolates in phylogroups B2 and D grouped into few lineages of highly related isolates, in line with a previously reported expansion of a few successful pandemic lineages 5 . Invasive UPEC isolates were more commonly associated with phylogroups B2 and D (86%) than non-invasive UPEC isolates (71%). Almost all (96%) papGII+ isolates were part of phylogroups B2, D, or F.

Supplementary Note 3: Distribution of iron uptake systems
Seven of the 22 investigated iron uptake systems were significantly more prevalent in invasive compared to non-invasive UPEC isolates (Supplementary Table 10a). In addition to the iuc locus, which reached pan-genome wide significance, these included sit, chu, cjr, ireA, iha, and fyu/ybt. Among isolates of phylogroup B2, iuc, iha, and ireA were significantly more prevalent in invasive than in noninvasive UPEC isolates (Supplementary Table 10b). The yersiniabactin gene cluster fyu/ybt, encoded by the high-pathogenicity island HPI, and the heme uptake system chu, which are both considered to play a major role in extra-intestinal virulence 6,7 , were present in >90% of all isolates in phylogroups B2, C, F or phylogroups B2, D, F, G respectively (Supplementary Table 10c).

. Geographical origin of E. coli collections included in the main dataset of this study.
Clinical phenotypes and the number of isolates in each collection are illustrated by different colors and scales of the circles. The isolation time span of each collection is annotated. Asterisks indicate multicenter studies. The map was created in R using the maps package 8 . ABU: asymptomatic bacteriuria, UPEC: uropathogenic E. coli. These include collections of E. coli isolates associated with pyelonephritis (MC_pye) and urinarysource bacteremia (HVH_urb, UHS_urb, BUTI_uro, UZA_uro). The number of isolates and isolation time span of each collection is annotated. (a) Proportion of the dominant clonal complexes (CC). The eleven dominant CCs accounted for 79% to 87% of isolates in each of the five collections. More recent collections consisted of a higher proportion of CC131 and CC69 isolates. (b) Proportion of papGII+ lineages. The proportion of papGII+ isolates that were not part of the 14 identified papGII+ lineages as well as the proportion of afaVIII-positive isolates are also indicated. One isolate (U6_BUTI_Uro, CC23, collection BUTI_uro, not part of a papGII+ lineage) carried both papGII and afaVIII and was assigned to "other papGII+ isolates". Source data are provided in Supplementary Data 1. (a) Comparison of the region surrounding the papGII-containing PAI of isolate 6-Pyelo (CC144) with the chromosomal backbone of isolate 75-Pyelo (CC144), which lacked insertions at the ula site. PAI6-Pyelo-ula (Type I papGII+ PAI) was integrated between the genes ulaE and ulaD. The ulaRGABCDEF operon (yellow) encodes an L-ascorbate phosphotransferase transport system. (b) Comparison of the region surrounding the papGII-containing PAI of isolate US12 (CC14) with the chromosomal backbone of isolate US31 (CC14), which lacked insertions at the gln site. PAIUS12-gln (Type IV papGII+ PAI) was integrated between glnP and glnH of the glnHPQ operon (purple), which encodes an ABC glutamine transport system. Integrase genes (int) and the pap operon are marked. The gradient scale shows the level of nucleotide identity. Sequence comparisons were performed using EasyFig 10 . Supplementary Fig. 6. Midpoint-rooted maximum-likelihood tree for intS and intB integrase genes identified on 42 resolved papGII-containing pathogenicity islands (PAIs).
Integrase genes were aligned using Muscle and the phylogeny reconstructed in Mega-X using a GTR model. DNA sequences of the integrase genes clustered according to their integration site. Integrase genes identified at the tRNA-pheU and tRNA-pheV sites fell in the same cluster, often sharing an identical sequence. An exception was the integrase gene found in PAIUS29-pheV. Supplementary Fig. 7. Tanglegram depicting the core genome phylogeny of 35 representative papGII+ isolates and the similarity of 42 papGII+ pathogenicity islands (PAIs) identified in their genomes.
Left: midpoint-rooted maximum-likelihood phylogeny of 35 high-quality genome assemblies from papGII+ isolates based on a separate 3.412 Mbp core genome alignment (reference genome CFT073, GCA_000007445.1). Clonal complexes (CCs) are annotated. The scale bar indicates the number of substitutions per site. Right: sequence similarity of 42 resolved papGII+ PAIs identified in the 35 assemblies based on pairwise mash distance. Six isolates (V7_BUTI_Uro, 19-Pyelo, CFT073, 6-Pyelo, 75-Pyelo, 67-Pyelo) contained more than one papGII+ PAI. papGII+ PAI types are annotated. Dotted lines connect PAIs with the position of the associated isolate in the phylogeny. The tanglegram was created using the R package phylools 11 .  Isolates of different clonal complexes carrying resolved PAIs of the same papGII+ PAI type are connected with lines colored according to the papGII+ PAI type ( Supplementary Fig. 8b). Line width and style indicate the alignment coverage of the most similar (based on alignment coverage) papGII+ PAIs in isolates of respective clonal complexes. High similarities indicate potential horizontal transfer events of papGII+ PAIs. Isolates in papGII+ lineages (red shades) and predominant clonal complexes (inner ring) are shown. The tree was visualized using iTOL 12 . (a) Genetic organization of afaVIII operons. afaVIII operon assemblies were resolved in 17 out of 18 afaVIII+ isolates. In 16/17 isolates, the operon was complete; in one isolate (RT_ABU_54), the operon was disrupted by an IS element inserted between afaDVIII and afaEVIII. Percentages indicate the amino acid sequence identities of each predicted gene product from the 16 resolved and complete afaVIII operon assemblies. AfaAVIII, AfaBVIII, AfaCVIII, and AfaDVIII protein sequences were conserved, whereas AfaEVIII occurred in different variants. (b) Multiple sequence alignment of 16 AfaEVIII predicted protein sequences from complete and resolved afaEVIII operon assemblies. Three sequences corresponded to an AfaVIII-variant previously described as BmaE (M-agglutinin) 13 . Sequence types (ST) and clonal complexes (CC) are annotated. The alignment was performed using Muscle 9 and visualized in JalView 14 .
The pathogenicity-resistance hybrid island PAIUS26-argW of isolate US26 (ST88, CC23) was inserted at the tRNA-argW site. The island contained the operon afaVIII (colored in red) as well as multiple antibiotic resistance genes and the mercury resistance operon merRTPCAD. The tRNA genes, integrase genes, and the virulence-associated gene agn43 encoding Antigen 43 are labeled. The gradient scale shows the level of nucleotide identity. Sequence comparisons were performed using EasyFig 10 .

Supplementary Fig. 14. Distribution of putative urovirulence factors by clonal complex (CC) and clinical phenotype.
Heatmap showing the prevalence of selected VAGs described in previous studies 7,15-17 as putative urovirulence factors among common CCs (>10 isolates) according to the scale bar. The score indicates the average number of putative urovirulence factors per isolate. On a genome-wide level, only papGII and iuc (shown at the bottom) were significantly associated with invasive UPEC (Fig. 2).
(a) Genetic organization and sequence comparison of three iuc loci variants identified in isolates CFT073, IAI39 and US06. Numbers indicate amino acid sequence identities. Genes with >98% amino acid sequence identities are labeled in the same color. The three variants were found to be highly conserved in the investigated genomic dataset of 907 isolates, with almost all (97%) identified iuc loci sharing >99% nucleotide sequence coverage and >99% nucleotide sequence identity to either of the three variants. BLAST hits with sequence coverage of <99% were often due to incomplete assemblies. (b) Clustering of resolved mobile genetic elements (MGEs) containing the iuc locus as identified in complete or near-complete genome assemblies by genetic similarity. MGEs were hierarchically clustered based on pairwise mash distance. The identified iuc variant and the presence of papG variants on the MGEs are indicated. shiFp/iuc/iutA2 was identified on plasmids, while shiF/iuc/iutA1 and shiF/iuc/iutA2 were identified on pathogenicity islands (PAIs). PAIs with the papGIV locus only contained remnants of the papGIV operon. Four islands (PAIUS12-pheV-2, PAIUS29-pheV-2, PAIUS31-pheV-2, PAIVR50-pheV-2) were integrated together with a second PAI at the respective integration site. (c) Comparison of selected iuc-containing PAIs. The integrase gene, iuc locus, and selected virulenceassociated genes on the islands are labeled. The gradient scale shows the level of nucleotide identity. Sequence comparisons were performed using EasyFig 10 .
Maximum-likelihood phylogenetic tree of 333 papGII+ isolates and isolate 495_PUTI_Fec (clade I) as outgroup based on 192,889 variable sites identified in a core genome alignment (2.573 Mbp). Clinical phenotypes of isolates are indicated at branch tips. Dominant clonal complexes (CC) and corresponding sequence types (STs), papGII-containing pathogenicity island (papGII+ PAI) insertions sites, and papGII+ PAI types are annotated. Some isolates carried more than one papGII+ PAI. papGII+ PAI type and insertion site could not be identified for all isolates (no symbol). The papGII+ PAI types were either identified in high-quality assemblies or predicted by mapping reads to resolved islands. The papGII+ PAI integration sites could not be detected in highly fragmented assemblies. The scale bar indicates the number of substitutions per site. Nodes with bootstrap values <70 are labeled in red. The branch length of the outgroup was reduced (dashed line). The tree was visualized using iTOL 12 . ABU: asymptomatic bacteriuria Supplementary Fig. 17. Synteny plots of whole genomes and genomic subregions of isolates US16 and US32 depicting a large chromosomal inversion between two pap sites.
US16 and US32 are part of papGII+ lineage CC73-L1. Isolates of this sublineage carry complete papGII and papGIII operons in inversed orientation on distinct pathogenicity islands (PAIs) integrated at the tRNA-pheU and tRNA-leuX integration sites, respectively. The depicted inversion of a 167 kb fragment between the two pap operons included the tRNA-pheU and tRNA-leuX genes, respective PAI integrase genes and papG alleles, and suggested a recombination event with the pap loci acting as inverted repeats. The papGIIIUS16-associated papIBAHCDJKE genes were duplicated into both pap loci of US32. The papGII gene in US32 was hence associated with a different papA allele than in US16 and was integrated into a different PAI. The gene papA encodes the immunogenic major fimbrial subunit PapA. Isolate 194-Pyelo of the same sublineage carried a PAI (PAI194-pheU) that shared all genes with PAIUS32-pheU but contained a papF allele identical to the papGII locus in US16, suggesting an independent recombination event. PAIs are labeled (blue bars) and papGII-associated genes (red) and papGIII-associated genes (green) colored according to their association in US16. Chromosomal positions of the subregions are labeled. The gradient scale shows the level of nucleotide (BLASTn) or amino acid (tBLASTx) identity. Sequence comparisons were performed using EasyFig 10 .