An Escherichia coli ST131 pangenome atlas reveals population structure and evolution across 4,071 isolates

Escherichia coli ST131 is a major cause of infection with extensive antimicrobial resistance (AMR) facilitated by widespread beta-lactam antibiotic use. This drug pressure has driven extended-spectrum beta-lactamase (ESBL) gene acquisition and evolution in pathogens, so a clearer resolution of ST131’s origin, adaptation and spread is essential. E. coli ST131’s ESBL genes are typically embedded in mobile genetic elements (MGEs) that aid transfer to new plasmid or chromosomal locations, which are mobilised further by plasmid conjugation and recombination, resulting in a flexible ESBL, MGE and plasmid composition with a conserved core genome. We used population genomics to trace the evolution of AMR in ST131 more precisely by extracting all available high-quality Illumina HiSeq read libraries to investigate 4,071 globally-sourced genomes, the largest ST131 collection examined so far. We applied rigorous quality-control, genome de novo assembly and ESBL gene screening to resolve ST131’s population structure across three genetically distinct Clades (A, B, C) and abundant subclades from the dominant Clade C. We reconstructed their evolutionary relationships across the core and accessory genomes using published reference genomes, long read assemblies and k-mer-based methods to contextualise pangenome diversity. The three main C subclades have co-circulated globally at relatively stable frequencies over time, suggesting attaining an equilibrium after their origin and initial rapid spread. This contrasted with their ESBL genes, which had stronger patterns across time, geography and subclade, and were located at distinct locations across the chromosomes and plasmids between isolates. Within the three C subclades, the core and accessory genome diversity levels were not correlated due to plasmid and MGE activity, unlike patterns between the three main clades, A, B and C. This population genomic study highlights the dynamic nature of the accessory genomes in ST131, suggesting that surveillance should anticipate genetically variable outbreaks with broader antibiotic resistance levels. Our findings emphasise the potential of evolutionary pangenomics to improve our understanding of AMR gene transfer, adaptation and transmission to discover accessory genome changes linked to novel subtypes.

which is only possible with genome sequencing to allow profiling of all AMR genes 18,19 . Recent work applied core genome MLST (cgMLST) of 2,512 genes, but computational limitations meant examining 288 ST131 genomes where only a single specimen per rST was examined across 1,230,995 SNPs from a 2.33 Mb core genome, within a larger set of 9,479 diverse E. coli 20 . Given that rST1503 alone may account for ~81% of ST131 and that outbreaks may comprise a single rST 21 , our understanding of E. coli ST131 transmission dynamics and diversity within single STs may limit inferences of past, present and emerging MDR outbreaks.
A deeper investigation of MDR ST131's population structure, selective processes and ESBL gene evolution can illuminate its mechanisms of AMR, host colonisation and pathogenicity 10,14 . Exploring the evolutionary origins, transmission and spread of outbreaks requires extensive sampling to link variation at AMR genes with inferred adaptive and epidemiological patterns 22 , and previous work suggests a high-resolution large-scale approach to bacterial epidemiological based on genomic data address these questions 23 .
Deducing evolutionary relationships based on the core genome permits the discovery of novel accessory genome events 24 . ST131 evolution has been punctuated by plasmid conjugation, plasmid recombination and mobile genetic element (MGE) rearrangements, particularly of the cefotaximase (CTX-M) class of ESBL genes, bla CTX-M-14/15/ 27 25-27 that allow third-generation cephalosporin-resistance 28 . These bla CTX-M gene changes correlate strongly with ST131 subclade differentiation, such that the most common one (C2) is typically bla CTX-M-15 -positive 29 . ESBL and other virulence factor genes likely drive extraintestinal niche colonisation but vary across environments depending on MGE-driven mobility 10,15,29,30 . When coupled with host immunity, this environmental niche effect results in negative frequency-dependent selection (NFDS) in the ST131 accessory genome, leading to a variable AMR gene repertoire 31 that has not yet to be explored within ST131's subclades. In addition, applying an evolutionary pangenomic approach with core and accessory genome variation within subclades may inform on the origin of new genetic ST131 lineages.
Here, we aggregated all available ST131 Illumina HiSeq read libraries, and automated quality-control, genome de novo assembly, DNA read mapping and ESBL gene screening in the largest ST131 set examined thus far to reconstruct a core genome phylogeny and evaluated the epidemiology of clades and subclades. We established that the two most common C subclades (C1 and C2) co-circulated globally and that their ESBL gene composition was flexible. We hypothesise that the diversity of accessory genomes in isolates with near-identical core genomes due to ST131's ability to retain newly acquired genes may be driven by environmental pressures.

Results
Collation, screening and generation of 4,071 high quality draft ST131 genome assemblies. We collated accession IDs and linked metadata for 4,071 high quality de novo genome assemblies whose DNA was isolated in 1967-2018 from diverse sources across 170 BioProjects (Supplementary Table S1) following thorough filtering steps (Fig. 1 Table S2). The sole PacBio assembly (AR_0058) had five contigs with a N50 of 4,923,470 bp and was 5,132,452 bp long.
We assembled the pangenome of 4,071 ST131 isolates using NCTC13441 as a reference with Roary v3.11.2 32 resulting in 26,479 genes, most of which were rare ( Supplementary Fig. S1). Bla CTX-M-15 -positive NCTC13441 was isolated in the UK in 2003, and was from subclade C2. 3,712 genes present in all isolates formed the core genome, and of 22,525 CDSs in the accessory one, 1,018 were shell genes in 15-95% of isolates and 21,507 (81% of the total) cloud genes in less than 15% of isolates ( Supplementary Fig. S2). Cloud gene rates were a function of sample size, which explained most (r 2 = 0.846, p = 0.00012) of cloud gene number variation, but not that of core (r 2 = 0.162), soft core (r 2 = 0.258) nor shell (r 2 = 0.001) genes.
Epidemic subclades C1 and C2 co-circulate globally but with stable frequencies. NFDS in the accessory genome driven by AMR gene acquisition, ecological niche colonisation ability and host antigen recognition has stabilised the relative frequencies of ST131 and its clades over time relative to other STs 29,31 . Here, this pattern could be present for the clades A, B, C and three main C subclades, C1_6, C2_4 and C2_9 spanning 2002-2017 (1,596 out of 1,614 isolates that had year of isolation data) if their relative rates stabilised after emergence ( Supplementary Fig. S4).
Of the 1,724 isolates with geographic information, 819 were from Europe, 499 North America, 294 Asia, 80 Oceania, 20 South America, and 12 Africa (Fig. 3) -the remaining 2,347 isolates (58%) had no geographic data (Supplementary Table S3). C1_6 was more common in North America (OR = 1.57, 95% CI 1.25-1.96, p = 0.0004) and rarer in Europe (OR = 0.67, 95% CI 0.53-0.81, p = 0.0004). C2_4 was more frequent in Asia (OR = 1.75, 95% CI 1.18-2.56, p = 0.019) and less so in North America (OR = 0.61, 95% CI 0.40-0.91, p = 0.042). Consequently, there was limited global population structure separating C1 and C2 isolates, suggesting that they have co-circulated globally for some time. Of those assessed, four were long read libraries (PacBio) and the rest were short paired-end reads (Illumina HiSeq). The adapters of the four PacBio and 4,147 Illumina reads were trimmed using Cutadapt and Fastp, respectively. The resulting adapter-free reads were assembled using Unicycler. An iterative genome quality check eliminated three PacBio and 77 Illumina libraries, yielding 4,071 as the final collection. Cleaned reads after Quast filtering were examined with Roary using Prokka annotation to evaluate the pangenomic diversity. Phylogenetic construction was performed by RAxML on the core genome. The assembled genomes were annotated and screened for AMR genes (including bla CTX-M-14/15/27 ) and their context. Genetically distinct clusters from the phylogeny were determined using Fastbaps. Distances between the core and accessory genomes of isolate pairs were estimated using Poppunk based on k-mer differences.   www.nature.com/scientificreports www.nature.com/scientificreports/ Based on common ancestry with Clade B, the origin of Clade C was in North America because Clade B isolates from 1967-1997 were solely isolated in the USA until one isolate in Spain in 1998. This fitted previous work timing the origin of C to 1985, fimH30 to 1986, and the FQ-R C1/C2 ancestor to 1991 21 (or 1986 29 ), consistent with a North American source. However, the earliest C representative here was from Norway in 1999 and was a bla CTX-M -negative bla TEM-1B -positive FQ-R genome from C2_9 (ERR1912633 34 ).
The origin of C2_4 was unclear: although the earliest isolate was in 2008 from the USA, the most basal branches within C2_4 had isolates spanning a range of continents, and C2_4's long ancestral branch implied that it originated prior to 2008 ( Supplementary Fig. S5). The isolates most closely related to C2_4 were a group of 11 assigned to C2_10 that had limited country and year of isolation data bar one from the UK in 2011, one from the USA in 2009, one from the Netherlands ( Supplementary Fig. S3). The next most closely related group were nine C2_9 isolates that had no geographic data bar two from Australia in 2017.  Fig. S6) such that the rates were highest in C2_4 (93.8%) followed by C0 (90%), C2_9 (82.6%) and C1_6 (57%) (Supplementary Fig. S7). The earliest bla CTX-M -positive Clade C genome was from Canada in 2000 (ERR161284, C2_9 12 ,). 88% (339 of 386) of C2_4 isolates and 81% (1,338 of 1,651) of C2_9 isolates were bla CTX-M-15 -positive with limited geographic or temporal structure (Fig. 4). This reiterated that the C2 ancestor was bla CTX-M-15 -positive whose gains of other bla CTX-M genes were likely rare local events.
C1_6 had a different bla CTX-M gene rates to Clade C: bla CTX-M-27 (38%) was more common than bla CTX-M-14 (14%) or bla CTX-M-15 (6%) ( Table 2)     Previous work showed that C1 subclade C1-M27 with the prophage-like genomic island M27PP1 was a key driver of MDR ST131 36 . Here, 468 (42%) of C1_6 isolates had M27PP1, and 97% of these were bla CTX-M-27 -positive (397 of 410 with bla CTX gene data) as expected (Supplementary Table S3) 36 . These formed a single clade (Fig. 3) that has spread globally since the earliest date for a C1-M27 isolate here in 2004 in Japan (DRR050997 36 ,). 74 of these C1-M27 isolates subsequently acquired the related prophage-like genomic island M27PP2, most likely as a single ancestral event given their monophyly here. Five subclade C2 isolates and one from Clade A had M27PP1: all three of the C2_9 isolates with bla CTX gene data were bla CTX-M-27 -positive, and the earliest of these was from Japan in 2010 (DRR051016) 36 .  Table S4). 90 genomes from C2_9 had a bla CTX-M-15 gene in a transposition unit (TU) flanked by a 1,658 bp 5′ ISEcp1 and 3′ orf477 as a 2,971 bp ISEcp1-bla CTX-M-15 -orf477Δ TU followed by a 3′ 5.8 Kb Tn2 (Supplementary Fig. S8, Supplementary Table S5), verified previously using long reads 26 . Normally this TU is on an IncF plasmid 35 , but selected C2_9 assemblies had a chromosomal insertion of this TU at the mppA gene ( Supplementary Fig. S9) as shown previously 21 . Here, 13 additional genomes with source information had this insertion, indicating that C2_9 had spread to Thailand (as in 14 ), Singapore and the Democratic Republic of Congo by 2014 37 , consistent with a global spread. One C2_4 isolate from Pakistan in 2012 also had this TU inserted at mppA (SRR1610051 38 ), affirming that insertions at mppA recur due to local homology to ISEcp1's 3′ inverted repeat 39,40 . Inter-clade but not intra-clade accessory genome divergence. Accessory genome composition varies across genetically distinct groups due to ecological niche specialisation 15 : this was supported here at the clade level across the 4,071 isolates by a positive correlation of pairwise core and accessory genome distances measured with Poppunk 23 (Fig. 5). This matched work on an E. coli dataset with n = 218 ST131 29 , and was evident in a www.nature.com/scientificreports www.nature.com/scientificreports/ higher shell gene number in Clade B compared to Clades A and C (Table 3), suggesting higher diversity in Clade B (Supplementary Fig. S10).

Diverse genomic locations of the bla
Previous work on E. coli 41 employed a metric of pangenome openness (alpha) that was similarly applied to our Roary pangenome results here to compare with previous findings that n = 648 Clade C isolates had a marginally more open genome (smaller alpha) than n = 140 from Clade B, and both were more open than n = 70 Clade A isolates 31 . Although our initial results revealed more open pangenome (alpha) in Clade B (0.762) than Clade A (0.807) or Clade C (0.806) or the whole collection (0.823), alpha was higher for small (<250) sample sizes here (Fig. 6) as indicated before 41 . Alpha estimates averaged across the sample size placed Clade C as more open than Clade A or Clade B (Table 3), highlighting a partial dependence of alpha on sample size that was removed once the sample sizes >250 when the relative rate of new genes became constant (Supplementary Fig. S11). The average alpha for 250 ≤ n ≤ 386 showed Clade C (0.716) was more open than Clade B (0.753) than Clade A (0.795) or the whole collection (0.771) (Supplementary Fig. S12) The upper limit of n = 386 was corresponded to the smallest group size, which was for C2_4.
Within the C subclades, the pairwise core and accessory genome distances were not correlated: the accessory genomes varied extensively even with nearly identical core genomes (Fig. 5). Subclade C2_4 had a more open pangenome (0.696) than subclade C1_6 (0.755) or subclade C2_9 (0.822), which was evident when adjusting for the differing sample sizes (Fig. 6), though the average alpha placed C2_4 (0.754) and C1_6 (0.749) as about equally more open than C2_9 (0.799, Table 3). As above, the relative alpha levels were retained when alpha was   and was more independent from sample number once the number of genomes examined about >250. Note that the x-axis' log 10 scale.
averaged from >250 isolates up to the smallest (subclade C2_4) sample size with C2_4 (0.705) and C1_6 (0.731) as more open than C2_9 (0.742) (Supplementary Fig. S12). Given that the high accessory genome diversity within subclades independent of core genome composition, the observed shell gene numbers were compared to expected values adjusted for sample number and gene frequency category change to investigate shell gene overlap across clades and subclades (see Methods). Pooled groups with divergent accessory genomes should have more net shell genes, whereas similar accessory genomes should have fewer shell genes. Clades B and C together had 6% less shell genes, whereas Clade A had an excess of 1% when combined with Clade C and 6% with Clade B (Supplementary Table S6). Within subclades, there was a small shell gene excess for C1_6 combined with C2_9 (3%), but C2_4′s shell gene composition differed from both C1_6 (22% excess) and C2_9 (23% excess, Supplementary Table S6). The same trend was observed for C2_4 combined with Clade A (41% excess) or Clade B (5%) in contrast to C1_6 (16% with Clade A, -8% with Clade B) and C2_9 (16% with Clade A, -11% with Clade B), indicative of more unique shell genes in subclade C2_4.

Discussion
By collating all available ST131 genomes to produce 4,071 high-quality draft assemblies, we reconstructed their phylogenetic relationships to show that ST131 was dominated by subclades C1 and C2. For isolates with bla CTX-M gene data, subclade C2 was 98% bla CTX-M-15 -positive in contrast to C1 that had either bla CTX-M-27 (66%) or bla CTX-M-14 (24%) genes. Although the subclade C1 ancestor may have been bla CTX-M-14 -positive, bla CTX-M-27 's increasing levels in C1 and its higher ceftazidime resistance due to a D240G substitution also in bla CTX-M-15 42 indicated it will become more common. Although the subclades had different origins and ancestral ESBL gene compositions, both have become global with relatively consistent frequencies and minor differences in rates due to differing evolutionary patterns after emerging 14 . This worldwide co-circulation suggested newer lineages could become globally disseminated, with implications for infection control if they have altered host adhesion abilities (like fimH30 43 ) or AMR variants (like FQ-R or bla CTX-M-15 ). This was highlighted by the emergence of C2_4, the C2_9 subgroup with an ISEcp1-bla CTX-M-15 -orf477-Tn2 TU mppA chromosomal insertion, and many other contemporary examples such as bla OXA-48 -producing ST131 44 . Within C1, the bla CTX-M-27 -positive C1-M27 lineage emerged in this study as an increasingly common cause of infection globally. Tracking plasmid, MGEs and ESBL genes must be a key component of disease monitoring to consider potential future bacterial outbreaks' spectrum of AMR.
Horizontal DNA transfer allows E. coli adapt to new ecological niches and contributes to its dynamic accessory genome 45 where the cloud gene number increases with isolate number and diversity. Our analysis of this large collection's core (3,712) and accessory (22,525) genes extended previous work showing that 283 predominantly ST131 isolates had 16,236 genes in an open pangenome with a core of 3,079 genes 46 , 21% less than the core genome count here. Nonetheless, ST131's accessory genome may be streamlined: a more genetically diverse set of 1,509 E. coli including 266 ST131 had a core genome of 1,744 genes and a 62,753 cloud genes 29 , and an E. coli-Shigella core genome had 2,608 genes among a total of 128,193 genes 41 .
NFDS posits that genes associated with adaptation to new hosts, antibiotics and competitors using the same resources remain at intermediate frequencies 31 . Pangenome openness and shell gene sharing across clades supported inter-clade structure resulting from ecological specialisation 47 , with Clade A more different to Clades B and C. Within subclades, isolates with minimal core genome differences could have divergent accessory genomes, implying that plasmid, ESBL gene and MGE changes may be detected better using pangenomic approaches than assessing the core alone 48 .
Global coordination of data processing and bioinformatic interpretation can help identify, trace and control disease outbreaks 49 , for which resolving recent transmissions may be limited by sampling 50 . Expanding numbers of non-human isolates and the diversity of geographic regions sampled would help clarify potential sources of E. coli's ESBL genes, for which there was no evidence of retail meat 51 or livestock 52 as reservoirs for blood stream infections thus far, though transfer of bacteria has occurred 53 . Better epidemiological information coupled with genome-sequencing 54 could allow inference of adaptations across lineages 55 , such as bla CTX-M-15 -positive ST8313, a putative descendant of ST131 subclade C2 5 .

Methods
Study selection and data extraction. Data on 4,870 E. coli ST131 genomes and linked metadata was collected using an automated text-mining algorithm using a Python implementation of Selenium (Selenium-python. readthedocs.io) from Enterobase (https://enterobase.warwick.ac.uk) 56 on the 10 th of September 2018 as previously described 57 . This was used to download read libraries the European Nucleotide Archive (ENA) (www.ebi. ac.uk/ena) 58 and NCBI Short Read Archive (SRA) databases as FASTQ files, restricted to complete libraries not labelled as "traces" (Fig. 1). Of the initial 4,870 read libraries, 4,264 were paired-end (PE) Illumina HiSeq ones and four were PacBio, in addition to the PacBio-sequenced NCTC13441 genome used as a reference in this study. 495 libraries predominantly from Illumina MiSeq platforms were not examined to avoid platform-specific artefacts.
Illumina HiSeq read data quality control, trimming and correction. Of the above 4,264 PE Illumina HiSeq read libraries, 4,147 passed stringent quality control. This was implemented using Fastp v0.12.3 59 to trim sequencing adapters, remove reads with low base quality scores (phred score <30) or ambiguous (N) bases, correct mismatched base pairs in overlapped regions and cut poly-G tracts at 3′ ends (Supplementary Table S2). Individual bases in reads were corrected by BayesHammer in SPAdes v3.11 60 . Quality control metrics were examined at each step: across the whole collection as a batch report using MultiQC v1.4 61  www.nature.com/scientificreports www.nature.com/scientificreports/ Illumina HiSeq read library genome assembly. The 4,147 Illumina HiSeq libraries passing quality control were de novo assembled using Unicycler v4.6 in bold mode to merge contigs where possible 62 . This used SPAdes v3.12 63 to generate an initial assembly polished by Pilon v1.22 64 , which ran iteratively until no further corrections were required by the optimal assembly. This approach was similar to Enterobase's 56 , though Enterobase used BBMap in BBTools 65 , SPAdes v3.10 and BWA 66 during assembly 20 .
Reference PacBio genome quality control and assembly. The ST131 reference genome NCTC13441 was isolated in the UK in 2003 and was in subclade C2 67 . It had a 5,174,631 bp chromosome with 4,983 protein-coding genes and one pEK499-like type IncFIA/FIIA plasmid with two bla CTX-M-15 gene copies (accession ERS530440). Although four further PacBio read libraries were initially included to test genome assembly contiguity and ESBL gene context using longer read libraries, only one passed assembly annotation screening (AR_0058, accession SRR5749732 38 ). Its adapters were removed using Cutadapt v1.18 68 followed by excluding duplicate reads with Unicycler v0.4.6. Base correction was implemented during genome assembly with Unicycler via SPAdes v3.12, and the genome assembly was iteratively polished by Racon v1.3.1 until no further corrections were required 69 . This 5,132,452 bp assembly had five contigs and 5,506 genes was assigned to C1 and had no ISEcp1.
Quality checking and annotation identifies 4,071 genome assemblies for investigation. For the 4,147 Illumina HiSeq assemblies and single PacBio assembly, quality was verified with Quast v5.0 70 based on the N50, numbers of predicted genes and open-reading frames, and numbers of contigs with mis-assemblies. The quality of the short read de novo assemblies was comparable to previous work whose requirements required assembly length in the range 3.7-6.4 Mb with <800 contigs and <5% low-quality sites 20 . Initial annotation of 4,147 Illumina HiSeq assemblies using Prokka v1.10 71 suggested 77 assemblies had a distinct gene composition and should be excluded because they were either genetically divergent, did not assemble adequately, or had sub-standard read libraries. As a result, 4,070 Illumina HiSeq genome assemblies were selected (Supplementary Table S2) and aligned against the reference genome NCTC13441 and PacBio assembly AR_0058 (Supplementary  Table S3). These assemblies had lengths in the range 4.3-6.1 Mb with a mean and standard deviation of 5,137 ± 121 Kb. This identified 4,829 genes on average per assembly (range 3,942 to 5,749, Supplementary  Fig. S1). The variation in numbers of genes per assembly was largely explained by the total assembly length (r 2 = 0.959). 53% of the 4,071 had no source data and 12% of the remainder had a non-human source.
Pangenome analysis to identify the core and accessory genomes. We created a pangenome based on the 4,072 annotation files using Roary v3.11.2 32 with a 100% BLAST v2.6.0 identity threshold using the MAFFT v7.310 setting 72 . The resulting concatenated core CDS alignment spanning 1,244,619 bases and 3,712 genes scaffolded using NCTC13441 was used for core genome analyses. 242 soft core genes were found that may have been due to assembly errors or other artefacts. Pangenomes for each clade, C subclade and various combinations were also created for accessory genome evaluation.
Phylogenetic reconstruction, population structure and subclade assignment. A maximum likelihood phylogeny was generated based on the core genome alignment of 4,071 genome assemblies with NCTC13441 as a reference across 30,029 SNPs (with 26,946 alignment patterns) for 50 iterations of RAxML v8.2.11 with a GTR model and gamma substitution rate heterogeneity 73 . 88% (3,585) of the assemblies were genetically unique. The total execution time on an Ubuntu v16.04 computer server with 256 Gb RAM using 52 threads was 24.4 days. Phylogenies were drawn and annotated using iTol v4.3.2 74 .
Clade classifications were initially based on published ST131 fimH phylogenetic analyses associating clade A with fimH41, B with fimH22, B0 with fimH27, and C with fimH30 21 . To classify the large number of isolates in the C subclades, we clustered the 30,029 core genome SNPs as a sparse matrix using a hierarchical Bayesian clustering algorithm implemented in Fastbaps v1.0 (Fast Hierarchical Bayesian Analysis of Population Structure 75 ) in R v3.5.3 with packages ape v5.3, ggplot2 v3.1.1, ggtree v1.14.6 76 , maps v3.3.0 and phytools v6.60. This used default parameters except for a Dirichlet prior variance of 0.006.
The C1-M27 lineage was identified based on the prophage-like 11,894 bp M27PP1 region using the 3′ end of accession LC209430 36 , which was largely intact in the isolates with this element. The 19,352 bp M27PP2 prophage-like region was also examined using the 5′ end of the same accession. The monophyletic nature of the C1-M27 group was verified by examining their phylogenetic proximity, which contrasted with the paraphyletic M27PP2-positive C1 isolates, as well as other isolates from Clade A and C2 that were diverse. ESBL gene screening and contig visualisation. We screened for contigs with bla CTX-M-14/15/27 genes across the 4,071 assemblies' 505,761 contigs using BLASTn 77 alignment of these three genes individually, and the Comprehensive Antibiotic Resistance Database (CARD) v3.0 requiring 100% identity for any match with a contig. Selected bla CTX-M-14/15/27 -positive contigs were visualised using the Multiple Antibiotic Resistance Annotator (MARA) 78  Pangenome analysis to find shared and unique accessory genomes. Roary assigned genes to the core (c) and the accessory genomes, including the soft core (s), shell (a) and cloud (d). The expected shell gene number (E[a p ]) from the Roary output for a given pooled set of isolates (p) taken from groups i = 1..k was determined based on the shell gene number of group i (a i ) weighted by the sample size (n i ) corrected for the deficit in   1 (Supplementary Fig. S11). This power-law regression approximated Heaps' law (from n = ĸN gamma for alpha = 1−gamma) such that an open pangenome has alpha <1 and a closed one alpha >1 80 . Previously, diverse E. coli had alpha = 0.625 where alpha had a partial negative correlation with N 41 . Similarly, alpha was ~0.877 for ST131 clade C, ~0.898 for B, ~0.958 for A, and ~0.951 for all ST131, suggesting alpha was higher when genetically distinct clades were combined 31 .
Accessory genome composition across clades and subclades. The relative pairwise genetic distances of the core (π) and accessory (a) genomes were compared for each clade, each C subclade and all bla CTX-M -positive clade C isolates using Poppunk (Population Partitioning Using Nucleotide Kmers), which can distinguish closely related genomes 23 . Poppunk used variable length pangenome k-mer comparisons with Mash v2.1 81 and a Gaussian mixture model to examine the correlation of π and a per sample pair. This annotation-and alignment-free strategy complemented the approaches of Fastbaps, RAxML and Roary.

Data availability
All raw sequence data (reads and/or assembled genomes) for the E. coli genomes analysed in this publication are in Supplementary Table S3, with their Bioproject IDs and associated study DOIs in Supplementary Table S1. The genome assemblies of the 4,071 E. coli ST131 are on Zenodo at https://zenodo.org/record/3341533 for 2,948 and at https://zenodo.org/record/3357944 for the remaining 1,123 files. The 4,071 E. coli ST131 genome annotation files are on Zenodo at https://zenodo.org/record/3341535 for 4,069 and at https://zenodo.org/record/3357914 for the final two files. An interactive version of the phylogeny generated by Poppunk for the 4,071 ST131 assemblies is on MicroReact at https://microreact.org/project/oD6K_fL2d -this includes a Newick tree file for download.