Long-read-sequenced reference genomes of the seven major lineages of enterotoxigenic Escherichia coli (ETEC) circulating in modern time

Enterotoxigenic Escherichia coli (ETEC) is an enteric pathogen responsible for the majority of diarrheal cases worldwide. ETEC infections are estimated to cause 80,000 deaths annually, with the highest rates of burden, ca 75 million cases per year, amongst children under 5 years of age in resource-poor countries. It is also the leading cause of diarrhoea in travellers. Previous large-scale sequencing studies have found seven major ETEC lineages currently in circulation worldwide. We used PacBio long-read sequencing combined with Illumina sequencing to create high-quality complete reference genomes for each of the major lineages with manually curated chromosomes and plasmids. We confirm that the major ETEC lineages all harbour conserved plasmids that have been associated with their respective background genomes for decades, suggesting that the plasmids and chromosomes of ETEC are both crucial for ETEC virulence and success as pathogens. The in-depth analysis of gene content, synteny and correct annotations of plasmids will elucidate other plasmids with and without virulence factors in related bacterial species. These reference genomes allow for fast and accurate comparison between different ETEC strains, and these data will form the foundation of ETEC genomics research for years to come.


Results
Genome analysis of eight representative ETEC isolates. Eight ETEC strains representing the seven major ETEC lineages (L1-L7) comprising isolates with the most prevalent virulence factor profiles were sequenced, assembled, circularised and manually curated (Table 1).
L3 includes two different representative strains, one CS7-positive and one CFA/I-positive. All chromosomes except one (E1779) were circularised. The average length of the chromosome was 4,927,521 bases (4,721,151,162) with an average GC content of 50.7% (50.4-50.9%) and the number of CDS ranging from 4409 to 4924 (Table S1). Each ETEC reference genome contains between two and five plasmids encompassing plasmid-specific features, some of which carry virulence genes and/or antibiotic resistance genes (Table 2, Additional File 2).
Comparative genomics of the chromosome. The chromosomes of the reference strains were aligned and compared using progressiveMauve (v2.4.0, URL: http://darlinglab.org/mauve/mauve.html) 29 , and the overall structure is conserved across all eight chromosomes (Figure S1). In total, 8348 chromosomal genes were identified in the eight ETEC strains, with 3179 genes considered part of the core genome shared by all eight reference strains. The majority of human commensal Escherichia coli (E. coli) strains belong to subgroup A 30,31 . However, ETEC strains fall into multiple phylogenetic groups (A, B1, B2, D, E, F and Clade I), with the majority found in the phylogenetic groups A and B1 18 . The phylogenetic groups of the eight ETEC reference strains have previously been determined using the triplex-PCR scheme 32 . The ETEC references were re-analysed using ClermonTyping 33 , and it was determined that strain E1373 belongs to phylogenetic group E while the other reference isolates belong to groups A and B1 (Table 1).
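The core-genome figure reported above (3179 of 8348 chromosomal genes shared by all eight strains) is, conceptually, a set intersection over per-strain gene content. The following is a minimal sketch of that idea only, assuming genes have already been grouped into clusters by some pan-genome tool; the function name `core_and_pan` and the toy gene identifiers are hypothetical and are not taken from the study.

```python
from typing import Dict, Set, Tuple


def core_and_pan(gene_sets: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (core, pan) gene sets for a mapping of strain -> gene clusters.

    core = genes present in every strain; pan = genes present in any strain.
    """
    if not gene_sets:
        return set(), set()
    sets = list(gene_sets.values())
    core = set.intersection(*sets)
    pan = set.union(*sets)
    return core, pan


# Toy input: strain names are real, gene content is invented for illustration.
toy = {
    "E925": {"gyrA", "acrF", "geneX"},
    "E1649": {"gyrA", "acrF", "geneY"},
    "E36": {"gyrA", "acrF"},
}
core, pan = core_and_pan(toy)
print(sorted(core), len(pan))
```

With real data, the size of `pan` would correspond to the total number of chromosomal genes identified and the size of `core` to the genes shared by all reference strains.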
Plasmids. The plasmids of each isolate were annotated using Prokka followed by manual curation of the annotations, including genes that are part of the conjugation machinery and known plasmid stability genes. Virulence factors (including CFs, toxins, EtpBAC and EatA), putative virulence factors (e.g. CexE) and antibiotic resistance genes 34 as well as complete and partial insertion elements and prophages were manually annotated. The plasmids were designated pAvM_strainID_integer, e.g. pAvM_E925_4 (Additional file 2). Plasmid numbering in this study starts at 4 because three previous plasmids, E873p1-3, have already been deposited in GenBank in connection with a different project 8 . Plasmids were typed by analysing the presence and variation of specific replication genes to assign the plasmids to incompatibility (Inc) groups. The Inc groups of the ETEC reference plasmids were first determined using PlasmidFinder and further classified into subtypes using pMLST 35 . The replicons identified are IncFII, IncFIIA, IncFIIS, IncFIB, IncFIC, IncI1 and IncY. Plasmids with replicon IncY, IncFIIY or IncB/O/K/Z mainly harboured plasmid-associated genes, such as stability and transfer genes. Importantly, replicons FII, FIB and I1 are strongly associated with virulence genes, as genes encoding all CFs, toxins and the virulence factors EatA and EtpBAC are present on these plasmids. The majority of all ETEC plasmids analysed here (17/29) belong to IncFII, and six of these IncFII plasmids have an additional IncFIB replicon. In six of the ETEC reference strains two or three IncFII replicons are present; for example, in strain E925, the plasmids pAvM_E925_4 and pAvM_E925_7 both belong to IncFII. However, these plasmids were further subtyped to FII-111 and FII-15, respectively (Table 2 and Additional file 3), explaining the plasmid compatibility.
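The plasmid designation scheme and the replicon-subtype argument above can be illustrated with a short sketch. It parses names of the form pAvM_strainID_integer and reports plasmid pairs within a strain that share an identical replicon subtype. This is a simplification for illustration: the regular expression, the function names and the rule "identical subtype implies a potential incompatibility" are assumptions, and real plasmid incompatibility depends on more than typing labels.

```python
import re
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Assumed pattern for the designation pAvM_strainID_integer (e.g. pAvM_E925_4).
NAME_RE = re.compile(r"^pAvM_(?P<strain>[A-Za-z0-9]+)_(?P<index>\d+)$")


def parse_plasmid_name(name: str) -> Tuple[str, int]:
    """Split a plasmid designation into (strain ID, plasmid number)."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a pAvM_strainID_integer designation: {name!r}")
    return m.group("strain"), int(m.group("index"))


def shared_subtypes(
    replicons: Dict[str, Set[str]]
) -> List[Tuple[str, str, Set[str]]]:
    """List plasmid pairs that carry at least one identical replicon subtype."""
    clashes = []
    for a, b in combinations(sorted(replicons), 2):
        common = replicons[a] & replicons[b]
        if common:
            clashes.append((a, b, common))
    return clashes


# Subtypes for the two IncFII plasmids of E925 as stated in the text.
e925 = {"pAvM_E925_4": {"FII-111"}, "pAvM_E925_7": {"FII-15"}}
print(parse_plasmid_name("pAvM_E925_4"))
print(shared_subtypes(e925))
```

For E925 the two IncFII plasmids carry different subtypes, so the sketch reports no shared subtype, in line with their coexistence in one strain.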
Virulence factors. The CFs expressed by the selected reference strains are CFA/I, CS1-CS3, CS5-CS7 and CS21. Three of the strains (E925, E1649 and E1779) express both LT and ST, two strains (E2980 and E1441) express LT and the strains E36 and E562 express STh, while E1373 expresses STp (Table 1). A plasmid can harbour multiple virulence genes, usually a CF locus and genes encoding one or two toxins. Interestingly, plasmids seldom harbour multiple CF loci; in the ETEC reference strains described here, different CF loci are usually located on separate plasmids. An exception is strain E1779, in which the CS5 and CS6 loci are located on the same plasmid (pAvM_E1779_19). In both E925 (L1) and E1649 (L2) the genes encoding CS3 (cstABGH), ST (estA) and LT (eltAB) are located on the same plasmid, both with the FII replicon and of roughly the same size (Table 2). Blastn comparison between the plasmids and additional plasmids that harbour the same virulence genes shows that they are highly conserved (Fig. 1). The results correspond with the close genetic relationship and common ancestry of lineage 1 (L1) and lineage 2 (L2) 18 .
Besides CFs and toxins, additional virulence factors were identified in the majority of the strains (Table 2), with eatA and etpBAC being the most commonly found.
EatA is an immunogenic mucinase that contributes to virulence by degrading MUC2, which is the major protein component of mucus in the small intestine 37,38 . The etpBAC genes encode an adhesin located on the tip of the flagella that mediates adherence to host cells 39,40 . Four reference strains (E925, E1649, E36 and E562) harbour both eatA and etpBAC. In three strains the eatA and/or etpBAC genes are located on the same plasmid with an FII or FII + FIB replicon along with additional ETEC virulence genes, except in E562 and E1373, where eatA and etpBAC are located on an I1 + FII (pAvM_E562_23) and I1 (pAvM_E1373_16) plasmid, respectively, which mainly contain plasmid-associated genes including genes encoding the pil operon and tra-operon (pAvM_E562_23). Furthermore, a less explored putative virulence factor is CexE, which is an extracytoplasmic protein dependent on the expression of the CFA/I regulator cfaD 41 , and was first identified in H10407 42 . Corroborating earlier findings, the CFA/I positive E36 (L3) and E562 (L6) isolates harbour cexE (pAvM_E36_12 and pAvM_E562_25).
In addition, cexE is present in pAvM_E925_6, pAvM_E1779_19, pAvM_E2980_14, pAvM_E1441_18 and pAvM_E1373_28. CexE has previously also been identified in several CS5 + CS6 positive ETEC and shown to be upregulated in the presence of bile and sodium glycocholate-hydrate 43 . Bile is known to be involved in the regulation of several ETEC CFs 44,45 . The location of cexE seems to be conserved across specific strains. In pAvM_E36_12, pAvM_E1441_18, pAvM_E1779_19 and pAvM_E562_25 cexE is located upstream of the aatPABC locus, whereas in pAvM_E925_6 and pAvM_E2980_14 cexE is located downstream of rob (an AraC family transcriptional regulator) in the opposite direction. Plasmid pAvM_E925_4 harbours the aatPABC locus; however, cexE is located on a different plasmid (pAvM_E925_6) in this strain.
Comparison of plasmids with the same virulence profile. ETEC isolates within a lineage share the same virulence profile, specifically the same CF profile (Figures S2-S3). We verified by phylogenetic analyses that our selected isolates grouped within previously described lineages with confirmed virulence profiles (Figures S2-S3). A blastn search was performed for each of the CF positive plasmids from each reference genome, and the best hit(s) were used for subsequent analysis (Fig. 1). Most of the plasmids identified as related to the ETEC reference plasmids were not annotated; hence, when needed, these were annotated with Prokka, giving high priority to the annotation of the corresponding ETEC reference plasmid. We show that plasmids with the same CF and toxin profile from the same lineage are often conserved (Fig. 1, e.g. Fig. 1a). Furthermore, high coverage and similarity were found between the plasmids of isolate E1441 (L4) and the PacBio sequenced plasmids of ETEC isolates ATCC 43886/E2539C1 and 2014EL-1346-6 20 . These isolates were collected in the seventies 46 and 2014 (from a CDC collection), respectively, and assigned as O25:H16, which is the O group determined for E1441 in silico (Fig. 1e). Plasmids of E2980 (LT + CS7, L3) were validated by the PacBio sequenced plasmids of ETEC isolate E2264 (Fig. 1d). Similarly, two plasmids of E1779 (LT, STh + CS5 + CS6, L5) were identified in E2265 (LT, STh + CS5 + CS6) 28,43 , although E1779 harboured two additional plasmids. Several additional L5 ETEC genomes have been sequenced within the GEMS study 47 , and high plasmid similarity and conservation in CS5 + CS6 positive L5 isolates was evident (Fig. 1f).
Overall the results show that ETEC plasmids are specific to lineages circulating worldwide and conserved over time (Fig. 1, Figures S2-S3, and Figures S4-S11 for more extensive plasmid annotation). Thus, the plasmids of major ETEC lineages must confer evolutionary advantages to their host genomes since they are seldom lost.
Antibiotic resistance. E. coli can become resistant to antibiotics both via the presence of antibiotic resistance genes and via the acquisition of adaptive and mutational changes in genes encoding efflux pumps and porins, which allow the bacterium to pump out the antibiotic molecules effectively 48,49 .
Antibiotic resistance genomic marker(s), both chromosomally located and on plasmids, were identified using the CARD database 34 (Table 2, Figures S12 and S13 and Additional file 2). Similar to other studies, IncFII and B/O/K/Z plasmids were found to harbour genes conferring antibiotic resistance 50 . Furthermore, the phenotypic antibiotic resistance profile was determined with clinical MIC breakpoints based on EUCAST (The European Committee on Antimicrobial Susceptibility Testing) 51 (Table S2). Phenotypic antibiotic resistance profiles (Table S2) were supported mainly by the findings of antibiotic resistance genes, efflux pumps and porins (Figures S4 and S5 and Table S3), although some differences were found. All ETEC reference strains are phenotypically resistant to at least two antibiotics of the 14 tested (Table S2). Resistance against penicillins, norfloxacin (Nor) and chloramphenicol (Cm) is most common among these strains. Two of the strains, E1441 and E2980, harbour more than four antibiotic resistance genes as well as multiple efflux systems and porins (Figure S12, Figure S13 and Table S3). The plasmid pAvM_E1441_17 carries aadA1-like, dfrA15, sul1 and tetA(A) resistance genes (Table 2), where the first three genes are in a Class 1 integron which confers resistance to streptomycin, trimethoprim, and sulphonamide (sulphamethoxazole). The gene tetA(A) is part of a truncated Tn1721 transposon 52 . The E1441 strain was verified as resistant to tetracycline (Tet) and sulphamethoxazole-trimethoprim (Sxt), while streptomycin was not tested. A mer operon derived from Tn21 is also present in the resistance region of pAvM_E1441_17 (Table 2), indicating that the plasmid would also likely confer tolerance to mercury, although this was not confirmed. Interestingly, this multi-replicon (FII and FIB) plasmid also harbours the lng locus encoding CS21, one of the most prevalent ETEC CFs.
In isolate E2980, the virulence plasmid pAvM_E2980_15 harboured multiple resistance genes in the same region (bla TEM-1b , strA, strB and sul2) conferring resistance to ampicillin, streptomycin and sulphonamides. E2980 was found to be resistant to ampicillin (Amp) and oxacillin (Oxa), which can be broken down by the beta-lactamase Bla TEM-1b (Table 2, Tables S2 and S3). E562 harbours three antibiotic resistance genes: ampC located in the chromosome, and the tet(A) and bla TEM-1b genes on an FII plasmid (pAvM_E562_27). The mer operon derived from Tn21 is also present in the region (Table 2 and Table S3). The phenotypic resistance profile of E562 matches the genomic profile with resistance to tetracycline (Tet), ampicillin (Amp), amoxicillin-clavulanic acid (Amc) and oxacillin (Oxa) (Table S2). The plasmid pAvM_E36_13 contains a complete copy of Tn10, which encodes the tet(B) tetracycline resistance module. Although the pAvM_E1373_29 phage-like plasmid is cryptic, related plasmids such as the pHCM2-family of phage-like plasmids 53 (described below) can harbour resistance genes such as bla CTX-M-14 54 . Phenotypic resistance to ceftazidime (Caz) and ceftriaxone (Cro) was not found in the isolates, which was consistent with the absence of extended-spectrum beta-lactamase (ESBL) resistance genes in the sequence data.
Resistance to chloramphenicol (Cm) was found in five isolates, but none of the resistant isolates contained known resistance genes suggesting that chromosomal mutations or presence of efflux pumps may account for this reduced susceptibility.
The ETEC reference strains contain several efflux systems which could explain why the genotypic and phenotypic antibiotic resistance profile did not match for all antibiotics. All of the isolates harbour multiple efflux pumps located on the chromosome and plasmids (Table S3 and Figure S12). In E925, a non-synonymous mutation in acrF was identified (G1979A) resulting in a substitution from arginine to glutamine (A360Q). The effect on the expression and/or function of the AcrEF efflux pump was not verified.
Phenotypic resistance to norfloxacin (Nor) was found in six of the isolates. The isolates were analysed for chromosomal mutations likely to confer quinolone resistance using ResFinder, but a gyrA mutation was found in only one strain, E2980 (S83A), which may confer resistance to nalidixic acid, norfloxacin and ciprofloxacin. However, E2980 was sensitive to nalidixic acid. Both mutation(s) that alter the target (gyrA and parC) and the presence of efflux pumps can confer resistance to fluoroquinolones. The majority of the isolates are moderately resistant to norfloxacin (and nalidixic acid), both quinolones. This is most likely due to the presence of two efflux pumps, AcrAB-R and AcrEF-R, as only one mutation was identified in gyrA of isolate E2980, whereas usually at least two mutations are needed to confer augmented resistance 57 .

Identification of phage-like plasmids in ETEC.
Two of the ETEC reference strains (E1649 and E1373) harboured phage-like plasmids (pAvM_E1649_9 and pAvM_E1373_29), which encode genes for DNA metabolism and DNA biosynthesis as well as structural bacteriophage genes (capsid, tail etc.). Both pAvM_E1649_9 and pAvM_E1373_29 contain genes associated with plasmid replication, division and maintenance (i.e. repA and parAB). Phage-like plasmids are found in various bacterial species, such as E. coli, Klebsiella pneumoniae, Yersinia pestis, Salmonella enterica serovar Typhi, Salmonella enterica serovar Typhimurium, Salmonella enterica serovar Derby and Acinetobacter baumannii 58 . The plasmid pAvM_E1649_9 belongs to the P1 phage-like plasmid family (Fig. 2a and Figure S14a), while pAvM_E1373_29 belongs to the pHCM2-family (Fig. 2b and Figure S14b) that can be traced back to a likely phage origin similar to the Salmonella phage SSU5 53 . Both phage-plasmids thus contain replication and/or partition genes of plasmid origin and a complete set of genes that are phage related in function and properties (Fig. 2 and Figure S14). Significantly, phage-like plasmid pAvM_E1373_29 falls within the E. coli lineage of pHCM2 phage-like plasmids rather than among those found in Salmonella species. This indicates that phage-like plasmids have diversified within the bacterial species from which they were isolated.
Blastn searches confirmed high similarity (at least 80% at the DNA level across much of the sequence) of pAvM_E1373_29 to several phage-like plasmids found in E. coli, including ETEC O169:H41 isolate F8111-1SC3 20,59 , several bla CTX-M-15 positive phage-like plasmids (pANCO1, pANCO2 56 and PV234a), as well as a plasmid found in E. coli ST648 from wastewater and ST131 isolate SC367ECC 60 . The P1 phage-like plasmid pAvM_E1649_9 is most similar to p1107-99K, pEC2_5 isolated from human urine and p2448-3 from a UPEC ST131 isolate isolated from blood. The similarity is most pronounced at the amino acid level. Conservation and synteny are evident when pAvM_E1649_9 is compared to P1 phage.

Prophages present and their cargo genes. Prophages may insert into chromosomes and bring along genes required for lysogeny and lytic cycles and cargo genes that are often picked up when DNA is compacted into the capsid. Cargo genes can significantly benefit the host bacterium by providing additional elements for defence against phage, immune evasion and environmental survival. PHASTER analyses identified prophages in the chromosomes of all ETEC reference isolates and some of the plasmids (Table S4). Tellurite resistance operons in isolates E925, E36, E2980 and E1373 were all located in prophages. In addition, eatA (in E925 and E1649) and estA (STh) genes (E36) were prophage cargo genes. Many prophage cargo genes identified in this study have properties related to inhibition of cell division. Among these are a variety of kil genes, which can enhance host bacterial survival in the presence of some antibiotics 61 . Some genes that are core entities within many prophages, such as zapA (from E1779_Pph_6), dicB and dicC (found in phage E1779_Pph_7), also have similar effects as they can inhibit cell division in the presence of antibiotics, which raises the broader question of how they benefit the host bacterium. A different gene of interest is the yfdR gene identified in E1779_Pph_7 (gene E1779_04412). YfdR curtails cellular division by inhibiting DNA replication under stress conditions encountered by the bacterial cell. Similarly, the iraM gene located in phage E1441_Pph_2 plays a role in RpoS stability.
OmpX homologs were found in numerous phages in this study. They are trans-membrane located and play a role in virulence as well as antibiotic resistance 62 . PerC is often associated with EPEC plasmids, where it seems to have a regulatory role for the attaching and effacing gene, eaeA 63 . The presence of a protein (PerC-family activator) containing the same PFAM domain (PF06069) as PerC in EPEC as cargo within an ETEC strain phage, E1779_Pph_7 located on the chromosome, is intriguing. Its ability to regulate other virulence genes is yet to be determined. Within the same phage, a gntR-like regulatory gene was identified. This gene plays a role in gluconate utilisation and induction of the Entner-Doudoroff pathway 64 .

Discussion
ETEC strains have previously been shown to fall into globally spread, genetically conserved lineages which encompass strains with specific virulence factor profiles 18 . The currently widely used ETEC reference strains H10407 (CFA/I) and E24377A (CS1 + CS3) are highly divergent from other strains with the same virulence profile sequenced more recently 18 , which highlights the need for relevant and representative ETEC reference strains and genomes. The long- and short-read sequenced strains presented here comprise complete reference genomes with separate chromosomal and plasmid sequences that allow more detailed studies of ETEC and E. coli phylogeny. The reference strains are representative isolates of their respective lineages and cluster phylogenetically together with different ETEC isolates sequenced by several other groups (Figure S2).
Previous studies confirmed that ETEC belongs to lineages that have spread globally. These analyses were mainly dependent on the shared core genome of chromosomal genes, while conservation of plasmids was indicated by the association between the plasmid-borne toxin and CFs and lineage 18 . Analysis of the plasmids sequenced in the present study showed that the conservation within ETEC lineages also includes plasmids. The role of toxin-antitoxin (TA) systems in the maintenance of these plasmids (or their presence in the chromosome) has not been considered here in detail; however, multiple TA systems were identified across the ETEC plasmids presented (Additional File 2) and their potential involvement will be revisited in a further paper.
Blast analyses confirm that the plasmids identified in this study are often highly homologous to other plasmids present in GenBank. For instance, the 94.5 kb plasmid pAvM_E1441_18 was 98% identical to two plasmids of 96 kb and 82 kb belonging to ETEC O25:H16 isolates ATCC 43886/E2539C1 and 2014EL-1346-6, sequenced by PacBio by Smith et al. 20 (Fig. 1e and Figure S6). Plasmid pAvM_E1441_18 is the major virulence plasmid of this lineage, carrying genes encoding LT and CS6.
The larger plasmid in E1441 (pAvM_E1441_17) carries both the genes for ETEC CF CS21 and antibiotic resistance determinants. Furthermore, complete conjugation machinery was present suggesting that this is most likely a self-transmissible plasmid, though this was not confirmed. Movement of such a plasmid would result in the spread of ETEC virulence genes and AMR determinants.
Interestingly, Wachsmuth et al. 46 analysed transfer frequencies in ETEC O25:H16 isolates (the same serogroup was identified in E1441) and found evidence that resistance to tetracycline and sulfathiazole was transferred but not the genes encoding LT 46 . The same study found evidence of two large plasmids of similar size 46 , corroborating our findings of two plasmids of similar size in E1441, one with eltAB and cssABCD without the tra-operon (pAvM_E1441_18) and the other a putatively mobile plasmid (pAvM_E1441_17) carrying the sul1 and tet(A) genes as well as the lng operon encoding CF CS21. Since ATCC 43886/E2539C1, E1441 and 2014EL-1346-6 were isolated in the 1970s, 1997 and 2014, respectively, our findings indicate that E1441 represents an ETEC lineage with stable plasmid content and a putative ability to transfer antibiotic resistance and the CS21 operon by transfer of one of the plasmids. Furthermore, pAvM_E1441_17 is a multi-replicon plasmid. Carrying multiple replicons has been described as a way for plasmids to broaden their host range, i.e. the possibility to be transferred between bacteria of different phylogenetic groups 65,66 . Whether this plasmid type is found in other E. coli remains to be investigated, but the finding that the L4 lineage retains both plasmids in isolates collected over time and worldwide indicates a strong selective force to keep the extra-chromosomal contents of both plasmids.
The ETEC O169:H41 isolate F8111-1SC3 plasmid unnamed 2 20,59 is highly similar to pAvM_E1373_28 (Fig. 1h and Figure S9). The F8111-1SC3 isolate is part of a CDC collection of ETEC isolates from cruise ship outbreaks and diarrhoeal cases in the US 1996-2003. The antibiotic resistance profiles of these isolates were determined 59 and most isolates of O group 169 were tetracycline resistant, consistent with the finding of the tet gene in E1373, isolated in Indonesia in 1996. Diarrhoea caused by O169:H41 STp CS6 ETEC isolates is repeatedly reported, particularly in Latin America 47,67-69 . Among the cruise ship isolates is the sequenced and characterised virulence plasmid pEntYN10 encoding STp and CS6, described as unstable and easily lost in vitro 67,70 . The E1373 plasmid pAvM_E1373_28 is highly homologous to pEntYN10 (Fig. 1h and Figure S9) and the virulence profile of ETEC O169:H41 is conserved in isolates collected globally. Hence, the reported instability of the plasmid is incongruent with current data indicating that plasmids are stable within this lineage and serotype.
Interestingly, two distinctive extra-chromosomal elements which are highly similar to P1 and SSU5 phage were identified among the 8 ETEC reference strains sequenced (Fig. 2, Figure S14 and Table S4). The SSU5-like element carries several genes that allow it to function as a plasmid and belongs to the pHCM2-like family of phage-plasmids (Fig. 2b) 53 . These plasmids are devoid of virulence factors, transposons and antibiotic markers, but they contain a significant number of DNA metabolism and biosynthesis genes and they may contain bacteriophage inhibitory genes that have not yet been identified. Several SSU5 phage-like plasmids have been shown to carry the ESBL gene bla CTX-M-15 in extra-intestinal pathogenic E. coli isolates 55 . Such carriage seems to be absent or low in ETEC, and the SSU5 phage-like plasmid pAvM_E1373_29 does not contain antibiotic resistance genes. A recent study investigating the distribution of phage-plasmids shows that the phage homologs tend to be more conserved and the plasmid homologs more variable 71 . This is also seen in the phage-plasmids identified here, e.g., genes that could be advantageous to the host cell linked to metabolism and biosynthesis.
To summarise, we provide fully assembled chromosomes and plasmids with manually curated annotations that will serve as new ETEC reference genomes. The in-depth analysis of gene content, synteny and correct annotations of plasmids will also help to elucidate other plasmids with and without virulence factors in related bacterial species. The ETEC reference genomes compared to other long-read sequenced ETEC genomes confirm that the major ETEC lineages harbour conserved plasmids that have been associated with their respective background genomes for decades. This supports the notion that the plasmids and chromosomes of ETEC are both crucial for ETEC virulence and success as pathogens.

Methods
Selection of strains. Initially, one or two ETEC strains with the CF profile specific to each lineage (L1-L7) were chosen from the large collection of ETEC strains at the University of Gothenburg 18 for PacBio sequencing. The seven lineages encompass clinically relevant ETEC strains expressing the most common virulence factor profiles, i.e. toxin and CF profile 18 . The strains were selected based on the location and year of isolation to represent strains isolated from patients with diarrhoea from diverse geographical locations and at different time-points. After the genomes had been sequenced, assembled, circularised and annotated, a second selection was made for manual curation of the genomes. This selection was made based on the quality of the genome assembly and the circularisation. The whole genomes of the ETEC reference strains were compared with one or two other long-read sequenced ETEC strains belonging to the same lineage by progressiveMauve (v2.4.0, URL: http://darlinglab.org/mauve/mauve.html) 29 , which showed that the strains are colinear (Figure S15). One representative ETEC genome from each lineage was annotated, with emphasis on the plasmids. The physical ETEC reference strains are available upon request.
Phenotypic toxin and CF analyses. ETEC isolates were identified by culture on MacConkey agar followed by an analysis of LT and ST toxin expression using GM1 ELISAs 45 . The expression of the different CFs was confirmed by dot-blot analysis 45 . Isolates had been kept in glycerol stocks at − 70 °C, and each strain has been passaged as few times as possible.
Antibiotic susceptibility testing. All ETEC isolates were tested against 14 antimicrobial agents and their minimum inhibitory concentration (MIC) was determined by broth microdilution using EUCAST methodology 51 . The antimicrobial agents were: ampicillin, amoxicillin-clavulanic acid, oxacillin, ceftazidime, ceftriaxone, doxycycline, tetracycline, nalidixic acid, norfloxacin, azithromycin, erythromycin, chloramphenicol, nitrofurantoin and sulfamethoxazole-trimethoprim. All antibiotics were purchased from Sigma-Aldrich. E. coli ATCC 25922 was used as quality control. The MIC was recorded visually as the lowest concentration of antibiotic that completely inhibited growth.
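The MIC read-out described above can be expressed as a small rule over a dilution series. The sketch below is an illustration only, assuming each well is recorded as a concentration paired with a yes/no observation of visible growth; the function name `mic_from_wells`, the concentrations and the handling of inconsistent ("skipped") wells are assumptions and are not part of the EUCAST methodology cited in the text.

```python
from typing import Dict, Optional


def mic_from_wells(wells: Dict[float, bool]) -> Optional[float]:
    """Return the MIC from a mapping of concentration -> visible growth.

    The MIC is taken as the lowest concentration at which this well and all
    higher-concentration wells show no growth. Returns None if the highest
    concentration tested still shows growth (MIC above the tested range).
    """
    mic = None
    # Walk from the highest concentration downwards until growth is seen.
    for conc in sorted(wells, reverse=True):
        if wells[conc]:
            break
        mic = conc
    return mic


# Hypothetical two-fold dilution series (arbitrary units), True = growth.
series = {0.5: True, 1.0: True, 2.0: False, 4.0: False, 8.0: False}
print(mic_from_wells(series))
```

Comparing the resulting value with a clinical breakpoint would then give the susceptible or resistant call; the breakpoints themselves are not reproduced here.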
DNA extraction and sequencing. Strains from each lineage (L1-L7) were SMRT-sequenced on the PacBio RSII. A hybrid de novo assembly was performed combining the reads from both the SMRT-sequenced and Illumina sequenced strains.
For Single-Molecule Real-Time (SMRT) sequencing (Pacific Biosciences), long intact strands of DNA are required. The genomic DNA extraction was performed as follows. Isolates were cultured in CFA broth overnight at 37 °C followed by cell suspension in TE buffer (10 mM Tris and 1 mM EDTA pH 8.0) with 25% sucrose (Sigma) followed by lysis using 10 mg/ml lysozyme (in 0.25 Tris pH 8.0) (Roche). Cell membranes were digested with Proteinase K (Roche) and Sarkosyl NL-30 (Sigma) in the presence of EDTA. RNase A (Roche) was added to remove RNA molecules. A phenol-chloroform extraction was performed using a mixture of Phenol:Chloroform:Isoamyl Alcohol (25:24:1) (Sigma) in phase lock tubes (5prime). To precipitate the DNA, 2.5 volumes of 99% ethanol and 0.1 volume of 3 M NaAc pH 5.2 were used, followed by re-hydration in 10 mM Tris pH 8.0. DNA concentration was measured using a NanoDrop spectrophotometer (NanoDrop). On average, 10 μg of DNA was used for PacBio sequencing. Library preparation for SMRT sequencing was performed according to the manufacturer's (Pacific Biosciences) protocol. The DNA was stored in E buffer and sequenced at the Wellcome Sanger Institute. Isolates were sequenced with a single SMRTcell using the P6-C4 chemistry, to a target coverage of 40-60X using the PacBio RSII sequencer.
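Two quantities in the protocol above reduce to simple arithmetic: the precipitation reagent volumes (2.5 volumes of ethanol and 0.1 volume of 3 M NaAc relative to the sample) and the sequencing yield implied by a target coverage. The sketch below works both through. The 400 µl sample volume is a hypothetical example, the function names are assumptions, and the yield estimate uses only the average chromosome length reported in the Results, ignoring plasmids and read-length effects.

```python
def precipitation_volumes(sample_ul: float) -> dict:
    """Reagent volumes (µl) for ethanol precipitation of a given sample volume."""
    return {
        "ethanol_99pct_ul": 2.5 * sample_ul,  # 2.5 volumes of 99% ethanol
        "naac_3M_ul": 0.1 * sample_ul,        # 0.1 volume of 3 M NaAc pH 5.2
    }


def bases_for_coverage(genome_bp: int, coverage: int) -> int:
    """Total sequenced bases needed to reach a given mean coverage."""
    return genome_bp * coverage


vols = precipitation_volumes(400.0)         # hypothetical 400 µl sample
low = bases_for_coverage(4_927_521, 40)     # average chromosome length, 40X
high = bases_for_coverage(4_927_521, 60)    # average chromosome length, 60X
print(vols, low, high)
```

Under these assumptions, the 40-60X target corresponds to roughly 0.2 to 0.3 Gb of sequence per isolate for the chromosome alone.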
Assembly. The resulting raw sequencing data from SMRT sequencing were de novo assembled using the PacBio SMRT analysis pipeline (https://github.com/PacificBiosciences/SMRT-Analysis) (v2.3.0) utilising the Hierarchical Genome Assembly Process (HGAP) 72 . For all samples, the unfinished assembly produced a single, non-circular, chromosome plus some small contigs, some of which were plasmids or unresolved assembly variants. Using Circlator 73 (v1.1.0), small self-contained contigs in the unfinished assembly were identified and removed, with the remaining contigs circularised. Quiver 72 was then used to correct errors in the circularised region by mapping corrected reads back to the circularised assembly. As the strains had also been short read sequenced, and this data is of higher base quality, the short reads from the Illumina sequencing were used in combination with the long reads using Unicycler 74 to generate high-quality assemblies.
Fully circularised chromosomes and plasmids were achieved for the majority of the strains. Cross-validation of the assemblies was performed where two or three strains of a lineage were sequenced (Figure S15). A single assembly from each lineage was chosen to act as the representative reference genome, with priority given to assemblies with the most complete and circularised chromosome and plasmids. In total, one chromosome and 5 of the 29 plasmids from the 8 selected representative strains could not be circularised (independent of the two strains that were sequenced initially). These are indicated in Table 2 and Table S1. Between two and five plasmids were identified in the eight strains. Shorter contigs that could not be assembled properly contained phage genes and are included in the genomes and annotated as prophages (Table S4). Socru was used to validate the assembly of the chromosomes; they all have a biologically valid orientation and order of rRNA operons with type GS1.0, which is seen in most E. coli in the public domain 75 . A multiple alignment of the chromosomes (Figure S1) was generated using progressiveMauve 29 .
oriT prediction. The location of the oriT in the plasmids, if present, was predicted using oriTFinder 91 with Blast E-value cut-off set to 0.01.
Genomic antibiotic resistance profiling. Antibiotic resistance genes located on the chromosome and plasmid(s), as well as efflux pumps and porins known to confer resistance to antibiotics, were identified by running ARIBA 82 against the CARD database 92 with the default settings (minimum 90% sequence identity and no length cut-off). ARIBA combines a mapping/alignment and targeted local assembly approach to identify AMR genes and variants efficiently and accurately from paired sequencing reads. The heatmaps were visualised using Phandango (v.1.3.0, URL: https://jameshadfield.github.io/phandango/#/) 93.
Virulence gene prediction. The ETEC assemblies from the ETEC-NCBI collection (Additional file 4) were screened using abricate 95 with default settings against the ETEC virulence database (https://github.com/avonm/ETEC_vir_db) for virulence gene (including eatA and etpBAC) prediction. A subset of the isolates in the ETEC-NCBI dataset has previously been analysed for the presence of EatA, where samples with a negative PCR but a positive western blot were counted as positive 80. Here, only isolates harbouring the eatA and etpBAC genes are considered positive.
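The gene-based positivity rule for eatA and etpBAC can be sketched as follows. The sketch assumes that the screening step has already been reduced to a set of detected gene names per isolate, that the genes are labelled eatA, etpB, etpA and etpC, and that an isolate counts as etpBAC-positive only when all three etp genes are detected. These labels and the all-three requirement are assumptions made for illustration, and the isolate names are hypothetical.

```python
from typing import Dict, Iterable

# Assumed gene labels for the etpBAC locus.
ETP_GENES = {"etpB", "etpA", "etpC"}


def call_eata_etpbac(
    genes_by_isolate: Dict[str, Iterable[str]]
) -> Dict[str, Dict[str, bool]]:
    """Call eatA and etpBAC presence from detected gene names per isolate."""
    calls = {}
    for isolate, genes in genes_by_isolate.items():
        present = set(genes)
        calls[isolate] = {
            "eatA": "eatA" in present,
            # Positive only if every etp gene was detected (assumption).
            "etpBAC": ETP_GENES <= present,
        }
    return calls


# Hypothetical isolates and gene lists, for illustration only.
example_calls = call_eata_etpbac({
    "isolate_1": ["eatA", "etpB", "etpA", "etpC"],
    "isolate_2": ["etpB", "etpA"],
})
print(example_calls)
```

Under this rule, an isolate that carries only part of the etp locus is reported as etpBAC-negative, which is stricter than accepting phenotypic evidence alone.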
Prophage prediction. The complete FASTA sequence of each ETEC reference genome was searched for phage genes and prophages using PHASTER (phaster.ca) 96. The identified intact prophages are listed in Table S4. All prophages contained cargo genes, but only recognisable genes are listed, not hypothetical ones. Additional questionable and incomplete prophages were identified but are not included here. Each prophage has been given a specific identifier and is annotated as a mobile_element in the submitted chromosome and/or plasmid(s) of each strain.
Insertion sequences. Insertion sequences in the plasmids, as well as those surrounding the CS2 loci located on the chromosome of E1649, were annotated using both Galileo AMR software 97 and the ISFinder database 98. Complete and partial IS elements were annotated (>95% identity with hits in ISFinder) along with the genes encoding transposases. Three new insertion sequences were detected in this analysis and were submitted to ISFinder as TnEc2, TnEc3 and TnEc4. Transposons and other mobile elements (integrons and group II introns) were also identified using Galileo AMR and blastn against public databases.
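The identity threshold used for IS annotation can be illustrated with a short sketch that keeps hits with more than 95% identity and labels each as complete or partial. The criterion used here for "complete" (the alignment spans the full length of the reference element) is an assumption, since the text does not define it. The hit names and values are invented, and the sketch does not reproduce the Galileo AMR or ISFinder workflows.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ISHit:
    name: str              # name of the matched reference IS element
    identity: float        # percent identity of the alignment
    aligned_length: int    # aligned bases on the reference element
    reference_length: int  # full length of the reference element


def annotate_is_hits(hits: List[ISHit],
                     min_identity: float = 95.0) -> List[Tuple[str, str]]:
    """Keep hits above the identity threshold and label them complete or partial."""
    annotated = []
    for hit in hits:
        if hit.identity <= min_identity:
            continue  # the threshold is strict: identity must exceed 95%
        # Assumed criterion: "complete" means the whole reference is covered.
        is_complete = hit.aligned_length >= hit.reference_length
        annotated.append((hit.name, "complete" if is_complete else "partial"))
    return annotated


# Invented hits for illustration only.
example_is_hits = [
    ISHit("IS_a", 99.2, 800, 800),
    ISHit("IS_b", 97.0, 310, 1200),
    ISHit("IS_c", 90.5, 1000, 1000),
]
print(annotate_is_hits(example_is_hits))
```

In this example the third hit is dropped because its identity is below the threshold, even though it covers the full reference length.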

Data availability
The datasets supporting the conclusions of this article are included within the article and its additional files. The sequencing data generated in this study have been submitted to EMBL (Additional file 4 and 5). The physical ETEC reference strains can be requested by contacting the corresponding author Astrid von Mentzer (avm@sanger.ac.uk or mentzerv@chalmers.se). The database used for annotating ETEC virulence factors, the ETEC virulence database, including the LT and ST alleles, can be found in the GitHub repositories https://github.com/avonm/ETEC_vir_db and https://github.com/avonm/ETEC_toxin_variants_db. An interactive version of the core genome phylogeny of the 1,065 E. coli and ETEC isolates along with the ETEC reference strains (Figure S2) reported here is accessible at https://microreact.org/project/2ZZzaHzeXbMEw9U2MAk7pK?tt=cr. Requests for obtaining clinical isolates collected as part of this study should be addressed to the corresponding author. Exchange of clinical isolates should always be in agreement with the University of Gothenburg.