Enteropathogenic E. coli (EPEC) are a cause of moderate to severe diarrhoea in young children, primarily in developing countries1. The Global Enteric Multicenter Study (GEMS), an epidemiological study of children with moderate to severe diarrhoea and children with no diarrhoea, has demonstrated that EPEC is a leading cause of lethality associated with diarrhoea among children that are less than 12 months of age2,3. By definition, EPEC contain the locus of enterocyte effacement (LEE) pathogenicity island, which encodes a type III secretion system (T3SS) involved in the pathogenesis of these organisms4–7. The LEE region is a defining feature of the attaching and effacing E. coli (AEEC), which includes EPEC and the Shiga toxin-producing enterohaemorrhagic E. coli (EHEC), which are associated with severe food-related illness worldwide8–11. EPEC are further categorized by the presence or lack of the plasmid-encoded bundle-forming pilus genes (BFP)8,12, which are commonly found on the EPEC adherence factor (EAF) plasmid and confer localized adherence (LA) to the surface of intestinal epithelial cells13–16. The BFP operon is frequently identified in EPEC associated with diarrhoeal illness, and these isolates are termed typical EPEC (tEPEC)8,17. E. coli that possess the LEE region, but do not contain the BFP or Shiga toxin genes (LEE+/stx–/bfp–), are commonly termed atypical EPEC (aEPEC)17. Previous studies investigating the genetic diversity of aEPEC have demonstrated that LEE+/stx–/bfp– isolates are a diverse group that can include among them isolates that are more related to other E. coli pathovars and commensal isolates18,19. The aEPEC can also include EHEC and EPEC that have lost the Shiga-toxin genes and BFP genes during passage through a host or the environment or after culture in the laboratory18,19.

Investigation of the genetic and virulence factor diversity of tEPEC has focused mainly on isolates within two lineages, EPEC1 and EPEC220, as defined by multi-locus sequence typing (MLST)20. MLST and phylogenetic analysis have also described additional tEPEC lineages, EPEC3 and EPEC420, as well as EPEC5 and EPEC6, which comprise aEPEC isolates19, suggesting that there is probably greater genetic diversity among EPEC isolates than originally anticipated. Until the recent comparative genomic analysis of a collection of diverse AEEC isolates18, which included additional EPEC1, EPEC2 and the first EPEC4 genomes described, the genome sequences available for EPEC isolates were limited to E2348/69, B171, E22 (a rabbit EPEC isolate) and E110019 (an aEPEC isolate)21,22. Even with recent sequencing, the majority of the EPEC genomes sequenced are historical isolates from developed countries, and little is known regarding the genomic diversity of recent EPEC isolates from developing countries, where EPEC has been identified in the recent landmark GEMS analysis as an important pathogen of children, with tEPEC associated with the greatest amount of mortality2.

In the present study we sequenced the genomes and performed comparative genomic analysis of 70 EPEC isolates from children less than 5 years of age enrolled in GEMS2. Phylogenomic analysis of these 70 EPEC isolates highlighted the considerable evolutionary diversity and variability of EPEC virulence mechanisms in more recent EPEC isolates from developing countries. By comparing the genomes of 24 EPEC from lethal cases (LI), 23 EPEC from non-lethal symptomatic cases (NSI) and 23 EPEC from asymptomatic cases (AI), we identified the genes that are more frequently associated with EPEC from different clinical outcomes. Genomic studies such as this provide valuable insight into the diversity and virulence mechanisms of an E. coli pathogen that is associated with increased risk of death among infants in developing countries3. The findings of this study can be used to generate improved methods for molecular diagnostics of EPEC that will provide information regarding the evolutionary history of an isolate as previously described18. The genes that were identified as more frequently associated with lethal or symptomatic EPEC isolate genomes may be further characterized to obtain a deeper understanding of the EPEC pathogenesis and provide additional targets for vaccine and therapeutic development.


Phylogenomic analysis of GEMS site EPEC isolates associated with different clinical outcomes

To investigate the genomic diversity and virulence mechanisms of EPEC isolated from individuals with differing clinical severity we sequenced the genomes of 70 EPEC from multiple geographic sites included in GEMS3. The 70 EPEC isolates were obtained from cases of diarrhoeal illness in children classified as LI or NSI, or as controls with asymptomatic (AI) outcomes. There were a total of 24 EPEC isolates from LI cases, 23 from NSI cases and 23 from AI cases. The 24 EPEC isolates from LI cases were all tEPEC, and 20 of 23 (87%) of the EPEC from NSI cases and 17 of 23 (74%) of the EPEC from AI cases were tEPEC.

Phylogenomic analysis of the 70 EPEC isolate genomes, together with a collection of previously sequenced AEEC isolates and diverse E. coli and Shigella18, demonstrated that there is greater genomic diversity among recent EPEC isolates from Africa and Asia than in prototype E. coli isolates2,3,23 (Fig. 1). The 70 EPEC isolates were present in E. coli phylogroups A, E, B1 and B218,24, demonstrating considerable genomic diversity for E. coli belonging to a single pathovar (Fig. 1 and Tables 1 and 2). The majority of the isolates were in phylogroups B2 (55.7%, 39/70) and B1 (34.3%, 24/70), each of which included multiple E. coli isolates from various pathovars, as well as laboratory-adapted and commensal E. coli (Fig. 1 and Table 2). Overall, the phylogenomic lineages were not geographically confined, with the exception of the isolates belonging to EPEC lineages in phylogroup A (EPEC5, EPEC10), which were restricted to only two sites (The Gambia and Kenya) (Fig. 1 and Table 2).

Figure 1: Phylogenomic analysis of the 70 EPEC isolates associated with clinical outcomes of differing severity compared with select previously sequenced AEEC genomes and a reference collection of 25 diverse E. coli and Shigella isolate genomes.

The whole-genome assemblies were aligned using Mugsy44 as previously described18. The regions of sequence that aligned in all genomes were concatenated into a single 820,355-bp sequence for each genome, and the concatenated sequences were used to generate a maximum-likelihood phylogeny with 100 bootstrap replications, which was constructed using RAxML v.7.2.845, and visualized using FigTree v.1.4.2 (http://tree.bio.ed.ac.uk/software/figtree/). Bootstrap values of ≥80 are designated on the tree by a filled circle. Genomes examined in this study that were obtained from lethal cases (LI) are indicated in orange, those from non-lethal symptomatic (NSI) cases are indicated in green and isolates from asymptomatic (AI) cases in blue. The presence of bfpA is indicated by a star symbol. The four novel EPEC phylogenomic lineages identified in this study are indicated by an asterisk.

Table 1 Genome characteristics of the isolates sequenced in this study.
Table 2 Distribution of the 70 EPEC isolates from different clinical outcomes analysed in this study.

An MLST-based phylogeny was also constructed using anchor isolates of the previously described EPEC lineages, EPEC1–EPEC619,20. This allowed the 70 EPEC isolates sequenced in the current study to be related to the previously MLST-defined EPEC lineages (Fig. 1 and Supplementary Fig. 1). Remarkably, only 16 (22.9%) of the isolates sequenced were present in the two main previously identified MLST-based lineages of EPEC, EPEC1 and EPEC2, with eight in each lineage (Fig. 1 and Tables 1 and 2). An additional eight genomes (11.4%) were in the EPEC4 lineage (Fig. 1 and Tables 1 and 2), which has previously been described by MLST and for which a single genome has been sequenced18,20. Another three genomes of isolates 103338, 401140 and 401210 grouped in the MLST-based phylogeny with an isolate previously designated as EPEC5 using MLST20 (Supplementary Fig. 1). The remaining 43 genomes formed novel EPEC phylogenomic lineages. This finding indicates that the isolates in this study capture considerable previously uncharacterized EPEC genomic diversity (Fig. 1). To extend the established MLST-based nomenclature, we are designating four previously undescribed phylogenomic lineages, which each contain five or more genomes, EPEC7–10 (Fig. 1 and Supplementary Table 1). Eleven of these genomes were in the EPEC7 phylogenomic lineage and B1 phylogroup (Table 2). In phylogroup B2 there were ten genomes forming the EPEC8 phylogenomic lineage and six in the EPEC9 phylogenomic lineage (Fig. 1 and Table 2). The remaining two genomes belong to the EPEC10 lineage, which was designated when combined with three previously sequenced LEE+/stx–/bfp– isolates18 (Fig. 1 and Table 2). The four newly described EPEC lineages contain 41.4% (29/70) of the isolates, highlighting the undescribed diversity of global EPEC isolates.

In addition to these novel lineages, there were 14 genomes not assigned to phylogenomic lineages EPEC1–10, which thus represent unclassified EPEC isolates. These isolates were distributed throughout the E. coli phylogeny (Fig. 1 and Table 1). Of these 14 unclassified EPEC isolates, only one was associated with an LI case, two with NSI cases and 11 with AI controls (Fig. 1 and Table 2), and six of these isolates were bfpA– (Fig. 1). Thus, the unclassified EPEC isolates comprised nearly half (11/23, 48%) of the AI isolates, whereas the LI and NSI isolates were primarily associated with phylogenomic lineages that contained one or more tEPEC. These distributions suggest there may be an optimal EPEC genomic content that is required for the greatest virulence.

Distribution of EPEC virulence-associated genes

The expanded genome phylogeny described here identified a previously unrecognized phylogenetic distribution of EPEC isolates; however, it was unclear whether these differences extended to the known EPEC virulence factors. In addition to the T3SS encoded by the LEE pathogenicity island5,25, present in all genomes sequenced in this study, there were additional virulence-associated secretion systems detected in the isolates sequenced in this study (Supplementary Table 1). Among these regions was a type II secretion system (T2SS) and a type VI secretion system (T6SS), both of which exhibited phylogroup- and lineage-specific distributions (Supplementary Table 1). Investigation of the sequence diversity of previously characterized T3SS effectors demonstrated that the effectors exhibited greater similarity by phylogenomic lineage than by clinical outcome (Supplementary Fig. 2).

Phylogenetic analysis of the bfpA nucleotide sequences present in each of the 61 bfpA+ genomes sequenced in this study, with 11 reference bfpA alleles20,26 and 31 bfpA alleles from previously sequenced EPEC genomes18, demonstrated that the majority of the bfpA genes belonged to one of three main phylogenetic groups as defined by Blank and colleagues26 (Supplementary Fig. 3a). Each of the phylogenetic groups of bfpA contains isolates from diverse phylogenomic lineages and clinical outcomes (LI, NSI and AI). This is in contrast to the intimin gene, eae, from the LEE pathogenicity region, which exhibits greater phylogenomic lineage specificity (Supplementary Fig. 3b). This difference suggests that bfpA, and by extension the entire bfp operon and possibly the entire EAF plasmid, have been lost and acquired multiple times by E. coli isolates belonging to diverse EPEC phylogenomic lineages.

Interestingly, all of the LI isolates analysed in this study were found to be bfpA+ by PCR, as previously described18, with the exception of isolate 100414, which was bfpA– (Table 1 and Supplementary Table 1). However, on detailed examination of the genome sequence, EPEC isolate 100414 was determined to encode a bfpA orthologue with 72% nucleotide identity to bfpA of the E2348/69 EAF plasmid, pMAR222. The 100414 bfpA allele exhibited greater phylogenetic similarity to a bfpA-like sequence from the LEE-negative EAEC isolate 101-121 (Supplementary Fig. 3a).

Identification of EPEC genes associated with different clinical outcomes

To identify whether there are genes that are more prevalent among the 70 EPEC from different clinical presentations, we used large-scale BLAST score ratio (LS-BSR) analysis27,28 to analyse the whole genome content. The LS-BSR analysis places predicted homologous genes from each genome into gene clusters that have ≥90% nucleotide identity29. For the 70 genomes analysed in this study, 12,964 gene clusters were identified and 1,080 gene clusters were present in all 70 genomes analysed (LS-BSR ≥ 0.9). These gene clusters represent the conserved EPEC core genome. This is a more conservative approach than was previously used to define the E. coli species core genome and so the absolute number of genes is smaller than the E. coli core genome defined previously21,30.
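The core-genome definition above (a gene cluster present with LS-BSR ≥ 0.9 in every genome) can be illustrated with a minimal Python sketch. It assumes the LS-BSR output has already been loaded into a nested dictionary keyed by cluster and genome; the function name `core_clusters`, the cluster identifiers and the toy values are hypothetical and are not taken from the study's data or from the LS-BSR software.

```python
from typing import Dict, List


def core_clusters(
    lsbsr: Dict[str, Dict[str, float]],
    genomes: List[str],
    threshold: float = 0.9,
) -> List[str]:
    """Return the clusters whose LS-BSR value meets the threshold in every genome.

    A genome missing from a cluster's row is treated as LS-BSR 0.0 (assumption).
    """
    core = []
    for cluster_id, scores in lsbsr.items():
        if all(scores.get(genome, 0.0) >= threshold for genome in genomes):
            core.append(cluster_id)
    return sorted(core)


# Toy matrix with three hypothetical genomes and three hypothetical clusters.
toy_matrix = {
    "cluster_0001": {"g1": 1.00, "g2": 0.95, "g3": 0.91},  # present in all
    "cluster_0002": {"g1": 1.00, "g2": 0.72, "g3": 0.98},  # divergent in g2
    "cluster_0003": {"g1": 0.99, "g2": 0.90},              # no hit in g3
}
core = core_clusters(toy_matrix, ["g1", "g2", "g3"])
print(core)
```

Requiring the threshold in all genomes is what makes this definition conservative: a single divergent or unassembled copy removes a cluster from the core.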

A comparison of gene cluster prevalence in LI genomes versus AI genomes demonstrated a significant correlation (P < 0.05) of 367 gene clusters (Table 3 and Supplementary Table 2). Among the gene clusters represented in a greater number of LI than AI genomes were genes of the EAF plasmid, flagellin, an allele of the T3SS effector NleG, as well as many hypothetical and phage-associated genes (Supplementary Table 2). There were 111 clusters that were significantly more prevalent in LI genomes or in NSI genomes (Table 3). Among the genes that were more prevalent among the LI genomes were many that encoded hypothetical proteins, putative transcriptional regulators, a putative T3SS effector EspJ and putative phage-associated genes (Supplementary Table 2). Similarly, there were 118 gene clusters that were statistically more prevalent in NSI genomes versus AI genomes (Table 3).

Table 3 Number of gene clusters identified using LS-BSR that are significantly correlated with one clinical outcome when compared to another clinical outcome.

Although we identified gene clusters with a significant correlation with one symptomatic group compared to another symptomatic group (Table 4 and Supplementary Table 3), there were no gene clusters that were detected in all of the LI genomes that were absent from all of the NSI and AI genomes. The absence of universal clinically associated genes may partly be a result of the vast genomic diversity of the isolates associated with each of the clinical outcomes (Fig. 1 and Supplementary Table 3). However, there were 428 gene clusters that were statistically (P < 0.05) more prevalent among the symptomatic (LI and NSI) compared to asymptomatic (AI) genomes, and 40 of these gene clusters had a P value of <0.005 (Table 4 and Supplementary Table 4). These gene clusters that were more prevalent among symptomatic compared to asymptomatic group genomes included numerous hypothetical proteins and phage and plasmid-associated genes (Supplementary Table 4). When the distribution of these 428 gene clusters was compared by hierarchical cluster analysis, the EPEC isolates formed three main groups that included all of the genomes, except three isolates that were outliers (Fig. 2). Group I contained nine of the ten EPEC8 isolates and the only EPEC8 isolate that was not within group I was part of group III and associated with an asymptomatic outcome (Fig. 2). Thus, all of the EPEC isolates of group I were associated with symptomatic outcomes (five LI and four NSI). Meanwhile, group II contained 18 isolates, all belonging to E. coli phylogroup B2. Seven of these isolates (39%) were associated with symptomatic outcomes, while the other 11 (61%) EPEC isolates were from asymptomatic outcomes (Fig. 2). The largest group was group III, which contained 40 isolates, including 31 (78%) from symptomatic outcomes and nine (22%) from asymptomatic outcomes (Fig. 2). 
The EPEC isolate genomes of group III primarily belonged to phylogroups B1 and A, with the exception of four EPEC9 isolates and seven EPEC4 isolates from phylogroup B2 (Fig. 2).

Table 4 Number of gene clusters identified using LS-BSR that are significantly correlated with genomes of a particular clinical outcome.
Figure 2: Identification of genes associated with symptomatic and asymptomatic EPEC isolates.

The plot is a hierarchical cluster analysis41 of the 428 LS-BSR gene clusters, generated using a clustering threshold of 90% nucleotide identity, that were significantly (chi-square test or Fisher's exact test, P < 0.05) more prevalent in genomes of symptomatic (LI and NSI) compared to asymptomatic (AI) cases for all 70 EPEC genomes analysed. Hierarchical clustering with Pearson correlation and average linkage was performed using MeV42. Each column represents a genome, and each row is an LS-BSR gene cluster. The gene clusters that were present with an LS-BSR value of ≥0.9 are indicated in blue, and the gene clusters that were absent (LS-BSR value of <0.9) in white. Red boxes indicate three groups of genomes, designated I, II and III, and red asterisks identify the nodes that separate the genomes into the three groups. The colour-coded rectangles at the top of the plot denote the phylogenomic lineage, and the colour-coded squares indicate the clinical outcome of each isolate. The colour coding of each symbol is given in the key at the top of the figure. A star symbol denotes the presence of bfpA in each genome.

To investigate whether there were similar trends observed when comparing only the tEPEC isolates, we excluded the nine aEPEC isolates. Comparison of the tEPEC from the three different clinical outcomes (LI versus NSI, LI versus AI and NSI versus AI) identified fewer gene clusters that were significantly (P < 0.05) associated with one clinical outcome over another than were identified when comparing all 70 EPEC genomes (see Table 3 and Supplementary Table 3 for comparisons by clinical presentation, and Table 4 and Supplementary Table 5 for symptomatic versus asymptomatic comparisons). These findings suggest there is an increased genomic diversity associated with the aEPEC isolates.

Hierarchical cluster analysis of the presence of the 258 gene clusters significantly associated with only tEPEC of symptomatic or asymptomatic outcomes separated the tEPEC isolates into two similarly sized groups (Supplementary Fig. 4). tEPEC group I contained 34 isolates, including 20 (59%) from symptomatic (LI or NSI) outcomes and 14 (41%) from asymptomatic outcomes (Supplementary Fig. 4). Meanwhile, tEPEC group II contained 27 genomes; 24 (89%) from symptomatic outcomes and only three (11%) from asymptomatic outcomes (Supplementary Fig. 4). Within each of these tEPEC groups the isolates were present in subgroups based on phylogenomic lineage. There were 20 gene clusters that were present in all of the genomes of tEPEC group I that were absent from all genomes of tEPEC group II (Supplementary Fig. 4 and Supplementary Table 5) including gene products predicted to be involved in propanediol utilization (Supplementary Table 5), which has been implicated in Salmonella for its role during survival in the host31,32.

EPEC-specific genes associated with different clinical outcomes

To identify genes that were associated with EPEC isolates of different clinical outcomes, while taking into account the considerable underlying genomic diversity of these isolates, we performed LS-BSR analysis using a decreased clustering threshold of 80% nucleotide identity to combine potential alleles. Commensal genomes were included (E. coli HS (NC_009800.1), K-12 (NC_000913.3) and SE11 (NC_011415.1)) in the analysis as a metric for counter selection. This approach provided the opportunity to identify genetic features that were present only in the EPEC, regardless of phylogenomic lineage. For this analysis there were 12,196 total gene clusters. Of those, there were 6,474 gene clusters (53%) that were present in one or more of the EPEC genomes that were absent (LS-BSR < 0.8) from all of the commensal isolates. Using this EPEC-only data set and examining all 70 EPEC isolate genomes, the number of gene clusters that were significantly (P < 0.05) associated with one clinical outcome over another ranged from 39 to 198 (Table 3 and Supplementary Table 6). Similarly, when comparing only the 61 tEPEC genomes, the number of genes associated with genomes of one clinical outcome over another was lower, ranging from 7 to 134 (Table 3 and Supplementary Table 7). Furthermore, the number of genes significantly associated with symptomatic (LI and NSI) compared to asymptomatic (AI), or lethal (LI) compared to non-lethal (NSI and AI) genomes was decreased (Table 4 and Supplementary Table 8). The number of gene clusters associated with symptomatic or asymptomatic genomes was 246 when comparing all 70 EPEC isolates (Table 4, Supplementary Table 8 and Supplementary Fig. 5) and 141 when comparing only the 61 tEPEC isolates (Table 4, Supplementary Table 9 and Supplementary Fig. 6).
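The counter-selection step described above (keep gene clusters present in one or more EPEC genomes and absent, LS-BSR < 0.8, from every commensal genome) can be sketched as a simple filter. This is an illustrative sketch only: the function name `epec_specific_clusters`, the genome labels and the toy scores are hypothetical, and the LS-BSR matrix is assumed to be available as a nested dictionary.

```python
from typing import Dict, List


def epec_specific_clusters(
    lsbsr: Dict[str, Dict[str, float]],
    epec_genomes: List[str],
    commensal_genomes: List[str],
    threshold: float = 0.8,
) -> List[str]:
    """Clusters present in >=1 EPEC genome and absent from all commensal genomes.

    Missing matrix entries are treated as LS-BSR 0.0 (assumption).
    """
    specific = []
    for cluster_id, scores in lsbsr.items():
        in_epec = any(scores.get(g, 0.0) >= threshold for g in epec_genomes)
        in_commensal = any(scores.get(g, 0.0) >= threshold for g in commensal_genomes)
        if in_epec and not in_commensal:
            specific.append(cluster_id)
    return sorted(specific)


# Toy example: two hypothetical EPEC genomes and two commensal references.
toy_matrix = {
    "cluster_a": {"epec_1": 1.00, "epec_2": 0.20, "HS": 0.10, "K-12": 0.00},
    "cluster_b": {"epec_1": 0.95, "epec_2": 0.90, "HS": 0.92, "K-12": 0.30},
    "cluster_c": {"epec_1": 0.50, "epec_2": 0.30, "HS": 0.10, "K-12": 0.10},
    "cluster_d": {"epec_2": 0.85, "K-12": 0.79},
}
epec_only = epec_specific_clusters(toy_matrix, ["epec_1", "epec_2"], ["HS", "K-12"])
print(epec_only)
```

In the toy data, `cluster_b` is dropped because a commensal genome carries it, and `cluster_c` is dropped because no EPEC genome reaches the threshold.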

Many of the gene clusters that were associated with one clinical outcome were annotated as hypothetical proteins (Supplementary Table 4). To examine the potential function of the predicted peptides, the gene clusters were examined for protein domains identified in membrane-associated or secreted proteins, which would suggest they might be directly involved in surface expression or survival. Of the 39 to 246 gene clusters that were identified as significantly associated with one clinical outcome in the analysis of all 70 EPEC (Tables 3 and 4), the number of gene clusters with protein domains of secreted or surface-associated proteins ranged from 11 to 50 (Supplementary Table 10). Similarly, of the 7 to 141 gene clusters significantly associated with one clinical outcome in the analysis of only the tEPEC genomes (Tables 3 and 4), the number of gene clusters containing membrane-associated or secreted protein domains was low, ranging from 2 to 31 (Supplementary Table 10). Among the gene clusters that were significantly more prevalent in symptomatic compared to asymptomatic genomes were hypothetical proteins, a putative yfdA, an acetyltransferase, a putative pyridoxamine 5-phosphate-dependent dehydrase, a glycosyl transferase family protein, and plasmid conjugal transfer-associated proteins (Supplementary Tables 8 and 9). These analyses provide targets for the functional characterization of these gene products in pathogenesis.


The whole-genome sequencing and phylogenomic analysis of 70 EPEC isolates from children enrolled in GEMS2,3 demonstrated that E. coli clinical isolates identified as EPEC based only on their virulence factor content exhibit considerable genomic diversity. Phylogenomic analysis demonstrated that 61% (43/70) of the EPEC isolates examined occupy previously undescribed phylogenomic lineages. This study may have identified newly circulating EPEC in the GEMS sites, but may also highlight the dynamic evolutionary processes that are at work in E. coli pathogens. Of note, a recent study on EPEC demonstrated a shift in the epidemiology from tEPEC to aEPEC isolates1, but this study focused on the tEPEC isolates associated with an adverse outcome. The current study is not meant to be a comprehensive genomic view of all the tEPEC collected with GEMS, but a focused attempt to identify genetic factors associated with the isolates from the most severe outcomes.

The EPEC genome comparisons demonstrated that the degree of genomic difference was greater when comparing the extremes of the clinical presentation, LI to AI genomes, than it was when comparing LI to NSI, or NSI to AI (Table 3). This emphasizes the finding from the phylogenomic analysis that isolates associated with a particular clinical outcome can occur in distantly related EPEC phylogenomic lineages (Fig. 1). Thus, the smaller number of genomic differences identified between the lethal and non-lethal EPEC isolates suggests the differences in the illness severity caused by these isolates may have less to do with the bacterium and more to do with host factors including, but not limited to, co-morbidities, the microbiome, diet, breast-feeding and access to medical care. Overall, these findings suggest that there is not a single gene or genomic region that is responsible for particular EPEC isolates causing more severe clinical outcomes, but it may instead require a collection of genomic regions acting in concert, as well as responding to host factors that will result in more severe infection by EPEC. The gene clusters that are more prevalent in the genomes of EPEC from different clinical outcomes provide a genomic view of what potentially makes certain EPEC isolates more virulent. Among these were many genes with unknown functions, including some that contain predicted protein domains of membrane-associated or secreted proteins that can be investigated for their contribution to the virulence mechanism of EPEC and potentially other pathogenic E. coli. A recent study by Hazen et al.33 describes the comparative transcriptome analysis of four prototype EPEC isolates: E2348/69 (EPEC1), B171 (EPEC2), C581-05 (EPEC4) and E110019 (prototype aEPEC isolate)33. That study also identified transcriptional variation among these prototype isolates33. 
Further investigation is required to examine the transcriptional variation among the new EPEC lineages described in the current study. The combination of genomics and transcriptomics will provide further insight into the conserved and expressed EPEC features involved in virulence.

Large-scale comparative genomic studies that assess the diversity of disease-causing bacteria associated with multiple types of clinical outcome, such as this, provide a framework for understanding the processes that underlie the evolution of pathogenesis. This study describes a number of phylogroup- and lineage-specific differences in the virulence factor and genome content, which suggests that EPEC isolates have continued to acquire genetic changes since their initial acquisition of some of the pathovar-defining features. These studies can also provide insight into the ongoing evolution of the virulence mechanisms of disease-causing bacteria. The emergence of diarrhoea-causing EPEC and the severity of illness attributed to these isolates depend on a suite of genes that includes both lineage-specific virulence factors and genes encoded by plasmid and phage. These regions will provide fertile ground for the examination of EPEC pathogenesis and the development of a possible vaccine against EPEC in the future.


Bacterial isolates

The bacterial isolates analysed in this study, and the details of each of the genomes sequenced, are listed in Table 1 and also described in a companion study34. The EPEC (LEE+/bfpA+/stx–) isolates analysed in this study were obtained from GEMS as previously described2,23. A total of 24 tEPEC isolates from lethal cases (LI) were obtained, representing all tEPEC isolates associated with a lethal outcome in GEMS2,23. The isolates from lethal outcomes were from only five sites of the seven in GEMS (The Gambia, Mali, Mozambique, Kenya and Pakistan), so there is an over-representation of isolates from Africa. A matching scheme using geography and clinical parameters of the subject was used to select one EPEC isolate from a non-lethal symptomatic case (NSI) and one EPEC isolate from an asymptomatic case (AI) representing controls for each tEPEC from a lethal case as previously described34. One NSI case and one AI case served as controls for two different LI cases, resulting in 23 EPEC from NSI cases and 23 EPEC from AI cases. A tEPEC isolate (bfpA+) was obtained from 20 of the NSI cases and 17 of the AI cases, with the remaining EPEC cases containing an aEPEC (bfpA–). The recent publication by Donnenberg et al.34 describes the case-control aspect of this study and the comparison of the isolates that were directly matched based on patient and clinical parameters. In the current study we delve into the phylogenomic content of the isolates, irrespective of matching criteria, and only consider the genotypic presentation of EPEC and the outcome of the infection.

Genome sequencing and assembly

Genomic DNA was isolated from each strain by growing a single colony that was PCR-positive for the LEE-encoded gene escV and/or the EAF plasmid gene bfpA, overnight, in Luria-Bertani (LB) medium at 37 °C with shaking. The genomic DNA was isolated from the overnight culture using the GenElute Genomic kit (Sigma-Aldrich), then sequenced and assembled as previously described18,34.

Phylogenomic analysis

The 70 EPEC genomes sequenced in this study were compared with 37 previously sequenced E. coli and Shigella genomes by whole-genome phylogenomic analysis as previously described18,35.
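As described in the Fig. 1 legend, only regions that aligned in all genomes were concatenated before tree building. The sketch below illustrates that filtering and concatenation step in Python under simplifying assumptions: each alignment block is represented as a dictionary mapping a genome name to its aligned row, and the function name `concatenate_core_blocks` and the toy sequences are hypothetical. It does not parse Mugsy output and is not the pipeline used in the study.

```python
from typing import Dict, List


def concatenate_core_blocks(
    blocks: List[Dict[str, str]], genomes: List[str]
) -> Dict[str, str]:
    """Concatenate, per genome, the alignment blocks that contain every genome."""
    parts: Dict[str, List[str]] = {genome: [] for genome in genomes}
    for block in blocks:
        # Skip blocks in which one or more genomes did not align.
        if not all(genome in block for genome in genomes):
            continue
        if len({len(block[genome]) for genome in genomes}) != 1:
            raise ValueError("rows in an alignment block must have equal length")
        for genome in genomes:
            parts[genome].append(block[genome])
    return {genome: "".join(rows) for genome, rows in parts.items()}


# Toy input: the second block lacks genome C and is therefore excluded.
toy_blocks = [
    {"A": "ACGT", "B": "ACGA", "C": "ACGT"},
    {"A": "GGG", "B": "GGC"},
    {"A": "TT", "B": "TA", "C": "TT"},
]
core_alignment = concatenate_core_blocks(toy_blocks, ["A", "B", "C"])
print(core_alignment)
```

Every genome ends up with a sequence of identical length, which is the property a concatenated alignment needs before it is passed to a maximum-likelihood program.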

Gene alignments and phylogenetic analyses

The individual gene phylogenies of eae and bfpA were generated as described previously18. The nucleotide sequences were aligned in MEGA536 using the ClustalW algorithm37. A maximum-likelihood phylogeny was then constructed using the Kimura two-parameter model of distance estimation38 with 1,000 bootstrap replications.
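For readers unfamiliar with the Kimura two-parameter model, the sketch below computes the corresponding pairwise distance, d = −½ ln[(1 − 2P − Q)·√(1 − 2Q)], where P and Q are the proportions of transitional and transversional differences between two aligned sequences. This illustrates the substitution model only; it is not a reimplementation of the MEGA5 maximum-likelihood analysis, and the function name `k2p_distance` and the example sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned nucleotide sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    argument = (1 - 2 * p - q) * math.sqrt(max(0.0, 1 - 2 * q))
    if argument <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(argument)


# One transition (A->G) across ten aligned sites.
d_example = k2p_distance("AAAAACCCCC", "GAAAACCCCC")
print(round(d_example, 4))
```

The correction grows faster than the raw proportion of differences, reflecting multiple substitutions at the same site.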

A phylogenetic analysis of seven conserved housekeeping genes that have been used for MLST was generated for the isolates characterized in this study compared to a collection of previously sequenced EPEC and other E. coli isolates as previously described18,20. The EPEC1–4 reference sequence types (STs) included in the phylogeny are those identified by Lacher et al.20, while the EPEC5 and EPEC6 reference sequences were described by Tennant and co-workers19.

BSR analysis

The presence or absence of known virulence-associated genes in the genome sequences generated in this study was determined using BLAST score ratio (BSR) analysis, as described previously18,27,28. The protein-encoding genes that were considered present with significant similarity had BSR values of ≥0.8, while those with BSR values <0.8 but ≥0.4 were considered to be present but divergent.
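The thresholds above can be expressed as a small classification helper. In this sketch the BSR is taken to be the bit score of a reference gene against a genome divided by the bit score of that gene against itself; values ≥0.8 are called present and values from 0.4 up to 0.8 are called divergent, as stated above. Labelling values below 0.4 as "absent" is an assumption made for the illustration, and the function names and example bit scores are hypothetical.

```python
def blast_score_ratio(query_bit_score: float, self_bit_score: float) -> float:
    """Ratio of the bit score against a genome to the reference self-hit bit score."""
    if self_bit_score <= 0:
        raise ValueError("self bit score must be positive")
    return query_bit_score / self_bit_score


def classify_bsr(bsr: float) -> str:
    """Map a BSR value onto the categories used in the text."""
    if bsr >= 0.8:
        return "present"
    if bsr >= 0.4:
        return "divergent"
    return "absent"  # assumption: the text does not label values below 0.4


# Hypothetical bit scores for one virulence gene queried against three genomes.
self_score = 500.0
calls = {
    genome: classify_bsr(blast_score_ratio(score, self_score))
    for genome, score in {"genome_1": 450.0, "genome_2": 250.0, "genome_3": 60.0}.items()
}
print(calls)
```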

The level of similarity of protein-encoding genes was compared across genomes in this study using a large-scale BLAST score ratio (LS-BSR) analysis as previously described18,28,29. The gene clusters were assigned using a stringent nucleotide identity threshold of ≥90% (Data Set S1), or using a more inclusive nucleotide identity threshold of ≥80% (Data Set S2). The LS-BSR analysis performed using the more inclusive clustering threshold of ≥80% included the 70 genomes in this study and three commensal genomes: E. coli HS (NC_009800.1), K-12 (NC_000913.3) and SE11 (NC_011415.1). The predicted protein function of each gene cluster was determined using an ergatis-based39 in-house annotation pipeline40.

Hierarchical cluster analysis41 of the LS-BSR gene clusters associated with particular clinical outcomes was performed using Pearson correlation with average linkage using MeV42. The gene clusters compared were considered either present (blue) with an LS-BSR of ≥0.9 (with 90% clustering threshold) or ≥0.8 (with 80% clustering threshold) or absent (white) when <0.9 or <0.8.
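The clustering criterion can be illustrated with a small pure-Python sketch of agglomerative clustering that uses a Pearson correlation distance (1 − r) and average linkage. It is a naive illustration on toy presence/absence profiles, not a reproduction of the MeV analysis; the function names, genome labels and the handling of constant profiles (distance set to 1.0) are assumptions.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Return 1 - Pearson r; constant profiles get distance 1.0 (assumption)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 1.0
    return 1.0 - cov / math.sqrt(var_x * var_y)


Cluster = Tuple[str, ...]


def average_linkage(profiles: Dict[str, List[int]]) -> List[Tuple[Cluster, Cluster, float]]:
    """Repeatedly merge the two clusters with the smallest mean pairwise distance."""
    dist = {
        frozenset((a, b)): pearson_distance(profiles[a], profiles[b])
        for a, b in combinations(profiles, 2)
    }

    def linkage(c1: Cluster, c2: Cluster) -> float:
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters: List[Cluster] = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(tuple(sorted(c1 + c2)))
    return merges


# Toy presence (1) / absence (0) profiles over five hypothetical gene clusters.
toy_profiles = {
    "g1": [1, 1, 0, 0, 1],
    "g2": [1, 1, 0, 0, 0],
    "g3": [0, 0, 1, 1, 0],
    "g4": [0, 0, 1, 1, 1],
}
merge_history = average_linkage(toy_profiles)
for left, right, height in merge_history:
    print(left, right, round(height, 3))
```

Genomes with similar gene-content profiles (g1 with g2, g3 with g4) join first, and the two pairs are joined last at a much greater distance.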

Statistical analysis

Statistical significance of the prevalence of predicted gene clusters among genomes associated with different symptomatic groups was determined using Pearson's chi-square test with Yates' continuity correction when the number of genomes was five or more, or Fisher's exact test when the number of genomes in one or both groups being compared was less than five, calculated using R v. 3.1.143. P values of <0.05 were considered statistically significant.
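The test selection described above can be sketched in Python using only the standard library. The sketch builds a 2×2 table (cluster present or absent by outcome group) and returns a P value from either a Yates-corrected chi-square test or a two-sided Fisher's exact test. The analysis in the study was performed in R; this is an independent illustration, and it assumes that "the number of genomes" refers to the number of genomes in each group that carry the gene cluster. The function names and the example counts are hypothetical.

```python
import math


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test: sum the probabilities of all tables with
    the same margins that are no more likely than the observed table."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = sum(
        prob(x) for x in range(lowest, highest + 1) if prob(x) <= observed * (1 + 1e-7)
    )
    return min(1.0, total)


def chi_square_yates(a: int, b: int, c: int, d: int) -> float:
    """Pearson's chi-square test (1 d.f.) with Yates' continuity correction."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 1.0
    statistic = n * max(0.0, abs(a * d - b * c) - n / 2) ** 2 / denominator
    # Survival function of the chi-square distribution with one degree of freedom.
    return math.erfc(math.sqrt(statistic / 2))


def cluster_association_p(present_1: int, total_1: int, present_2: int, total_2: int) -> float:
    """Choose the test from the smaller count of cluster-positive genomes (assumption)."""
    a, b = present_1, total_1 - present_1
    c, d = present_2, total_2 - present_2
    if min(present_1, present_2) < 5:
        return fisher_exact_two_sided(a, b, c, d)
    return chi_square_yates(a, b, c, d)


# Hypothetical cluster found in 20 of 24 LI genomes and 10 of 23 AI genomes.
p_large = cluster_association_p(20, 24, 10, 23)
# Hypothetical small counts, which fall back to Fisher's exact test.
p_small = cluster_association_p(3, 4, 1, 4)
print(p_large < 0.05, p_small < 0.05)
```

Because thousands of gene clusters are tested, an unadjusted threshold of P < 0.05 is best read as a screen for candidates rather than as confirmation of any single association.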

Accession numbers

The genome sequence assemblies generated in this study were deposited in GenBank under the accession numbers listed in Table 1.