Whole Genome Sequencing demonstrates that Geographic Variation of Escherichia coli O157 Genotypes Dominates Host Association

Genetic variation in an infectious disease pathogen can be driven by ecological niche dissimilarities arising from different host species and different geographical locations. Whole genome sequencing was used to compare E. coli O157 isolates from host reservoirs (cattle and sheep) from Scotland and to compare genetic variation of isolates (human, animal, environmental/food) obtained from Scotland, New Zealand, the Netherlands, Canada and the USA. Nei’s genetic distance calculated from core genome single nucleotide polymorphisms (SNPs) demonstrated that the animal isolates were from the same population. Investigation of the Shiga toxin bacteriophages and their insertion sites (SBI typing) revealed that cattle and sheep isolates had statistically indistinguishable rarefaction profiles, diversity and genotypes. In contrast, isolates from different countries exhibited significant differences in Nei’s genetic distance and SBI typing. Hence, after successful international transmission, which has occurred on multiple occasions, local genetic variation occurs, resulting in a global patchwork of continental and trans-continental phylogeographic clades. These findings are important for three reasons: first, understanding transmission and evolution of infectious diseases associated with multiple host reservoirs and multiple geographic locations; second, highlighting the relevance of the sheep reservoir when considering farm-based interventions; and third, improving our understanding of why human disease incidence varies across the world.


SNP data
The SNP data for the Scottish phylogenetic tree (Fig. 1 and Fig. 2) are in Supplementary Table S2 (Supplementary_Table_S2.xlsx).
The SNP data for the international phylogenetic tree (Fig. 4), as well as the locations in Sakai (NC_002695.1) of the discriminatory SNPs for clades A-G and the corresponding probes, are in Supplementary Table S3 (Supplementary_Table_S3.xlsx).

Current File
The current file contains Suppl. Figs S1-S4 and Suppl. Tables S4 to S9.

Fig. S1a depicts the phylogenetic tree (see also Fig. 1). The tree was drawn utilising 871 phylogenetically informative SNPs obtained from Panseq. A Neighbor-Joining method was employed 1. The optimal tree with the sum of branch length = 1.249 is shown. The evolutionary distances were computed using the Maximum Composite Likelihood method 2 and are in the units of the number of base substitutions per site. All positions containing gaps and missing data were eliminated. There were a total of 882 positions in the final dataset. Evolutionary analyses were conducted in MEGA6 3. Fig. S1b shows the bootstrap consensus tree of this dataset, inferred from 500 replicates and taken to represent the evolutionary history of the taxa analyzed 4. Branches corresponding to partitions reproduced in fewer than 50% of bootstrap replicates are collapsed. The evolutionary distances were computed using the Maximum Composite Likelihood method 2 and are in the units of the number of base substitutions per site.
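The legend above states that all positions containing gaps and missing data were eliminated before distances were computed. The following Python sketch illustrates that "complete deletion" step on a toy alignment. It is not the MEGA6 implementation: the isolate names and sequences are invented, and the distance shown is the simple uncorrected proportion of differing sites (p-distance), used here as a stand-in for the Maximum Composite Likelihood distance named in the legend.

```python
from itertools import combinations

# Only unambiguous bases are kept; gaps ("-") and missing/ambiguous
# calls (e.g. "N") cause the whole alignment column to be dropped.
VALID_BASES = set("ACGT")


def complete_deletion(alignment):
    """Remove every column in which any sequence has a gap or missing base."""
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must have the same aligned length")
    keep = [i for i in range(length) if all(seq[i] in VALID_BASES for seq in seqs)]
    return {name: "".join(seq[i] for i in keep) for name, seq in zip(names, seqs)}


def p_distance(seq_a, seq_b):
    """Uncorrected proportion of differing sites (a simplification, not MCL)."""
    if not seq_a:
        raise ValueError("no positions left after deletion")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def distance_matrix(alignment):
    """Pairwise p-distances computed on the gap-free alignment."""
    cleaned = complete_deletion(alignment)
    return {
        (m, n): p_distance(cleaned[m], cleaned[n])
        for m, n in combinations(cleaned, 2)
    }


# Hypothetical toy alignment, not data from the study.
toy = {
    "isolate_1": "ACGTAC-TAG",
    "isolate_2": "ACGTTCGTAG",
    "isolate_3": "ACNTTCGTAA",
}
cleaned = complete_deletion(toy)
distances = distance_matrix(toy)
```

In this toy case two of the ten columns are removed (one contains a gap, one an N), leaving eight positions from which the pairwise distances are computed. A distance matrix of this kind is the input that a Neighbor-Joining procedure would take.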

Fig. S2 illustrates the phylogeny of the above E. coli O157 isolates in radiation format. The isolate labels are coloured to represent their LSPA6 lineage, whether their tir 255 SNP is A or T, which of Manning's clades they belong to, and their clade membership according to Fig. 3 in the current paper.

Fig. S3a depicts the bootstrap phylogenetic tree of the combined international and Scottish E. coli O157 isolates (see also Fig. 3). A Neighbor-Joining method was employed 1. The tree was drawn utilising 813 phylogenetically informative SNPs obtained from Panseq (Suppl. Table S3). The optimal tree with the sum of branch length = 0.7884 is shown. The evolutionary distances were computed using the Maximum Composite Likelihood method 2 and are in the units of the number of base substitutions per site. All positions containing gaps and missing data were eliminated. There were a total of 812 positions in the final dataset. Evolutionary analyses were conducted in MEGA6 3. Fig. S3b shows the bootstrap consensus tree of this dataset, inferred from 500 replicates and taken to represent the evolutionary history of the taxa analyzed 4. Branches corresponding to partitions reproduced in fewer than 50% of bootstrap replicates are collapsed. The evolutionary distances were computed using the Maximum Composite Likelihood method 2 and are in the units of the number of base substitutions per site.

b tir 255T>A polymorphism analysis was used to detect a single nucleotide polymorphism (A/T) located in the tir gene 9, 12.
c Clade Identification (Manning) utilised an in silico probe-based method using four SNPs 13 to identify the main Clades (1, 2, 3 and 8 independently and 4, 5, 6, 7 and 9 as a group) published previously 10. The probes were developed using the Sakai reference genome (BA000007.2).
d SBI results were concatenated using the terminology described previously 14 and restated here. When either or both bacteriophage-chromosome insertion site locus junctions were detected, the locus was considered occupied. When an intact locus insertion site product was detected without amplification of either bacteriophage insertion site junction, the locus was considered unoccupied. A modified genotyping code was assigned to each isolate using the characters A, S, W, Y, 1a, 2a and 2c to represent argW, sbcB, wrbA, yehV, stx1a, stx2a and stx2c, respectively. An N was assigned to any isolate that did not contain any of the three stx genes. Statistical differences in the frequencies of SBI types between animal host species and between countries were determined using the chi-square test.
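The SBI coding rules and the chi-square comparison described above can be sketched in Python as follows. This is an illustrative sketch only, not the authors' pipeline. The function names, the input layout (a pair of junction results per locus), the concatenation order (taken from the order in which the characters are listed in the text) and the placement of N at the end of the code are all assumptions. The counts in the example table are invented and are not results from the study.

```python
# Characters used for occupied insertion loci and detected stx genes.
# Using the order listed in the text as the concatenation order is an assumption.
LOCI = [("argW", "A"), ("sbcB", "S"), ("wrbA", "W"), ("yehV", "Y")]
STX = [("stx1a", "1a"), ("stx2a", "2a"), ("stx2c", "2c")]


def locus_occupied(left_junction, right_junction):
    """A locus is occupied when either or both phage-chromosome junctions are detected."""
    return left_junction or right_junction


def sbi_code(junctions, stx_genes):
    """Build an SBI genotype code.

    junctions: dict mapping locus name to (left_detected, right_detected);
               loci absent from the dict are treated as unoccupied.
    stx_genes: set of detected stx subtypes, e.g. {"stx2a", "stx2c"}.
    """
    loci_part = "".join(
        char
        for locus, char in LOCI
        if locus_occupied(*junctions.get(locus, (False, False)))
    )
    stx_part = "".join(char for gene, char in STX if gene in stx_genes)
    # "N" marks isolates with none of the three stx genes (placement assumed).
    return loci_part + (stx_part if stx_part else "N")


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    table: list of rows of observed counts (rows = groups, columns = SBI types).
    Every row and column total must be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical isolates.
code_a = sbi_code(
    {"argW": (True, False), "wrbA": (False, False), "yehV": (True, True)},
    {"stx2a", "stx2c"},
)
code_b = sbi_code({"yehV": (False, True)}, set())

# Hypothetical counts of two SBI types in two host species.
statistic, dof = chi_square_statistic([[10, 20], [20, 10]])
```

The statistic would then be compared against the chi-square distribution with the returned degrees of freedom to obtain a p-value; that step is left out here to keep the sketch dependency-free.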