Global population structure and genotyping framework for genomic surveillance of the major dysentery pathogen, Shigella sonnei

Shigella sonnei is the most common agent of shigellosis in high-income countries, and causes a significant disease burden in low- and middle-income countries. Antimicrobial resistance is increasingly common in all settings. Whole genome sequencing (WGS) is increasingly utilised for S. sonnei outbreak investigation and surveillance, but comparison of data between studies and labs is challenging. Here, we present a genomic framework and genotyping scheme for S. sonnei to efficiently identify genotype and resistance determinants from WGS data. The scheme is implemented in the software package Mykrobe and tested on thousands of genomes. Applying this approach to analyse >4,000 S. sonnei isolates sequenced in public health labs in three countries identified several common genotypes associated with increased rates of ciprofloxacin resistance and azithromycin resistance, confirming intercontinental spread of highly-resistant S. sonnei clones and demonstrating the genomic framework can facilitate monitoring the spread of resistant clones, including those that have recently emerged, at local and global scales.


Supplementary
: Partitions detected in the discovery genomes using fastbaps (Bayesian hierarchical clustering on SNV matrix). a fastbaps partitions (levels 1, blue; and 2, green) plotted against the maximum likelihood phylogeny. b Boxplots of pairwise SNV distances (log scale, N=1,873,081 pairwise distances) between discovery genomes at the two levels of fastbaps partitioning. Boxes indicate the median (bold line), 25 th to 75 th percentiles (box), and the 5 th and 95 th percentile (whiskers), with outliers shown as points. Note that this form of clustering results in two levels that while hierarchical (level 2 nested within level 1) and generally monophyletic on the tree, are not coherent in terms of genetic divergence (approximated by SNV distance), which is a desirable property of a genotyping scheme. Boxes indicate       This study used WGS to investigate isolates from two outbreaks in California in the context of local historical isolates: an outbreak of ciprofloxacin resistant (Cip R ) cases in San Francisco, and an unrelated outbreak of Shiga toxin-positive cases in San Diego and San Joaquin. Genome-wide mapping-based SNV calling and phylogenetic analysis showed the two outbreaks formed tight clusters that were distant from one another, and from local historical isolates, in the phylogenetic tree. They also constructed additional genome-wide phylogenies incorporating 188 isolates from prior studies of globally distributed S. sonnei 1 , in order to identify which of the main lineages (I, II, III, Global III, IV) their outbreak clusters belonged to; this identified the San Diego/San Joaquin as lineage III, and the San Francisco Cip R outbreak as belonging to Global III and clustering with the South-Asia clade of isolates sharing the same three QRDR mutations explaining the Cip R phenotype.
We downloaded the Illumina reads for the California isolates via the accessions indicated and genotyped them using Mykrobe (compute time, 1 minute per genome). Supplementary Figure 7 reproduces Figure 3 from Kozyreva et al, onto which we have annoted the Mykrobe-derived genotypes. This identified the historical isolates as lineage 1 (n=1), lineage 2 (n=3) or clades within lineage 3 (genotypes 3.1 and 3.7, n=6), distinct from the two outbreaks and consistent with the whole-genome phylogeny. Genomes from the San Francisco Cip R outbreak were identified as 3.6.1.1, the well-described CipR clade, and carrying its typical 3 QRDR mutations; those from the San Diego/San Joaquin outbreak were identified as genotype 3.7.18, a subclade of Global III (3.7). Genotype 3.7.18 was defined on the basis of isolates from Latin America, consistent with epidemiological investigations of the outbreak which implicated travel from Mexico as the likely route of introduction of the outbreak strain into California. Thus findings from genotyping replicate those derived by the authors using more complex comparative methods.
Notably, Kozyreva et al found that all genomes from their outbreak carried the Shiga-toxin virulence genes stx1A and stx1B. We screened all 3.7.18 genomes in our dataset (n=297) but were unable to detect these genes in our genomes, suggesting that the presence of these virulence factors was limited to this outbreak, and is not a widespread feature of the genotype. This study examines 25 S. sonnei genomes isolated between 2016 and 2019 in Switzerland; 14 were resistant to third-generation cephalosporins and 11 were sensitive. The authors sequenced the genomes of these isolates and subjected them to cgMLST using Enterobase 9 , which identified 18 cgSTs. They then constructed a tree comprising their novel genomes, the global collection of S. sonnei from Holt et al 2012 1 and other isolates from Enterobase with matching cgSTs by extracting SNVs from the cgMLST loci. Based on the SNV tree the authors identified four monophyletic clusters associated with different ESBL genes, three (clusters 1-3) grouping with Global III and one (cluster 4) grouping with Lineage III but outside Global III. Notably two of these clusters comprised multiple cgSTs; i.e. cgST alone was not enough to identify the clusters. The authors concluded that there were multiple plasmids carrying ESBL genes present in different clusters, with no single plasmid responsible for ESBL strains in Switzerland. They also identified a small number of isolates with matching cgSTs in the Enterobase database, noting that (i) all Swiss ESBL cgSTs identified in 2019 were also detected in the same year in UK and France; and (ii) cluster 1 shared a cgST with two older isolates from Iran and Egypt.

Example 2 -Study of ESBL S. sonnei in Switzerland
We downloaded the assemblies for the Swiss genomes using the accessions provided in the study, simulated reads, and genotyped them using Mykrobe (compute time, 30 seconds per genome).
The results are indicated in the 'Genotype' and 'Genotype Name' columns in Supplementary Data 3, and identified five genotypes amongst ESBL-positive strains, which map to the four clusters identified by cgST plus an additional singleton strain (3.6.1.1). Genotyping yielded essentially the same information as the combined cgMLST and tree analysis, while also providing clearer links to known MSM clusters circulating in Europe and identifying cluster 4 and the singleton strain as ciprofloxacin resistant: (i) cluster 4 belongs to genotype 3.6. While both cgSTs and genotypes provide a straightforward way of searching databases for related strains, this example demonstrates that cgMLST can over-classify genomes in ways that obscure close relationships, as genomes belonging to the same genotype frequently belong to multiple cgSTs that were not readily identified as related without specific further analysis (by inspecting pairwise SNV distances, locus-variant distances or hierarchical clustering). In contrast, the hierarchical nomenclature used in our genotyping scheme ensures that defining genotypes at higher resolution has no disadvantage, as relationships between genotypes can easily be recognised without need for comparative analysis. This study describes three cases of S. sonnei in the local MSM community in Switzerland. To determine the relationships of these isolates to one another, to sporadic S. sonnei in Switzerland, and to recently sequenced MSM-associated isolates from the UK, the authors sequenced their three cases plus another seven non-MSM-related local S. sonnei, and inferred a tree incorporating these plus a selection of 32 publicly available MSM-associated sequences from the UK 10 using genome-wide read mapping, SNV calling and neighbour-joining tree construction. Based on the tree, the authors concluded that: (i) the MSM cases were not closely related to the local non-MSM strains; (ii) two MSM cases were closely related to one another, and clustered with the previously reported UK MSM cluster 1 10 ; (iii) the third case was not related to this pair of cases, but clustered with UK MSM cluster 7.