Population structure analysis and laboratory monitoring of Shigella by core-genome multilocus sequence typing

The laboratory surveillance of bacillary dysentery is based on a standardised Shigella typing scheme that classifies Shigella strains into four serogroups and more than 50 serotypes on the basis of biochemical tests and lipopolysaccharide O-antigen serotyping. Real-time genomic surveillance of Shigella infections has been implemented in several countries, but without the use of a standardised typing scheme. Here, we study over 4000 reference strains and clinical isolates of Shigella, covering all serotypes, with both the current serotyping scheme and the standardised EnteroBase core-genome multilocus sequence typing scheme (cgMLST). The Shigella genomes are grouped into eight phylogenetically distinct clusters, within the E. coli species. The cgMLST hierarchical clustering (HC) analysis at different levels of resolution (HC2000 to HC400) recognises the natural population structure of Shigella. By contrast, the serotyping scheme is affected by horizontal gene transfer, leading to a conflation of genetically unrelated Shigella strains and a separation of genetically related strains. The use of this cgMLST scheme will facilitate the transition from traditional phenotypic typing to routine whole-genome sequencing for the laboratory surveillance of Shigella infections.


Supplementary Methods
Other studied genomes EnteroBase 1 was queried on 11 November 2020, to identify genomes with new HC1100/HC400 combinations (i.e., not present in our reference and routine datasets) for a given serotype, determined by in silico analysis of the rfb cluster. We selected at least one genome for each unique combination and a maximum of five if the genomes concerned appeared to come from epidemiologically unrelated isolates. In addition, if a serotype was represented by only one strain in our collection, we selected another genome from EnteroBase, when possible. This selection resulted in 81 Shigella genomes (reference+ dataset). The 27 enteroinvasive E. coli (EIEC) included in this study belonged to the various previously described EIEC clusters [2][3][4][5] .
We discarded four (ECOR 7, 23, 32, 43) of the 72 E. coli strains from the ECOR collection 6 from this study due to discrepant results for MLST and/or Clermont typing between the studies of Galardini and coworkers 7 , Clermont and coworkers 8 , and EnteroBase 1 . These 81 additional Shigella genomes, the 68 E. coli genomes from the ECOR collection, and the 27 EIEC genomes are listed in Supplementary Data 1.
The third cluster, HC2000_192, comprised S. boydii prov. E1621-54 (now proposed as S. boydii 22, Supplementary Notes section "Updating the Shigella typing scheme) and all serotypes and subserotypes of S. flexneri, with the exception of S. flexneri 6, which grouped in S1 (Fig. 4).
This cluster seems to correspond to the Cluster 3 reported by Pupo and coworkers 11 , except that S. boydii 12 rather than S. boydii prov. E1621-54 was reported in Cluster 3 in this previous study (Supplementary Notes section "Discrepancies with published studies"). This HC2000_192 cluster, hereafter referred to as S3, could be divided into seven distinct HC1100 clusters (Fig. 4a). One of these S3 subclusters, HC1100_11429, contained exclusively S. boydii prov. E1621-54. The other six HC1100 clusters contained two or more S. flexneri serotypes per cluster. Connor and coworkers 12 previously subdivided >350 genomes of S. flexneri 1-5, X, Y into seven phylogenetic groups (PGs), based on a Bayesian analysis of population structure. As 140 S. flexneri genomes from our study were included in the analysis by Connor and coworkers 12 , we compared the clustering by cgMLST HC1100 to that obtained by PG.

Discrepancies with published studies
The SB12 cluster was not identified by Pupo and coworkers 11 , or by Yang and coworkers 13 .
Instead, the single S. boydii 12 strain studied in each of these studies was assigned to their cluster 3 (our cluster S3), in place of S. boydii prov. E1621-54 (not included in their studies).
As S. boydii prov. E1621-54 is serologically related to S. boydii 12 (ref. 14 ), it seems likely that their strains were actually S. boydii prov. E1621-54 but were mistakenly serotyped as S. boydii 12 due to the cross-agglutination between the two serotypes. We also encountered this problem when typing Shigella isolates collected from children in Niger 15 . The nine indole-negative isolates obtained, reported to be S. boydii 12 at the time, were sequenced and found to belong to HC1100_11429, within cluster S3. Following the use of a typing serum against strain E1621-54, they were reclassified as S. boydii prov. E1621-54 (now proposed as S. boydii 22, Supplementary Notes section "Updating the Shigella typing scheme"). Furthermore, Pupo and coworkers 11 did not report S. boydii serotype 11 in both clusters S1 and S2, but only in S2. A genomic study of 117 Shigella isolates belonging to validated serotypes found seven Shigella clusters following a SNV-based ML phylogeny analysis 5 . The SB12 cluster was not found, and their S. boydii 12 strain instead clustered with their cluster 5 (S2 in our study). The analysis of this genome (SRR2994194) with cgMLST and rfb cluster analysis revealed that it was actually S. dysenteriae 2. Their S. boydii 9 strain was also placed in their cluster 11 (our S1), whereas we found this serotype in our S2. The genome they reported (DRR015923) was actually a contamination of S. boydii 9 (S1) with S. boydii 10 (S2). A recent preprint paper was more comprehensive and used a selection of 485 Shigella genomes, selected from over 17,000 publicly available genomes based on ribosomal MLST and ST types 16 . An undescribed phylogeny identified three Shigella clusters and seven outliers. The three clusters and five outliers were similar to those from our study. The two remaining, still assigned to Shigella, were clusters CSB13 and CSB13-atypical, actually corresponding to E. albertii, and attaching and effacing (A/E) E. coli, respectively. The status of the provisional serotypes was, however, not addressed adequately, and the continual reporting of different provisional or novel serotypes that are actually identical creates unnecessary confusion for the international surveillance of Shigella infections. This issue has now been dealt with by our work (Supplementary Notes section "Updating the Shigella typing scheme").

Genomic analysis of metabolic markers used in the current Shigella typing scheme
An inability to use/ferment mannitol is a key marker of the S. dysenteriae serogroup. However, it was also observed in some strains of S. boydii 14, S. flexneri 4, and S. flexneri 6 (ref. 17 ). The mannitol (mtl) operon contains three genes: mtlA encoding the mannitol permease, mtlD encoding the mannitol-1-phosphate dehydrogenase, and mtlR encoding the mannitol repressor 18,19 . Different mechanisms of mannitol (mtl) operon disruption associated with the mannitol-negative trait have occurred independently in S. dysenteriae. In the four different subclusters of S1 containing this serotype, remnants of the mtlA and mtlD genes, flanked by Insertion Sequences (ISs) IS1 and IS3, are present in S1a genomes; a remnant of mtlR disrupted by one IS1 is present in S1c genomes; and the entire mtl operon is lacking in S1d genomes, represented only by S. dysenteriae 7. The analysis of a PacBio sequence of S. dysenteriae 7 strain ATCC 9052 (ref. 20 ), not included in our study, showed a large IS-driven deletion of the entire operon. In the single S2 subcluster containing S. dysenteriae, S2d, mtlR and a remnant of mtlD were present at a chromosomal location different from that in E. coli K-12. The mannitol operon was entirely lacking in the three outlier groups of S. dysenteriae (SD1, SD8, and SD10).
All strains from the S2 cluster were indole-positive, whereas those from S1 were indolenegative, with one exception: S. dysenteriae 7 (indole-positive), assigned to S1d. The loss of indole production has been reported to be due to IS-mediated insertions and/or deletions damaging the tryptophanase (tna) operon in a limited number of strains 21 . This operon, which is 3,144 bp long in E. coli K-12, contains three genes: tnaA encoding a tryptophanase, tnaB encoding a permease, and tnaL encoding a 25-residue leader peptide 21 . We confirm here that the loss of indole production is associated with the insertion of an IS1 at base 55 of tnaL in the genomes from subclusters S1a, S1b and S1c. In subcluster S1e, containing rare S. flexneri 6 genomes, the tna alteration occurred independently, but also involved IS1 integration, this time in the promoter of tna. Additional damage (deletions, other IS insertions) was also observed in individual genomes from these subclusters.

Aerogenic strains of S. boydii 14 and S. dysenteriae 3
By definition, Shigella strains do not produce gas from carbohydrate fermentation (except for S. flexneri 6), and S. dysenteriae strains do not ferment mannitol. The status of aerogenic strains of S. boydii 14 long remained a matter of debate. First described in 1943 by Sachs, an aerogenic and mannitol-negative strain referred to as "Enterobacterium A12" (ref. 22 ), was considered, by several authors, to be a variant of S. boydii 14 (refs. 23-25 ), but was considered by Ewing 26,27 to be an E. coli O32. We considered two such aerogenic isolates, including the original Sachs strain A12 (CIP 53.44), in the reference dataset. They were grouped with typical S. boydii 14 genomes in the S1b subcluster (including, in particular, the reference strain of this serotype, 2770-51). All five recent S. boydii 14 isolates from our routine surveillance dataset were also atypical, as they did not ferment mannitol, and three were aerogenic. Another atypical biotype was described for S. dysenteriae 3. Two of these aerogenic and mannitol-positive S. dysenteriae 3 strains were described in Poland 28 and the UK 14 . We studied the first of these strains (Polska 64-3840). Unlike aerogenic S. boydii 14, strain Polska 64-3840 was not placed by cgMLST in the S1a subcluster with other S. dysenteriae 3, but in S1b, with S. flexneri 6, which is known to produce gas and to ferment mannitol.

Updating the Shigella typing scheme
S. dysenteriae prov. 93-119 (ref. 29 ), prov. SH-103 (ref. 30 ), prov. 204/96 (ref. 31 ), prov. 96-3162 (ref. 32 ), prov. 97-10607 (ref. 33 ), prov. SH-105 (referred to as S. dysenteriae 16 by Melito and coworkers 30 ) belonged to S1a; S. boydii prov. 07-6597 (unpublished) belonged to S1c; S. dysenteriae prov. 96-265 (ref. 20 ), prov. BEDP 02-5104 (ref. 30 ), and prov. E670/74 (ref. 34 ), belonged to S2d; and S. boydii prov. E1621-54 (ref. 35  displaying 100% identity (17,306/17,306), with no gaps relative to the rfb cluster of E. albertii O2 strain SP140152 (GenBank accession code KY574563) 37 . A partial match (91% identity, 7,041/7,744 with three gaps) was also obtained with the rfb of S. dysenteriae 4 (CP026840). This 17⸱3 kb rfb cluster was not the normal rfb cluster (i.e., that located next to the colanic acid biosynthesis gene cluster). This latter consisted only of a remnant of the S. dysenteriae 3 rfb cluster, damaged by IS5 ( Supplementary Fig. 5). The 17⸱3 kb rfb cluster was located on a ~63 kb genomic island integrated close to the yqjH gene ( Supplementary Fig. 6). Based on rfb sequences, 123 of the 133 HC100_44952 genomes in EnteroBase (6 May 2021) belong to this provisional S. dysenteriae serotype, and seven are genuine S. dysenteriae 3. The new serotype of S. dysenteriae described in the UK as E112707/96 was not included in our study 38 . However, an analysis of its representative genome deposited in EnteroBase (SRR4195641) revealed that it had the same 17⸱3 kb rfb cluster as strain 96-204 and also belonged to HC100_44952. Thus, this novel serotype, described by different groups across the world under different names, is actually identical for all strains considered. It has also become relatively frequent. It ranked first and accounted for 30⸱2% (16/ Table 8) and both were agglutinated with an antiserum raised against strain 93-119. They clustered into HC100_35368 and had identical 16⸱3 kb rfb clusters (Fig. 6). This rfb cluster displayed no similarity to those of Shigella genomes, but was similar (identity:16,307/16,323; gap, 1/16,323) to that of E. coli  Fig. 7). The remaining genome was located in another HC100 cluster, among S. dysenteriae 2 genomes. This analysis suggests that this provisional serotype is actually a S. dysenteriae 2 that has acquired an Oantigen-modifying plasmid. In the absence of knowledge about the stability of this plasmid and the possibility of its transfer to S. dysenteriae 2 from other HC100 clusters, it seems wise not to consider S. dysenteriae prov. BEDP 02-5104 to be a new serotype for the time being.