Together with plague, smallpox and typhus, epidemics of dysentery have been a major scourge of human populations for centuries1. A previous genomic study concluded that Shigella dysenteriae type 1 (Sd1), the epidemic dysentery bacillus, emerged and spread worldwide after the First World War, with no clear pattern of transmission2. This is not consistent with the massive cyclic dysentery epidemics reported in Europe during the eighteenth and nineteenth centuries1,3,4 and the first isolation of Sd1 in Japan in 18975. Here, we report a whole-genome analysis of 331 Sd1 isolates from around the world, collected between 1915 and 2011, providing us with unprecedented insight into the historical spread of this pathogen. We show here that Sd1 has existed since at least the eighteenth century and that it swept the globe at the end of the nineteenth century, diversifying into distinct lineages associated with the First World War, Second World War and various conflicts or natural disasters across Africa, Asia and Central America. We also provide a unique historical perspective on the evolution of antibiotic resistance over a 100-year period, beginning decades before the antibiotic era, and identify a prevalent multiple antibiotic-resistant lineage in South Asia that was transmitted in several waves to Africa, where it caused severe outbreaks of disease.
January 2016 marks one hundred years since the invasion force from Britain, Australia, New Zealand and France withdrew from the Dardanelles, in the then Ottoman Empire, only eight months after landing. Most of the more than 120,000 casualties evacuated from the Gallipoli Peninsula were suffering from epidemic bacillary dysentery6, caused by Shigella dysenteriae type 17,8 (Sd1), a bacterium producing the powerful Shiga toxin. This human-adapted clone of Escherichia coli9 was isolated for the first time by Kiyoshi Shiga during a dysentery outbreak in Japan, during which 90,000 cases and 20,000 deaths occurred in the last six months of 1897 alone5. In the second half of the twentieth century, large outbreaks of disease due to Sd1 were still being reported in Central America, with estimates of more than 500,000 cases and 20,000 deaths for the 1969–1973 epidemic10,11, in Africa, where there were an estimated 100,000 cases and 5,000–10,000 deaths in the 1979 epidemic12, and in Asia13,14.
Very little is known about the origins, evolution and spread of this important human pathogen, including, in particular, the strains involved in the major outbreaks and the genetic relationships between them. We carried out a whole-genome sequence analysis on a set of Sd1 isolates selected from more than 35 international strain collections, representing the widest possible temporal and geographic distribution of available isolates, to obtain a phylogenetic framework that was robust over time and space and to infer transmission dynamics. This unique collection included 325 isolates from 66 countries spanning four continents, collected between 1915 and 2011. Sixty-seven historical isolates collected between 1915 and 1960, including 14 isolates obtained during the First World War15,16, were included in the collection, together with several isolates from each major outbreak reported since the 1960s. Short-read sequences from six Sd1 published genomes2 were also included, with S. flexneri, S. boydii, S. sonnei and E. coli genomes used as outgroups.
Single-nucleotide polymorphisms (SNPs) were detected by mapping short-read sequences against Sd1 reference genomes: Sd197 (ref. 17), which was isolated during an outbreak in China in the 1950s, and Sd1617 (ref. 18), which was isolated in Guatemala during the 1968–1969 epidemic. Maximum likelihood (ML) phylogenetic analysis was performed on 14,677 (mapping against Sd197) and 15,752 (mapping against Sd1617) chromosomal SNPs, which were randomly distributed over the non-repetitive non-recombinant core genome (85.6% of the Sd197 chromosome, Supplementary Discussion Section ‘SNP variation in S. dysenteriae type 1’ and Supplementary Fig. 12). Four genetic lineages (Fig. 1a and Supplementary Discussion Section ‘SNP variation in S. dysenteriae type 1’ and Supplementary Figs 9–11) were identified. Lineage I contained only M115, which was isolated from a case in England in 1926. Lineage II contained mostly isolates collected in Europe between 1915 and 1958. Lineage III contained isolates from around the world and could be split into four sublineages with strong geographical affinities: IIIa in eastern and southeastern Asia (with isolates collected between 1927 and 1971), IIIb in Central America (1955–1992), IIIc in West Africa (1954–2006) and IIId in southern Asia and eastern Africa (1956–1977) and then in West Africa (1979–1998). Finally, lineage IV contained most of the Sd1 isolates obtained from the Indian subcontinent and Africa in the last few decades.
Ten of the 14 isolates (71%) amassed by Captain E.G.D. Murray during the First World War belonged to the European lineage, lineage II, and most were isolated at the second Western General Hospital, Manchester, which received many of the soldiers evacuated during the Gallipoli campaign (Supplementary Fig. 1). The other four isolates belonged to three of the four sublineages of the global lineage, lineage III. None of the First World War isolates belonged to sublineage IIId, which gave rise to the modern lineage, lineage IV.
The two candidate vaccine strains developed so far are derived from lineage III parental isolates: IIIb for parental strain Sd1617 of vaccine strain WRSd1 (ref. 19) and IIId for parental strain 7–87 of vaccine strain SC-599 (ref. 20).
ML phylogenetic analysis revealed a strong correlation between root-to-tip branch lengths and the known years of isolation for the sequenced Sd1 isolates, indicating a clock-like evolution (Supplementary Fig. 2). We therefore used a Bayesian phylogenetic approach to provide estimates of the nucleotide substitution rates and divergence times of the different lineages for a spatially and temporally representative subset of 125 isolates (Fig. 2). We estimated the genome-wide substitution rate at 8.7 × 10−7 substitutions site−1 year−1 (95% credible interval (CI) = 7.6 × 10−7 to 9.9 × 10−7), giving a most recent common ancestor (MRCA) for all the Sd1 in our collection dating from 1747 (95% CI, 1645–1822). This finding is consistent with historical data from the eighteenth to mid-nineteenth centuries, which describe cyclic dysentery epidemics in Western and Northern Europe associated with extraordinarily high mortality rates. For example, the 1738–1742 and 1779 epidemics in France killed more than 200,000 people1, the 1770–1775 epidemic in Sweden killed almost 35,000 people (12% of all deaths during the period)3, and a large number of deaths from dysentery were also reported during the Irish Great Famine of 1846–1849 (ref. 4). The MRCA for all isolates other than M115 was dated to the mid-nineteenth century (1853; 95% CI 1831–1871), whereas the MRCAs for each of the sublineages of global lineage III were estimated to have existed between 1889 (95% CI 1881–1897) and 1903 (95% CI 1893–1913), indicating that this lineage spread worldwide over a period of less than two decades. This dating is also consistent with Shiga's observation that the dysentery outbreak of 1897 had begun in the late 1880s in the southern part of Japan21.
Our findings show that the global spread of Sd1 predates the First World War. It therefore occurred earlier than for another Shigella serogroup, S. sonnei, which has been shown to have spread to other continents from Europe during the second half of the twentieth century22. We cannot demonstrate causality between the spread of Sd1 and historical events on the basis of the results presented here, but the late 1800s coincided with a period of intense European emigration, with the colonization of various territories in Africa and Asia by European powers, facilitated by the opening of the Suez Canal in 1869 and the development of steamships.
Geographic and temporal analyses identified several intercontinental transmission events resulting in long-term establishment of the bacterium (Figs 1b,c and 2). Transmission event T1 involved the European lineage II and led to an introduction of Sd1 in Madagascar between 1915 (95% CI 1910–1921) and 1967 (95% CI 1956–1977), during French colonization. This is consistent with the first report, which unambiguously described Sd1 there in 192723. Transmission event T2, involving eastern Asia and Poland, is estimated to have occurred between 1910 (95% CI 1899–1925) and 1944 (95% CI 1942–1945). All other transmission waves originated in the Indian subcontinent and affected mostly East Africa. Two of these transmission waves, T5 and T8, led to major outbreaks; according to our estimates, T5 occurred between 1970 (95% CI 1963–1975) and 1979 (95% CI 1976–1981). This dating is consistent with the first reported outbreak in the northeastern part of what is now the Democratic Republic of the Congo (DRC) in 1979, 28 years after the last isolation of Sd1 in Central Africa12. This epidemic then spread to the Great Lakes region, where it persisted until at least 1990 (ref. 12). T8 occurred between 1984 (95% CI 1978–1987) and 1987 (95% CI 1985–1989), with a first reported outbreak in Zambia in 1990–1991 (refs 12,24). The strain then spread rapidly across an Africa ravaged by civil unrest, war (for example, in Mozambique, Angola, Rwanda and Sierra Leone) and HIV infection12,24 until 2011. With the exception of a localized outbreak in the northern part of the Central African Republic in 200425 caused by sublineage IIIc (see below), all other outbreaks in Africa since 1990 have been caused by lineage IV.
The high resolution of whole-genome sequence analysis (WGS) has significantly changed our understanding of the patterns of Sd1 transmission over time at a global scale. The classical molecular epidemiology tools (Supplementary Discussion Section ‘Correlation of S. dysenteriae type 1 phylogenetic lineages with existing typing and subtyping schemes’) used previously were unable to unravel these patterns of transmission. Furthermore, a re-evaluation of two outbreaks that occurred in the Central African Republic in 2003–2004 (ref. 25), which we had previously investigated by pulsed-field gel electrophoresis (PFGE), the current method of choice for subtyping Sd1, revealed a lack of correlation between PFGE and WGS data (Supplementary Discussion Section ‘Correlation of S. dysenteriae type 1 phylogenetic lineages with existing typing and subtyping schemes’ and Supplementary Fig. 3). In particular, PFGE grouped the isolates from the two outbreaks closely together, but they actually belonged to two different lineages, IIIc and IV, separated by ∼700 SNPs. By contrast, other African T8 lineage IV isolates (differing by 37 to 61 SNPs from the Central African Republic T8 lineage IV outbreak isolates) formed a more distant group. Thus, PFGE cannot attribute profiles from different apparently geographically restricted outbreaks to a single, longer epidemic, such as that associated with the T8 transmission wave in Africa. PFGE should, therefore, no longer be used for the assessment of phylogenetic relationships in Sd1. Instead, WGS provides a robust phylogenetic framework for the epidemiological tracking of this bacterium.
One key feature in the evolution of Sd1 is the acquisition and accumulation of antibiotic resistance genes (ARGs; Figs 3 and 4, Supplementary Discussion Section ‘Pan-genome and antibiotic resistance’ and Supplementary Fig. 4). The first antibiotic-resistant Sd1 isolates in our collection were recovered in Asia and America during the 1960s and rapidly became predominant, such that susceptible isolates had become exceptional by 1991 (100%, [67/67] susceptible isolates, between 1915 and 1960 and <1% [1/123], between 1991 and 2011). Lineage IV, the most recent of the lineages identified, is the most affected by antibiotic resistance, but almost all the contemporary circulating strains from older lineages have also become resistant to multiple antibiotics. ARGs were acquired following the first use of antibiotics in clinical practice (Fig. 4b). The first ARGs identified in Sd1 were borne on small plasmids (<10 kb), encoding resistance to streptomycin and sulfonamides. Larger plasmids (80–130 kb) of different types encoding additional resistance to tetracycline, chloramphenicol and, for some plasmids, ampicillin (via the blaOXA-1 or blaTEM-1 genes) were then acquired in various geographic areas, from the mid-1960s to the 1980s. These plasmids belonged to the IncK and IncF groups in Asia and to the IncB/O group in Central America. The use of cotrimoxazole, beginning in the late 1960s, led to the acquisition of dihydrofolate reductase genes, mostly dfrA1, carried by 110 kb pST186 IncI1 and 30-kb IncX4 plasmids or by the Tn7 transposon inserted into the Sd1 chromosome close to the glmS gene, as observed for S. sonnei22. Since the 1990s, the principal structure associated with multidrug resistance in Sd1 has been a 66 kb genomic element called the Shigella resistance locus pathogenicity island (SRL-PAI)26. This was acquired four times in lineage IV (South Asia or the Middle East), once in sublineage IIIc (West Africa) and once in lineage II (Madagascar). Further evidence for the independent acquisition of the SRL-PAI is provided by the presence of slight differences between the different acquired SRL-PAIs (Supplementary Fig. 5). SRL-A is very similar to the first SRL-PAI to be described in S. flexneri26, and it was found exclusively in lineage IV. SRL-B, found only in lineage IV African T8 isolates, was probably derived from SRL-A by insertion sequence (IS) ISSd1-mediated rearrangements rather than being independently acquired. The other SRL-PAI contained various insertions (group II introns, part of the shf operon, region replacing orf47) not present in SRL-A. Among the 149 isolates bearing the SRL-PAI, only two showed a partial deletion of the SRL-PAI, resulting in a loss of the antibiotic resistance cluster (that is, the SRL sensu stricto). This structure is therefore quite stable over time, particularly in a bacterium containing hundreds of ISs (refs 17,18). This 66 kb element encodes resistance to ampicillin, streptomycin, chloramphenicol and tetracycline, with no more resistance than the previously circulating large plasmids. Its persistence may therefore be associated with a lower fitness cost and the presence of an fec operon for the capture of iron, serving as selective advantages26. Before the principal acquisition of SRL-A, the closest ancestral group (consisting initially of South Asian and then South-East and Central Asian isolates) had acquired a chromosomally encoded transposon (Fig. 2 and Supplementary Fig. 6). This 10 kb structure encodes resistance to chloramphenicol and tetracycline. The structure of the double drug-resistance module is similar to that found in the SRL and to some previously circulating large multidrug resistance IncF plasmids, such as p3099-85 and p80-547. This recent trend towards acquiring ARG-containing genomic islands or chromosomally encoded transposons rather than plasmids is also displayed by the seventh pandemic Vibrio cholerae (SXT/R391) and Salmonella enterica serotype Typhi H58 (24 kb composite transposon) strains, which also originate from the Indian subcontinent27,28.
Resistance to nalidixic acid, a quinolone, mediated by point mutations in the DNA gyrase gene, gyrA, was acquired seven times in lineage IV Sd1 isolates from South Asia and Africa (Fig. 2) from the 1980s. The gyrA mutation leading to a serine-to-leucine substitution in the amino-acid sequence S83L was the most frequently observed, but others, involving codon 87, such as D87G and D87Y, were observed in isolates from Central Africa and Thailand, respectively, during the 1990s. Interestingly, in the same geographic area of DRC and Rwanda in 1994, two different mutations were acquired (S83L and D87G). This may reflect the heavy use of nalidixic acid in the Rwandan refugee camps, which experienced outbreaks of disease caused by V. cholerae O1 and Sd1 (ref. 29).
Resistance to ciprofloxacin, a fluoroquinolone, mediated by a double mutation in gyrA (S83L and a second mutation in codon 87) and a mutation in the topoisomerase IV parC gene (S80I), was acquired only once, in a group of 20 isolates from the Indian subcontinent collected between 1995 and 2010 (Fig. 2). We observed no resistance to extended-spectrum cephalosporins, carbapenems or azithromycin in the isolates studied here, but the existence of such resistance is almost inevitable, as the area of circulation of Sd1 overlaps with that of Enterobacteriaceae possessing mobile ARGs encoding resistance to the latest generation of antibiotics, such as NDM-1 (ref. 30). However, the dramatic decrease in Sd1 isolation reported since the turn of the century, and not explained by the findings of this genomic study, may counterbalance these pessimistic predictions.
The Sd1 isolates analysed in this study (listed in Supplementary Table 1) originated from the collections of the Centers for Disease Control and Prevention, Atlanta, GA, USA (n = 56); Institut Pasteur, Paris, France (n = 53); Public Health England, Colindale, UK (n = 29); Icddr,b, Dhaka, Bangladesh (n = 29); Central Research Institute for Epidemiology, Moscow, Russian Federation (n = 22); National Institute of Public Health, Prague, Czech Republic (n = 19); Public Health England, Porton Down, UK (n = 17); Iris-Lab, Brussels, Belgium (n = 11); National Institute of Cholera and Enteric Diseases, Kolkata, India (n = 8); Institut Pasteur de Bangui, Bangui, Central African Republic (n = 7); Norwegian Institute of Public Health, Oslo, Norway (n = 6); Hungarian National Collection of Medical Bacteria, Budapest, Hungary (n = 6); Pasteur Institute of St Petersburg, St Petersburg, Russian Federation (n = 5); National Institute of Public Health, Warsaw, Poland (n = 5); Institut Pasteur de Dakar, Dakar, Senegal (n = 4); New York University Langone Medical Center, New York, USA (n = 4); Robert Koch Institut, Wernigerode, Germany (n = 4); Institut für Hygiene und Umwelt, Hamburg, Germany (n = 3); Bégin Military Hospital, Saint-Mandé, France (n = 3); IAME, Paris, France (n = 3); Swedish Institute for Communicable Disease Control, Solna, Sweden (n = 3); Walter Reed Army Institute of Research, Silver Spring, MA, USA (n = 3); Epicentre, Maradi, Niger (n = 2); Polish Collection of Microorganisms, Wroclaw, Poland (n = 2); Ministry of Health, Jerusalem, Israel (n = 2); Centers for Disease Control, Taichung, Taiwan (n = 2); Centre Pasteur du Cameroun, Yaoundé, Cameroon (n = 2); National Institute of Infectious Diseases, Tokyo, Japan (n = 1); Public Health Agency of Canada, Winnipeg, Canada (n = 1); Istituto Pasteur-Fondazione Cenci Bolognetti, Rome, Italy (n = 1); Félix d'Hérelle reference center for bacterial viruses, Université Laval, Québec, Canada (n = 1); National Institute for Communicable Disease Control and Prevention, Beijing, China (n = 1).
Bacterial DNA samples were also received from the Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand (n = 10).
The 18 Sd1 isolates from the E.G.D. Murray collection15,16,31 included 14 isolates recovered during the First World War and four isolates obtained between 1926 and 1930. The First World War isolates were obtained from different sources (Supplementary Fig. 1) and were stored at room temperature in Douglas digest agar slant glass tubes after sealing with a gas-air burner between August 1918 and October 1919. In 1980, the 18 tubes and the 680 other cultures of Enterobacteriaceae from the entire collection were shipped to the National Collection of Type Cultures (NCTC), Porton Down, UK, opened and freeze-dried.
It was confirmed that all the isolates included belonged to Sd1, by conventional methods and serotyping at the French National Reference Center for E. coli, Shigella and Salmonella, Institut Pasteur, Paris, as described previously32.
Antibiotic susceptibility testing
Antibiotic susceptibility was determined by disk diffusion on Mueller-Hinton (MH) agar in accordance with the guidelines of the Antibiogram Committee of the French Society for Microbiology (CA-SFM 2014) (www.sfm-microbiologie.org/). The following antimicrobial drugs (Bio-Rad) were tested: amoxicillin, ceftriaxone, ceftazidime, streptomycin, kanamycin, amikacin, gentamicin, nalidixic acid, ofloxacin, ciprofloxacin, sulfonamides, trimethoprim, sulfamethoxazole-trimethoprim, chloramphenicol, tetracycline and azithromycin. E. coli CIP 76.24 (ATCC 25922) was used as a control. For strains displaying resistance to either nalidixic acid or ciprofloxacin by the disk diffusion method, this resistance was confirmed by determination of the minimal inhibitory concentration (MIC) with the corresponding Etest strips (bioMérieux). The MICs of azithromycin and nitrofurantoin were determined by Etests for 30 isolates chosen on the basis of resistance phenotype and year and country of isolation.
Determination of the mutator phenotype of strain M115
The mutation rate of M115 was estimated by monitoring the capacity of this strain to generate mutations conferring resistance to rifampin in two independent experiments including duplicates, as previously described33. E. coli strain ECOR48 (CIP 106023) was used as a strong mutator positive control34, the Sd1 97–13397 isolate was used as a putative strong mutator isolate (deletion of the mutS gene), and Sd1 M116 and Sd197 were used as putative normomutator isolates (integrity of the mutS, mutH, mutL and uvrD methyl-directed mismatch repair genes).
Total DNA extraction
Total DNA was extracted with the InstaGene matrix kit (Bio-Rad) for the PCR identification of antibiotic resistance genes, the Wizard Genomic DNA Kit (Promega) for multilocus sequence typing and Illumina sequencing and the phenol chloroform method35 for Illumina sequencing and PacBio sequencing.
Multi-locus sequence typing
Conventional multi-locus sequence typing (MLST) was performed on a subset of 33 Sd1 isolates, as previously described36. Sequencing was performed at the Plateforme de Génotypage des Pathogènes et Santé Publique, PF8 (Institut Pasteur). The nucleotide sequences and deduced protein sequences were analysed with EditSeq and Megalign software (DNASTAR). The BLASTN program of NCBI was used for database searches (http://www.ncbi.nlm.nih.gov/BLAST/).
PCR identification of antibiotic resistance genes
The blaTEM, blaSHV, blaOXA-1,cat1, sul1, dfrA1 and aadA1 resistance genes and the class 1 and 2 integron gene cassettes were amplified by PCR, as described previously37.
The presence of the Shigella resistance locus pathogenicity island (SRL-PAI) was assessed by PCR, as previously described38. The structure of the SRL-PAI was assessed by PCR mapping with the primers described or with new primers designed on the basis of GenBank accession no. AF326777. Amplicons not of the expected size were sequenced.
Plasmids were obtained from E. coli transconjugants or transformants, as previously described37, except that ampicillin (50 mg l–1) or chloramphenicol (20 mg l–1) was used as a selective agent.
Plasmid size was determined in parental and transconjugant or transformants strains by S1 nuclease treatment and pulsed-field gel electrophoresis, as previously described37. PCR-based replicon-typing analysis was performed as previously described39.
Eight 30–130 kb plasmids conferring antimicrobial resistance were sequenced. Plasmid DNA was extracted with the Large-Construct Kit (Qiagen) and sequenced through services provided by GATC Biotech, using shotgun sequencing runs on a 454/Roche GS FLX Analyzer (Roche). The resulting sequences were assembled into a unique scaffold. Gap closure was carried out by PCR followed by Sanger DNA sequencing with the Big Dye Terminator V3.1 Cycle Sequencing Kit (Applied Biosystems) and a 96-capillary 3730xl DNA Analyzer (Applied Biosystems), by Eurofins MGW Operon (Cochin Platform). Automatic annotation was performed with the RAST40 server (http://rast.nmpdr.org/), followed by manual inspection and correction. The sequences obtained have been deposited in GenBank under accession numbers KT754160 (p80-547), KT754161 (pCAR10), KT754162 (pBU53M1), KT754163 (pA5468), KT754164 (p3099-85), KT754165 (p93-531-1), KT754166 (p92-9000), KT754167 (p69-3818).
High-throughput genome sequencing was carried out at the genomics platform of the Pasteur Institute, GATC Biotech, Beckman Coulter Genomics or at the Wellcome Trust Sanger Institute, on Illumina platforms generating 100 to 146 bp paired-end reads, yielding a mean of 206-fold coverage (minimum 37-fold, maximum 990-fold) (Supplementary Table 2). Short-read sequence data were submitted to the European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena), and the genome accession numbers are provided in Supplementary Table 1.
We optimized the resolution of the chromosome-encoded antibiotic resistance structures and ensured that representative isolates from the various lineages were included, by sequencing 10 isolates on the PacBIO RS II platform (Pacific Biosciences), as previously described28. The PacBio data were submitted to the ENA and the genome accession numbers are provided in Supplementary Table 1.
Other studied genomes
Short-read sequences from the following six Sd1 genomes published by Rohmer et al.2 were downloaded from the ENA and included in this study: 2735 (USA, 1974, SRR765065), 91R17 (Guatemala, 1991, SRR765098), 91R14 (Guatemala, 1991, SRR765104), DH03 (Central African Republic, 1996, SRR765110), DH05 (Central African Republic, 1996, SRR765112) and DH06 (Central African Republic, 1996, SRR765113).
The following genomes were used as outgroups: E. coli O157:H7 strain Sakai (GenBank accession no. NC_002695), E. coli strain K-12 MG1655 (GenBank accession no. NC_000913), S. flexneri type 2a strain 2457T (GenBank accession no. AE014073), S. boydii strain Sb227 (GenBank accession no. NC_007613) and S. sonnei strain Ss046 (GenBank accession no. NC_007384).
Read alignment and SNP detection
For the analysis of SNPs, Illumina-generated paired-end reads and the simulated paired-end reads from publicly available assembled genomes were mapped to the reference genome of Sd1 strain Sd197, including the chromosome (CP000034) and plasmids pSD1_197 (CP000035) and pSD197_spA (CP000640), with SMALT (version 0.7.4; http://www.sanger.ac.uk/resources/software/smalt/) as previously described28.
De novo assembly
The reads for each strain were assembled de novo with Velvet41 version 1.2.09, with parameters optimized with VelvetOptimiser version 2.2.5 (https://github.com/tseemann/VelvetOptimiser). They were scaffolded with SSPACE42 version v2.0. The gaps were closed with GapFiller43 version 1.11 and the sequences were annotated with Prokka44 version 1.5, as previously described28. CLC Assembly Cell version 4.2.0 (CLC bio) was also used to investigate antibiotic resistance determinants.
The ML phylogenetic tree shown in Supplementary Fig. 7 was built from a 140,385-chromosomal SNP alignment generated by snp_sites software (https://github.com/sanger-pathogens/snp_sites) from all 331 short-read sequences, plus Sd1 genomes Sd197 (used as a reference) and Sd1617, together with the six E. coli and Shigella sp. genomes used as outgroups. RAxML45 version 7.8.6 was used with the generalized time-reversible model and a Gamma distribution to model site-specific rate variation (the GTR+⌈ substitution model; GTRGAMMA in RAxML). Support for the ML phylogeny was assessed by 100 bootstrap pseudo-analyses of the alignment data and the final tree was visualized in FigTree version 1.4.2 (http://tree.bio.ed.ac.uk/software/figtree/).
The ML phylogenetic trees shown in Figs 1a and 3a,c and Supplementary Figs 1a, 3b, 4, 9, 11 and 14 were built from a 14,677-chromosomal SNP alignment of all 331 Sd1 short-read sequences, plus Sd1 genome Sd197, used as the reference. Repetitive regions (within the chromosome, between the chromosome and the virulence plasmid (VP) or the SRL-PAI) were removed manually with the Artemis46 genome browser. Recombinogenic regions were also removed with the Gubbins47 software. The remaining 14,677 chromosomal SNPs were randomly distributed along the non-repetitive non-recombinant core genome (3,750,125 bp), with a spacing of about one SNP per 256 bp or a nucleotide divergence of 0.39% (Supplementary Fig. 12). RAxML version 7.8.6 (GTRGAMMA substitution model) was used to construct the tree. We performed 500 bootstrap pseudoreplicate analyses to assess support for the ML phylogeny. The tree was rooted on M115, which was shown to be the most closely related to the ancestral strain of Sd1 by two different approaches (ML and Bayesian) and was visualized with MEGA48 version 6, iTOL49,50 or FigTree version 1.4.2.
The ML phylogenetic trees shown in Supplementary Figs 10 and 11 were built from a 15,752-chromosomal SNP alignment of all 331 Sd1 short-read sequences, plus Sd1 genome Sd1617, used as the reference. The method used was similar to that described above, except that the repetitive regions were not removed manually and phylogenetic support was assessed by 100 bootstrap pseudo-analyses.
The VP phylogenetic tree shown in Supplementary Fig. 15 was constructed similarly, from the 226 plasmid-containing isolates (>90% coverage at read depth >10×), based on 290 SNPs randomly distributed along the non-repetitive non-recombinant pSD1_197 sequence (99,704 bp, 54.6% of pSD1_197). The tree was unrooted.
We clustered the isolates of Sd1 into various lineages by eye and by applying hierarchical Bayesian analysis of population structure (BAPS)51 software to the 14,677-chromosomal SNP alignment. Five iterations (L value) were run with a maximum cluster number (K value) of 6 or 10 and three iterations were run with K = 6.
We investigated the temporal signal in the ML phylogeny for Sd1, using Path-O-Gen (http://tree.bio.ed.ac.uk/software/pathogen/). The relationships between root-to-tip distances, year of isolation and lineage were analysed by linear regression methods.
We used Bayesian Evolutionary Analysis by Sampling Trees (BEAST)52 version 1.8 to date the important nodes. The analyses were conducted on a subsample of 125 isolates from across the ML tree, covering the full temporal and geographic range of this pathogen. The concatenated 10,798 chromosomal SNP alignments of these 125 strains were subjected to multiple BEAST analyses with both constant-size and Bayesian skyline population size change models, in combination with either a strict molecular clock or a relaxed clock, to identify the best-fit model22,53. For the BEAST analysis, the GTR+⌈ substitution model was selected and tip dates were defined as the year of isolation. For all model combinations, three independent chains of 100 million generations each were run to ensure convergence, with sampling every 1,000 iterations. Convergence and effective sample size (ESS) values were inspected using Tracer52 version 1.5. A marginal likelihood estimation was carried out, with path sampling and stepping stone sampling for each run that had converged, to compare the different combinations of clock and tree models54,55. The marginal likelihood estimation was then used to determine which model gave the best fit, by calculating the Bayes factor. The relaxed, uncorrelated lognormal clock model, which allows evolutionary rates to vary among the branches of the tree, together with the skyline demographic model, proved a much better fit for the data, as found previously for S. sonnei22 and S. flexneri53. The parameter and tree estimates of the three runs were combined with LogCombiner52 version 1.7.5, with the first 20% of states in each chain removed as burn-in. Maximum clade credibility (MCC) trees were generated with TreeAnnotator52 version 1.7.5 on the combined files and visualized with FigTree version 1.4.2. Estimates are reported as median values with the 95% highest posterior density (HPD, hereafter referred to as the credible interval). The Bayesian skyline plot was calculated and visualized with Tracer52 version 1.5, to investigate changes in the effective population size of Sd1 over time. To confirm the dating estimates, ten other random subsamples were generated from clusters calculated using the Prosperi method56 (code at http://figshare.com/articles/clustertree.R_Code_for_clustering_phylogenetic_trees/97225) with a threshold of 0.03. All singleton isolates were included (n = 86), and one isolate from each of the 33 clusters was randomly selected to generate the ten subsamples. These alignments were analysed in BEAST using the same model and showed similar dating for each of the lineages (Supplementary Table 2).
In silico MLST was then carried out by MLST version 1.8 (https://cge.cbs.dtu.dk/services/MLST/) on assembled sequences for all the data set. New alleles were confirmed by Sanger sequencing and submitted to the MLST database website (http://mlst.warwick.ac.uk/mlst/).
The presence and type of ARG or ARG-containing structures (Fig. 3b and Supplementary Fig. 4) were determined with ResFinder57 version 2.1 (https://cge.cbs.dtu.dk/services/ResFinder/), BLAST analysis against defined reference sequences (plasmids or chromosomally encoded structures), PlasmidFinder58 version 1.3 (https://cge.cbs.dtu.dk/services/PlasmidFinder/), Plasmid MLST locus/sequence definitions database (http://pubmlst.org/plasmid/) and pMLST version 1.2 (https://cge.cbs.dtu.dk/services/pMLST/) on CLC or Velvet assemblies. The new alleles and STs of IncI59 and IncN60 plasmids have been deposited in the PubMLST database (http://pubmlst.org/plasmid/). The presence of mutations in the quinolone resistance determining region of the DNA gyrase and topoisomerase IV genes was determined manually on de novo assembled sequences. PacBio sequences were used to analyse the structure of the SRL-PAI variants and the composite transposon inserted into the chromosome in genome CDC 87-3330. The in silico results were compared with PCR data, when available.
Roary61 version 3.2.4 was used on Velvet-annotated assemblies, to construct a pan-genome. The pan-genome analysis identified genome 27352 as an outlier. Further investigation revealed an extreme AT bias, so this sample was excluded from subsequent analyses. A more sensitive annotation was performed on the resulting clusters of proteins with InterPro62, to provide Gene Ontology63 classifications for each gene.
Short-read sequences have been deposited at EBI-ENA, under study accession numbers PRJEB10304, PRJEB2846 and PRJEB3255. PacBio sequences have been deposited at EBI-ENA, under study accession number PRJEB7928. Plasmid, SRL-PAI and Tn87-3330 sequences have been deposited in GenBank, under accession numbers KT754160–KT754167, KT777637–KT777641 and KT777642, respectively.
Kohn, G. C. (ed.) Encyclopedia of Plague and Pestilence (Facts on File New York, 1995).
Rohmer, L. et al. Genomic analysis of the emergence of 20th century epidemic dysentery. BMC Genomics 15, 355 (2014).
Castenbrandt, H. K. A forgotten plague: dysentery in Sweden, 1750–1900. Scand. J. Hist. 39, 612–639 (2014).
Creighton, C. A History of Epidemics in Britain. Volume II: from the Extinction of Plague to the Present Time (Cambridge Univ. Press, 1894).
Shiga, K. Ueber den erreger der dysenterie in Japan. Vorläufige mitteilung. Zentralbl. Bakteriol. Microbiol. Hyg. 23, 599–600 (1898).
Manson-Bahr, P. H. Dysentery and diarrhoea in wartime. Br. Med. J. 2, 346–348 (1942).
Ledingham, J. C. & Penfold, W. J. Recent bacteriological experiences with typhoidal disease and dysentery. Br. Med. J. 2, 704–711 (1915).
Tribondeau, L. & Fichet, M. Note sur les dysenteries des Dardanelles. Ann. Inst. Pasteur (Paris) 30, 357–362 (1916).
Pupo, G. M., Lan, R. & Reeves, P. R. Multiple independent origins of Shigella clones of Escherichia coli and convergent evolution of many of their characteristics. Proc. Natl Acad. Sci. USA 97, 10567–10572 (2000).
Mata, L. J., Gangarosa, E. J., Cáceres, A., Perera, D. R. & Mejicanos, M. L. Epidemic Shiga bacillus dysentery in Central America. I. Etiologic investigations in Guatemala, 1969. J. Infect. Dis. 122, 170–180 (1970).
Parsonnet, J. et al. Shigella dysenteriae type 1 infections in US travellers to Mexico, 1988. Lancet 2, 543–545 (1989).
Cobra, C. & Sack, D. A. The Control of Epidemic Dysentery in Africa: Overview, Recommendations, and Checklists (SD Publication Series, Technical Paper No. 27 (USAID, Bureau for Africa, Office of Sustainable Development, 1996); http://pdf.usaid.gov/pdf_docs/pnaby890.pdf
Rahaman, M. M., Khan, M. M., Aziz, K. M., Islam, M. S. & Kibriya, A. K. An outbreak of dysentery caused by Shigella dysenteriae type 1 on a coral island in the Bay of Bengal. J. Infect. Dis. 132, 15–19 (1975).
Taylor, D. N. et al. Introduction and spread of multi-resistant Shigella dysenteriae 1 in Thailand. Am. J. Trop. Med. Hyg. 40, 77–85 (1989).
Murray, G. R. E. More on bacterial longevity: the Murray collection. ASM News 51, 261–262 (1985).
Baker, K. S. et al. The Murray collection of pre-antibiotic era Enterobacteriaceae: a unique research resource. Genome Med. 7, 97 (2015).
Yang, F. et al. Genome dynamics and diversity of Shigella species, the etiologic agents of bacillary dysentery. Nucleic Acids Res. 33, 6445–6458 (2005).
Vongsawan, A. A. et al. The genome of Shigella dysenteriae strain Sd1617 comparison to representative strains in evaluating pathogenesis. FEMS Microbiol. Lett. 362, pii: fnv011 (2015).
McKenzie, R. et al. Safety and immunogenicity of WRSd1, a live attenuated Shigella dysenteriae type 1 vaccine candidate. Vaccine 26, 3291–3296 (2008).
Launay, O. et al. Safety and immunogenicity of SC599, an oral live attenuated Shigella dysenteriae type-1 vaccine in healthy volunteers: results of a phase 2, randomized, double-blind placebo-controlled trial. Vaccine 27, 1184–1191 (2009).
Shiga, K. Observations on the epidemiology of dysentery in Japan. Philipp. J. Sci. 1, 485–500 (1906).
Holt, K. E. et al. Shigella sonnei genome sequencing and phylogenetic analysis indicate recent global dissemination from Europe. Nature Genet. 44, 1056–1059 (2012).
Robic, J. Une épidémie de dysenterie bacillaire à Madagascar (1927–1928). Bull. Soc. Path. Exot. 21, 709–713 (1928).
Guerin, P. J., Grais, R. F., Rottingen, J. A. & Valleron, A. J. & Shigella Study Group. Using European travellers as an early alert to detect emerging pathogens in countries with limited laboratory resources. BMC Public Health 7, 8 (2007).
Bercion, R. et al. Molecular epidemiology of multidrug-resistant Shigella dysenteriae type 1 causing dysentery outbreaks in Central African Republic, 2003–2004. Trans R. Soc. Trop. Med. Hyg. 100, 1151–1158 (2006).
Luck, S. N., Turner, S. A., Rajakumar, K., Sakellaris, H. & Adler, B. Ferric dicitrate transport system (Fec) of Shigella flexneri 2a YSH6000 is encoded on a novel pathogenicity island carrying multiple antibiotic resistance genes. Infect. Immun. 69, 6012–6021 (2001).
Mutreja, A. et al. Evidence for several waves of global transmission in the seventh cholera pandemic. Nature 477, 462–465 (2011).
Wong, V. K. et al. Phylogeographical analysis of the dominant multidrug-resistant H58 clade of Salmonella Typhi identifies inter- and intracontinental transmission events. Nature Genet. 47, 632–639 (2015).
Islam, M. S. et al. Microbiological investigation of diarrhoea epidemics among Rwandan refugees in Zaire. Trans. R. Soc. Trop. Med. Hyg. 89, 506 (1995).
Kumarasamy, K. K. et al. Emergence of a new antibiotic resistance mechanism in India, Pakistan, and the UK: a molecular, biological, and epidemiological study. Lancet Infect. Dis. 10, 597–602 (2010).
Murray, E. G. D. An attempt at classification of Bacillus dysenteriae, based upon an examination of the agglutinating properties of fifty-three strains. J. R. Army Med. Corps 31, 257–271 (1918).
Langendorf, C. et al. Enteric bacterial pathogens in children with diarrhea in Niger: diversity and antimicrobial resistance. PLoS ONE 10, e0120275 (2015).
Taddei, F., Matic, I. & Radman, M. cAMP-dependent SOS induction and mutagenesis in resting bacterial populations. Proc. Natl Acad. Sci. USA 92, 11736–11740 (1995).
Picard, B. et al. Mutator natural Escherichia coli isolates have an unusual virulence phenotype. Infect. Immun. 69, 9–14 (2001).
Grimont, F. & Grimont, P. A. D. Determination of rDNA gene restriction patterns. Methods Mol. Biol. 46, 181–200 (1995).
Wirth, T. et al. Sex and virulence in Escherichia coli: an evolutionary perspective. Mol. Microbiol. 60, 1136–1151 (2006).
Fabre, L. et al. Chromosomal integration of the extended-spectrum beta-lactamase gene blaCTX-M-15 in Salmonella enterica serotype Concord isolates from internationally adopted children. Antimicrob. Agents Chemother. 53, 1808–1816 (2009).
Turner, S. A., Luck, S. N., Sakellaris, H., Rajakumar, K. & Adler, B. Molecular epidemiology of the SRL pathogenicity island. Antimicrob. Agents Chemother. 47, 727–734 (2003).
Carattoli, A. et al. Identification of plasmids by PCR-based replicon typing. J. Microbiol. Methods 63, 219–228 (2005).
Aziz, R. K. et al. The RAST server: rapid annotation using subsystems technology. BMC Genomics 9, 75 (2008).
Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D. & Pirovano, W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27, 578–579 (2011).
Boetzer, M. & Pirovano, W. Toward almost closed genomes with GapFiller. Genome Biol. 13, R56 (2012).
Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
Stamatakis, A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690 (2006).
Carver, T. et al. Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database. Bioinformatics 24, 2672–2676 (2008).
Croucher, N. J. et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res. 43, e15 (2015).
Tamura, K., Stecher, G., Peterson, D., Filipski, A. & Kumar, S. MEGA6: molecular evolutionary genetics analysis version 6.0. Mol. Biol. Evol. 30, 2725–2729 (2013).
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics 23, 127–128 (2007).
Letunic, I. & Bork, P. Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy. Nucleic Acids Res. 39, W475–W478 (2011).
Cheng, L., Connor, T. R., Siren, J., Aanensen, D. M. & Corander, J. Hierarchical and spatially explicit clustering of DNA sequences with BAPS software. Mol. Biol. Evol. 30, 1224–1228 (2013).
Drummond, A. J. & Rambaut, A. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7, 214 (2007).
Connor, T. R. et al. Species-wide whole genome sequencing reveals historical global spread and recent local persistence in Shigella flexneri. eLife 4, e07335 (2015).
Baele, G. et al. Improving the accuracy of demographic and molecular clock model comparison while accommodating phylogenetic uncertainty. Mol. Biol. Evol. 29, 2157–2167 (2012).
Baele, G., Li, W. L., Drummond, A. J., Suchard, M. A. & Lemey, P. Accurate model selection of relaxed molecular clocks in Bayesian phylogenetics. Mol. Biol. Evol. 30, 239–243 (2013).
Prosperi, M. C. et al. A novel methodology for large-scale phylogeny partition. Nature Commun. 2, 321 (2011).
Zankari, E. et al. Identification of acquired antimicrobial resistance genes. J. Antimicrob. Chemother. 67, 2640–2644 (2012).
Carattoli, A. et al. In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrob. Agents Chemother. 58, 3895–3903 (2014).
García-Fernández, A. et al. Multilocus sequence typing of IncI1 plasmids carrying extended-spectrum beta-lactamases in Escherichia coli and Salmonella of human and animal origin. J. Antimicrob. Chemother. 61, 1229–1233 (2008).
García-Fernández, A. et al. Multilocus sequence typing of IncN plasmids. J. Antimicrob. Chemother. 66, 1987–1991 (2011).
Page, A. J. et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31, 3691–3693 (2015).
Mitchell, A. et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 43, D213–D221 (2015).
Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Res. 43, D1049–D1056 (2015).
This study was supported by the Institut Pasteur and the Institut Pasteur International Network, the Institut de Veille Sanitaire, the French government's Investissement d'Avenir programme, Laboratoire d'Excellence ‘Integrative Biology of Emerging Infectious Diseases’ (grant no. ANR-10-LABX-62-IBEID), the Fondation ‘Le Roch-Les Mousquetaires’, the Canetti family through the Georges, Jacques et Elias Canetti Award 2013, the Wellcome Trust through grant 098051 to the Sanger Institute, the NHMRC of Australia (grant 1061409 to K.E.H), the Victorian Life Sciences Computation Initiative (VLSCI) (grant VR0082) and the Indian Council of Medical Research, New Delhi, India. The authors thank A. Dautry-Versat, A.P. Pugsley, C. Bréchot and J. Savall for their support; T. Hieu, C. Soto, E. Bourreterre and B. Faye for technical assistance; Z. Szabó, D. Tremblay for providing isolates; L.R. Hiltzik, N. Baldwin and C. Mackenzie for their searches of the archives; M. Toucas, H. d'Hauteville, E. Aldová, S. Formal, and A.T. Maurelli for information about isolates; D. Nedelec for discussion, I. Gut, M. Gut, L. Ma, D. Harris, K. Oliver and the sequencing teams at the Institut Pasteur and Wellcome Trust Sanger Institute for sequencing the samples. The views expressed in this publication are those of the authors and do not reflect the views of the US Department of the Army or Department of Defense. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The authors declare no competing financial interests.
Supplementary Figures 1-16, Table 3, Discussion and References (PDF 4476 kb)
Details of Shigella dysenteriae type 1 isolates and genomes used in this study. (XLSX 181 kb)
Whole-genome sequences, SNPs and phylogenetic data used in this study. (XLSX 15396 kb)
About this article
Cite this article
Njamkepo, E., Fawal, N., Tran-Dien, A. et al. Global phylogeography and evolutionary history of Shigella dysenteriae type 1. Nat Microbiol 1, 16027 (2016). https://doi.org/10.1038/nmicrobiol.2016.27
This article is cited by
Population structure analysis and laboratory monitoring of Shigella by core-genome multilocus sequence typing
Nature Communications (2022)
Pathogenomic analyses of Shigella isolates inform factors limiting shigellosis prevention and control across LMICs
Nature Microbiology (2022)
Antibiotic susceptibility and genetic relatedness of Shigella species isolated from food and human stool samples in Qazvin, Iran
BMC Research Notes (2021)
Nature Communications (2021)
Evolutionary histories and antimicrobial resistance in Shigella flexneri and Shigella sonnei in Southeast Asia
Communications Biology (2021)