Genomics of the Argentinian cholera epidemic elucidate the contrasting dynamics of epidemic and endemic Vibrio cholerae

In order to control and eradicate epidemic cholera, we need to understand how epidemics begin, how they spread, and how they decline and eventually end. This requires extensive sampling of epidemic disease over time, alongside the background of endemic disease that may exist concurrently with the epidemic. The unique circumstances surrounding the Argentinian cholera epidemic of 1992–1998 presented an opportunity to do this. Here, we use 490 Argentinian V. cholerae genome sequences to characterise the variation within, and between, epidemic and endemic V. cholerae. We show that, during the 1992–1998 cholera epidemic, the invariant epidemic clone co-existed alongside highly diverse members of the Vibrio cholerae species in Argentina, and we contrast the clonality of epidemic V. cholerae with the background diversity of local endemic bacteria. Our findings refine and add nuance to our genomic definitions of epidemic and endemic cholera, and are of direct relevance to controlling current and future cholera epidemics.

Complete lists of accession numbers for sequences produced in this study, as well as previously published genome sequences used for phylogenetic analysis (total number = 1,318, non-redundant), are provided in Supplementary Data 1-3.

Software and code
The original data which underpin Figure 1
This study used archived bacterial cultures from the collection of Vibrio cholerae held at INEI-ANLIS "Dr. Carlos G. Malbrán", the Argentinian national reference laboratory for enterobacteria. In this study, we sequenced the genomes of as many V. cholerae as was practical, producing a total of 490 V. cholerae genome sequences, together with associated metadata. The study was designed to examine the evolution of a bacterial pathogen using isolates that were collected as a by-product of national cholera surveillance in Argentina. Please note that this study used the genomes of archived bacteria and summary statistics related to the receipt of these bacteria at INEI. At no point were any patient data or other identifiable data used in, or made available for, this study. This has been stated explicitly in the 'Ethics' section of the manuscript.
We used a set of 490 Vibrio cholerae from the INEI-ANLIS bacterial culture archive, together with phenotypic metadata and the region and date of origin pertaining to each isolate. These data were contextualised with previously published, publicly available genome sequences for other V. cholerae.
These samples were chosen principally to capture the diversity of both O1 and non-O1 V. cholerae at the beginning (1992–1993) and the end (1996–1997) of the Argentinian cholera epidemic (1992–1998). The sequenced isolates were a spatiotemporally broad cross-section of cholera incidence; they were obtained from all regions of Argentina that experienced cholera cases. They were also chosen to capture apparent shifts between the Inaba and Ogawa serotypes.
The datasets used in this study were either generated by whole-genome sequencing of bacterial isolates, or comprised previously published genome sequences that are available in publicly accessible databases. Accession numbers for all publicly available sequence data are listed in the manuscript, in this Reporting Summary, and in the Supplementary Data.
No sample size calculations were performed. We sequenced as many viable bacterial isolates as was practical. Limitations were imposed by available resources, and by the fact that not all of the archived cultures were viable.
The metadata associated with each bacterial isolate were recorded by INEI staff at the time of receiving the isolate at the reference laboratory. These included the results of serotyping, PCR, biochemical, and other phenotypic tests as listed in the manuscript (Methods) and Supplementary Data 1-3. The dates and places of isolation were recorded by the laboratories submitting cultures to INEI for verification. The collection of these metadata occurred between 1992 and the early 2000s, at and around the time of the isolate's receipt at the reference laboratory. These metadata were aggregated for this study.
The new genome sequences used in the study were generated by the core sequencing teams at the Wellcome Sanger Institute. Previously-published genome sequences were aggregated from the publications listed in Supplementary Data 1-3: these data were downloaded from the European Nucleotide Archive using the accession numbers listed in Supplementary Data 1-3.
Isolates were chosen from the INEI-ANLIS collection to cover the beginning and end of the 1990s cholera epidemic in Argentina. Isolates recorded as serogroup O1 and isolates recorded as non-O1 were both included, with the aim of studying the background of non-epidemic (non-O1) V. cholerae that co-existed alongside the epidemic (O1) lineage in the country.
The earliest isolate was obtained on 10th February 1992 and the latest isolate was obtained on 18th August 2005. Six isolates were obtained outside of the epidemic period, the earliest of which was obtained on 3rd February 2000 and the latest of which on 18th August 2005 (see Supplementary Data 3). The latest isolate from within the epidemic period was obtained on 23rd February 1998. Isolates were sent to INEI-ANLIS from suspected cases of cholera; accordingly, there was no periodicity or specific frequency of sampling other than that imposed by the natural progression of cholera epidemics in Argentina (see Figure 1). Isolates were included from all geographic regions that suffered from cholera during the epidemic period. Sampling was concentrated in the north of Argentina (see Figure 1) but spanned an area of approximately 1.2 million square kilometres.
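As an illustration of the timing described above, the following minimal sketch separates isolation dates into those inside and outside the epidemic period. The window boundaries (1st January 1992 to 31st December 1998) are an assumption made here for illustration; the source gives the epidemic only as 1992–1998. The function name `in_epidemic_period` is hypothetical and not part of the study's analysis.

```python
from datetime import date

# Assumed window: the source states only "1992-1998", so calendar-year
# boundaries are used here purely for illustration.
EPIDEMIC_START = date(1992, 1, 1)
EPIDEMIC_END = date(1998, 12, 31)


def in_epidemic_period(isolation_date: date) -> bool:
    """Return True if an isolation date falls inside the assumed epidemic window."""
    return EPIDEMIC_START <= isolation_date <= EPIDEMIC_END


# Dates quoted in the text above.
earliest_isolate = date(1992, 2, 10)
latest_epidemic_isolate = date(1998, 2, 23)
earliest_post_epidemic = date(2000, 2, 3)
latest_isolate = date(2005, 8, 18)

for d in (earliest_isolate, latest_epidemic_isolate,
          earliest_post_epidemic, latest_isolate):
    label = "epidemic" if in_epidemic_period(d) else "outside epidemic"
    print(d.isoformat(), label)
```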
The exclusion of data is explicitly described in the 'Methods' section of the manuscript: "A total of 21 sequenced isolates contained substantial amounts of contaminating sequences from non-Vibrio species, and were excluded from this study, for a total of 490 sequences used in this analysis. Contamination was assessed using Kraken v0.10.6, by examining the overall length of the SPAdes assembly (data were summarised using assembly-stats v1.0.1 (https://github.com/sanger-pathogens/assembly-stats) and assemblies greater than 5 Mbp in length were excluded) and by inspection of initial phylogenetic trees." The contaminated genome sequences have not been published alongside this manuscript.
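The assembly-length criterion quoted above can be sketched as follows. This is a minimal illustration only: the study summarised assembly lengths with assembly-stats, whereas the functions below (`assembly_length`, `is_oversized`) are hypothetical stand-ins that count bases directly from FASTA-formatted text. The Kraken classification and the inspection of phylogenetic trees are not reproduced here.

```python
# Threshold taken from the quoted Methods text: assemblies greater than
# 5 Mbp in length were excluded.
MAX_ASSEMBLY_LENGTH_BP = 5_000_000


def assembly_length(fasta_text: str) -> int:
    """Sum the number of bases across all contigs in FASTA-formatted text."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if not line.startswith(">")  # skip contig header lines
    )


def is_oversized(fasta_text: str, max_length: int = MAX_ASSEMBLY_LENGTH_BP) -> bool:
    """Return True if the assembly is longer than the exclusion threshold."""
    return assembly_length(fasta_text) > max_length


# Toy example with two short contigs (12 bases in total).
toy_assembly = ">contig1\nACGTAC\nGT\n>contig2\nTTGA\n"
print(assembly_length(toy_assembly))
print(is_oversized(toy_assembly))
```

An assembly substantially longer than expected for the species suggests that sequence from a second organism has been assembled alongside it, which is why length serves as a simple contamination check.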
Replication of sequencing was not performed; DNA from every bacterial isolate was submitted for sequencing once. However, with next-generation sequencing technology, sequencing an isolate once means that each genome is read many times. The target sequencing depth per isolate was 30×, i.e., the sequencing data obtained for each isolate should cover the genome of that isolate at least 30 times. Consensus sequences for each isolate were used for all genome assemblies and SNV identification (see Methods).
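The relationship between sequencing depth, read count, and genome size can be sketched as below. This is an illustrative calculation under stated assumptions, not the sequencing pipeline used in the study: the genome size of 4,000,000 bp is an approximate figure for V. cholerae, the read length of 100 bp is hypothetical, and the function names are invented for this example.

```python
import math


def mean_depth(read_lengths, genome_size_bp: int) -> float:
    """Mean depth = total sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return sum(read_lengths) / genome_size_bp


def reads_needed(target_depth: float, genome_size_bp: int, read_length_bp: int) -> int:
    """Smallest number of equal-length reads that reaches the target mean depth."""
    if read_length_bp <= 0:
        raise ValueError("read_length_bp must be positive")
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


# Assumed values for illustration only.
GENOME_SIZE_BP = 4_000_000   # approximate V. cholerae genome size
READ_LENGTH_BP = 100         # hypothetical read length
TARGET_DEPTH = 30            # target depth stated in the text

n_reads = reads_needed(TARGET_DEPTH, GENOME_SIZE_BP, READ_LENGTH_BP)
print(n_reads)
print(mean_depth([READ_LENGTH_BP] * n_reads, GENOME_SIZE_BP))
```

Mean depth is an average over the genome; individual positions may be covered more or fewer times than the target.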
This study was not randomised. Sequences were allocated into '7PET', 'LAT-1', and 'non-7PET' groups on the basis of their phylogenetic position.