Main

Salmonella typhi is a serovar of S. enterica, which is serologically O (lipopolysaccharide) type O9, O12; H (flagellin) type d; and Vi (extracellular capsule) positive. Humans are the only known natural host of S. typhi, with S. typhi showing limited pathogenicity for most animals. Isoenzyme analysis has suggested that isolates of S. typhi around the world are highly related2, a view confirmed using multi-locus sequence analysis (C. Kidgell et al., unpublished data). Multiple drug resistance (MDR) is a serious emerging threat to the treatment of infectious diseases: MDR S. typhi are resistant to commonly available antibiotics, and clinical resistance to fluoroquinolones, the most effective antimicrobials for the treatment of typhoid fever, has been reported3. Salmonella typhi CT18 is an example of an emerging MDR microorganism; in-depth genome analysis will contribute to our understanding of how such microorganisms adapt rapidly to new environmental opportunities that are presented by modern human society.

The principal features of the S. typhi CT18 chromosome and the two plasmids harboured by this strain are shown in Table 1. The beginning of the sequence was taken to correspond with minute 0 on the E. coli and Salmonella genetic maps4; the origin and terminus of replication, predicted by comparison with E. coli and confirmed by GC-bias (Fig. 1), are near 3.765 megabases (Mb) and 1.437 Mb, respectively. The metabolism of Salmonella and E. coli has been extensively studied over many decades4, and our analysis reveals few surprises in this area. However, the chromosome was predicted to encode 204 pseudogenes, which is remarkable in a genome of an organism capable of growth both inside and outside the host. Most of these pseudogenes (124 out of 204) have been inactivated by the introduction of a single frameshift or stop codon, which suggests that they are of recent origin. Five pseudogenes (priC, ushA, fepE, sopE2 and fliB) were re-sequenced from several independent S. typhi isolates, and were identical in every case. Frameshifts that are due to changes in the length of homopolymeric tracts account for 45 pseudogenes; this is a mechanism of variation that was previously shown to occur in E. coli5, although at much lower rates than are required for rapid phase variation in other organisms. Some of the pseudogenes (27 out of 204) are the remnants of insertion sequence (IS) transposases, integrases and genes of bacteriophage origin. However, there are a significant number (75 out of 204) that are predicted to be involved in housekeeping functions, such as a component of the DNA primase complex, priC, cobalamin biosynthesis genes cbiM, -J, -K and -C, the proline transporter proV, and the anaerobic dimethyl sulphoxide reductase components dmsA and dmsB. Many more mutations (46 out of 204) are in genes that are potentially involved in virulence or host interaction. Examples of this latter group include components of seven of the twelve chaperone–usher fimbrial operons6; the gene responsible for flagellar methylation, fliB; genes within or associated with previously described Salmonella pathogenicity islands (SPI) (for example, the sensor kinase ttrS that is associated with SPI-2 (ref. 7) and cigR, marT and misL from SPI-3 (ref. 8)); the leucine-rich repeat protein slrP, which is involved in host-range specificity9 and is secreted through a type III system; other type-III-secreted effector proteins including sseJ (ref. 10), sopE2 (ref. 11) and sopA; and the genes shdA, ratA and sivH, which are present in an island unique to Salmonellae infecting warm-blooded vertebrates12. A greater proportion of pseudogenes than expected lie within islands that are unique to S. typhi compared with E. coli (59% (120 out of 204) compared with 33% (1,505 out of 4,599) of all genes). This apparent inactivation of many of the mechanisms of host interaction may go some way to explaining the host restriction of S. typhi relative to other Salmonella serovars, and suggests that S. typhi may have passed through a recent evolutionary bottleneck.
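The enrichment of pseudogenes in S. typhi-specific islands quoted above can be checked with a few lines of arithmetic. The sketch below uses only the four counts given in the text. The binomial tail is an added illustration, not an analysis reported in the paper: it assumes each pseudogene falls inside an island independently with the genome-wide probability, which is only a rough approximation because genes are clustered and the pseudogenes are themselves part of the 4,599 genes.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Counts taken from the text.
pseudo_in_islands, pseudo_total = 120, 204
genes_in_islands, genes_total = 1505, 4599

observed = pseudo_in_islands / pseudo_total   # share of pseudogenes in islands
expected = genes_in_islands / genes_total     # share of all genes in islands
fold = observed / expected                    # fold enrichment

# Assumption (not from the paper): independent placement of pseudogenes.
tail = binomial_upper_tail(pseudo_in_islands, pseudo_total, expected)

print(f"pseudogenes in unique islands: {observed:.1%}")
print(f"all genes in unique islands:   {expected:.1%}")
print(f"fold enrichment:               {fold:.2f}")
print(f"binomial upper tail:           {tail:.2e}")
```

The two proportions reproduce the rounded 59% and 33% in the text; the tail probability should be read only as an order-of-magnitude indication under the stated assumption.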

Table 1 Features of the S. typhi genome
Figure 1: Circular representation of the S. typhi genome.

The outer scale is marked in megabases. Circles range from 1 (outer circle) to 9 (inner circle). Circles 1 and 2, genes on forward and reverse strand; circles 3 and 4, genes conserved with E. coli; circles 5 and 6, genes unique to S. typhi with respect to E. coli; circle 7, pseudogenes; circle 8, G+C content; circle 9, GC bias ((G - C)/(G + C); khaki indicates values >1; purple <1). All genes are colour-coded by function: dark blue, pathogenicity/adaptation; black, energy metabolism; red, information transfer; dark green, membranes/surface structures; cyan, degradation of macromolecules; purple, degradation of small molecules; yellow, central/intermediary metabolism; light blue, regulators; pink, phage/IS elements; orange, conserved hypothetical; pale green, unknown function; brown, pseudogenes.
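The GC bias plotted in circle 9 is the ratio (G - C)/(G + C) computed along the sequence. A minimal Python sketch of a windowed calculation follows; the window size, step and toy sequence are illustrative assumptions, not the parameters used for the figure. On a real chromosome the sign of the bias tends to switch near the origin and terminus of replication, which is how such a plot can support their predicted positions.

```python
def gc_skew(seq: str) -> float:
    """Return (G - C) / (G + C) for a sequence; 0.0 if it has no G or C."""
    s = seq.upper()
    g, c = s.count("G"), s.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """Return (window start, GC skew) pairs for windows along a linear sequence."""
    return [
        (start, gc_skew(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence (hypothetical): G-rich first half, C-rich second half.
toy = "GGGGGGAT" * 3 + "CCCCCCAT" * 3
profile = windowed_gc_skew(toy, window=8, step=8)

for start, skew in profile:
    print(start, f"{skew:+.2f}")
```

The sketch treats the sequence as linear; for a circular chromosome the final windows would need to wrap around to the start.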

Apart from some previously detected rearrangements13, which were caused by recombination between ribosomal RNA operons near the origin of replication, the genomes of S. typhi and E. coli14 are essentially collinear along their entire length (see Supplementary Information for a linear gene map and functional classification of identified genes; see also http://www.sanger.ac.uk/Projects/S_typhi). Although there are some cases of translocation of small gene blocks, most of the differences are due to insertions, deletions or replacements. Several of the large insertions that are present in Salmonella have been extensively studied. These insertions generally carry genes that are important for survival in the host (including two type III secretion systems and an array of effector proteins and metal-ion transporters), they have often been inserted adjacent to a stable RNA gene, and may carry a gene encoding an integrase or transposase-like protein15. These regions, termed SPIs, are believed to be fairly recent horizontal acquisitions, and may be self-mobile. However, as observed in the comparison between E. coli K12 and O157:H7 (ref. 16), there are many more insertions and deletions, including much smaller blocks (Fig. 2). Together there are 1,505 genes (32.7%) in 290 blocks that are unique to S. typhi relative to E. coli (see Supplementary Information), and 1,220 genes (28.4%) in 268 blocks that are unique to E. coli relative to S. typhi. Single-gene insertions account for 128 of the unique genes in S. typhi, and 456 genes are in insertions of 5 genes or less.
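The block statistics in this paragraph (number of blocks, single-gene insertions, genes in insertions of five genes or fewer) follow from grouping consecutive unique genes along the chromosome. The Python sketch below shows one way such a tally could be made. The input format (one boolean per gene, in chromosomal order) and the toy data are assumptions for illustration; the paper's own comparison was made with the Artemis Comparison Tool.

```python
from collections import Counter
from itertools import groupby


def unique_block_sizes(is_unique: list[bool]) -> list[int]:
    """Sizes of runs of consecutive genes flagged as unique, in gene order."""
    return [sum(1 for _ in run) for flag, run in groupby(is_unique) if flag]


def summarise(sizes: list[int], small: int = 5) -> dict:
    """Summary figures of the kind quoted in the text."""
    return {
        "blocks": len(sizes),
        "unique_genes": sum(sizes),
        "single_gene_blocks": sizes.count(1),
        "genes_in_small_blocks": sum(s for s in sizes if s <= small),
        "histogram": dict(sorted(Counter(sizes).items())),
    }


# Toy data (hypothetical): True marks a gene with no counterpart in the other genome.
toy_flags = [False, True, False, False, True, True, True, False,
             True, True, True, True, True, True, False, True]

sizes = unique_block_sizes(toy_flags)
summary = summarise(sizes)
print(sizes)
print(summary)
```

The sketch ignores the circularity of the chromosome, so a block spanning the first and last gene would be counted as two.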

Figure 2: Distribution of insertions and deletions in S. typhi relative to E. coli and S. typhimurium.

The graph shows the number of insertion/deletion events plotted against the size of the inserted or deleted element (shown as number of genes), clearly indicating that most of the events involve a small number of genes. Values above the line represent genes present in S. typhi; values below the line represent genes absent in S. typhi. Dark bars show the comparison with S. typhimurium; light bars with E. coli.

Among the larger blocks are the previously characterized SPIs 1–5 and seven prophage elements. There are also at least five more islands that have the characteristics of SPIs. Of the previously published SPIs we found that the original sequence of SPI-4 (ref. 17) varied most markedly from our sequence. SPI-4 was originally predicted to encode 18 genes, but our analysis of this region revealed the presence of only 8 coding sequences (CDS), two of which (STY4458 and STY4459) spanned most of those originally observed (these are represented by just one CDS in S. typhimurium). SPI-4 carries three CDS that are predicted to encode a type I secretion system—STY4458 and STY4459 are large, highly repetitive, and are weakly similar to RTX-like toxins. Of the new SPIs, SPI-6 (59 kb) encodes the safA-D and tcsA-R chaperone–usher fimbrial operons6. SPI-7 (134 kb) encodes the Vi biosynthetic genes18, the SopE prophage19 and a type IVB pilus operon20. SPI-8 (6.8 kb) encodes two bacteriocin pseudogenes (STY3280 and STY3282) and a degenerate integrase; notably, genes conferring immunity to the bacteriocins remain intact. SPI-9 (16 kb), like SPI-4, encodes a type I secretory apparatus and a single, large RTX-like protein (STY2875). SPI-10 (33 kb) carries phage 46 and the sefA-R chaperone–usher fimbrial operon. In addition to these large islands, there are many insertions of smaller gene blocks and individual genes that may be involved in pathogenicity. These include: numerous secreted and integral membrane proteins; many new regulators; a set of eight genes that are potentially involved in extracellular polysaccharide biosynthesis (STY0759-0768, three of which are pseudogenes); a predicted iron-uptake ABC transporter (STY0802-0803); two putative efflux pumps (STY0278 and STY0414); and genes encoding several secreted effector proteins, including SifA, SopD, SopD2 and SspH2.

As would be expected, the relationship between S. typhi and S. typhimurium21 is very much closer than that between S. typhi and E. coli, although there are still significant differences. The same conservation of gene order is apparent (see Supplementary Information), disrupted only by the rearrangements around the rRNAs and a large inversion around the terminus in S. typhimurium13. As with E. coli, the differences are not limited to a few large blocks. Single-gene insertions account for 42 unique genes in S. typhi, and 103 genes are in insertions of 5 genes or less. In all, there are 601 genes (13.1%) in 82 blocks that are unique to S. typhi compared with S. typhimurium (see Supplementary Information), and 479 genes (10.9%) in 80 blocks that are unique to S. typhimurium relative to S. typhi. Several significant insertions are apparent in S. typhi. These insertions include: the staA-G, tcfA-D, steA-G and stgA-G chaperone–usher fimbrial systems6; a homologue of the E. coli haemolysin E (STY1498); a short insert carrying homologues of the Campylobacter toxin cdtB (STY1886) and the Bordetella pertussis toxin genes ptxA and ptxB (STY1890 and STY1891); a putative polysaccharide acetyltransferase (STY2629); phages 10, 15, 18 and 46; and SPIs 7, 8 and 10. Aside from these relatively few insertions, much of the difference in phenotype and host range between S. typhi and S. typhimurium may well be explained by the pseudogenes described above. Of the 204 pseudogenes that have been discovered in S. typhi, 145 are present as intact genes in S. typhimurium, whereas only 23 are present as pseudogenes (15 of which have the same inactivating mutations).

Salmonella typhi CT18 harbours two plasmids, the larger of which (pHCM1) is conjugative and encodes resistances to all of the first-line drugs used for the treatment of typhoid fever. pHCM1 shares around 168 kb of DNA at greater than 99% sequence identity with R27 (ref. 22), an IncHI1 plasmid that was first isolated in the 1960s from S. enterica. Shared functions include plasmid transfer and partition, and pHCM1 has apparently been derived by the insertion of 46 CDS, totalling around 50 kb, primarily at two positions in an R27-like ancestral plasmid (Fig. 3). Eighteen of these CDS are involved with resistance to antimicrobial agents or heavy metals, and 16 are of unknown function, just three of which are similar to other plasmid-borne genes. Several intact and degenerate integrases and transposases are clustered around these two regions of insertion, and mercury resistance operons have been acquired independently in each. Many clinically relevant resistance genes were identified, including dhfr1b (trimethoprim), sulII (sulphonamide), catI (chloramphenicol), bla (TEM-1; ampicillin) and strAB (streptomycin). The last four resistance determinants appear to have been inserted into the tetC gene of the R27 tetracycline resistance operon in several sequential IS element-mediated events (Fig. 3c). Although it has been reported that MDR plasmids bestow enhanced clinical virulence23, we can find no obvious virulence-associated genes on pHCM1.

Figure 3: Circular representations of pHCM1 and pHCM2.

For the circular diagrams, the outer scale is marked in kilobases. Circles are numbered to the same scheme as in Fig. 1. Circles 1 and 2, all genes; circle 3, regions conserved in R27 (for pHCM1) or pMT1 (for pHCM2); circle 4, phage and IS genes; circle 5, G+C content; circle 6, GC bias ((G - C)/(G + C); khaki indicates values >1; purple <1). The linear figure represents one possible order of IS-mediated acquisition of resistance determinants in pHCM1 on the basis of likely pairs of IS elements and interrupted genes. Colour-coding for genes: pale green, unknown function; orange, conserved hypothetical; grey, bacteriophage/IS elements; brown, pseudogenes; red, DNA metabolism; dark green, membrane/surface structures; yellow, metabolic genes; cyan, nucleases; purple, resistance determinants; dark blue, plasmid functions.

pHCM2 is phenotypically cryptic, yet it shares over 56% of its sequence (at greater than 97% DNA identity) with the Yersinia pestis virulence-associated plasmid pMT1 (ref. 24). pMT1 encodes major virulence-associated determinants of Y. pestis, and the acquisition of this plasmid was a significant step in the evolution of the plague bacilli25. However, pHCM2 lacks the capsular antigen operon (caf1) and murine toxin genes that are characteristic of Y. pestis pMT1. The sequences that are shared between pMT1 and pHCM2 are not contiguous but fall into several blocks (Fig. 3). Examination of the G+C content of the unique and conserved sequences of pMT1 and pHCM2 suggests that pMT1 may have been derived from a pHCM2-like precursor plasmid26. We have detected plasmids related to pHCM2 in S. typhi only from Southeast Asia, but most S. typhi do not harbour this plasmid (data not shown). The CDS that are unique to pHCM2 show similarities to a number of bacteriophage genes and to genes directly or indirectly involved in DNA biosynthesis and replication. These include a gene cluster that encodes genes similar to thymidylate synthetase, dihydrofolate reductase, ribonuclease H and ribonucleotide diphosphate reductase, as well as a putative primosomal gene cluster. In bacteriophage T4 these genes form an integral part of the primase replication complex27 that facilitates rapid phage DNA biosynthesis and replication.

The sequence of the S. typhi genome, together with E. coli K12 (ref. 14) and O157:H7 (ref. 16), reveals an unexpectedly large diversity in gene complements among these organisms. Much of this diversity is located in discrete gene clusters that are spread throughout the different genomes. In contrast to this diversity, these enteric microorganisms exhibit marked synteny in their large-scale genomic organization, bearing in mind that E. coli and S. enterica diverged about 100 Myr ago28. The conserved genes may be a reflection of the basic lifestyle of the bacteria, requiring intestine colonization, environmental survival and transmission. The unique gene clusters probably contribute to adaptation to environmental niches and to pathogenicity. The pseudogene complement of S. typhi has implications for our understanding of the tight host restriction of this organism, and raises the question of whether it may be possible to eradicate S. typhi and typhoid fever altogether.

Methods

Salmonella typhi CT18 was isolated in December 1993, in the Mekong Delta region of Vietnam, from a 9-year-old girl who was suffering from typhoid. The strain was isolated from blood using routine culture methods23, and after serological and metabolic confirmation of the strain as S. typhi it was immediately frozen in glycerol at -70 °C. The genome sequence was obtained from 97,000 end sequences (giving 7.9× coverage) derived from several pUC18 genomic shotgun libraries (with insert sizes ranging from 1.4 to 4.0 kb) using dye terminator chemistry on ABI377 automated sequencers. This was supplemented with 0.7× sequence coverage from M13mp18 libraries with similar insert sizes. End sequences from a larger insert plasmid (pSP64; 1.9× clone coverage, 10–14-kb insert size) and lambda (lambda-FIX-II; 0.4× clone coverage, 20–22-kb insert size) libraries were used as a scaffold, and the final assembly was verified by comparison with restriction-enzyme digest patterns using pulsed-field gel electrophoresis (data not shown). Total sequence coverage was 9.1×. The sequence was assembled, finished and annotated as described29, using Artemis30 to collate data and facilitate annotation. In addition we used a genefinder that was trained specifically for S. typhi, which uses a hidden Markov model with modules for the coding region, start and stop codons, and the ribosome-binding site (T.S.L. and A.K., unpublished data). The genome and proteome sequences of S. typhi and S. typhimurium or E. coli were compared in parallel to identify deletions and insertions using the Artemis Comparison Tool (ACT) (K. Rutherford, unpublished data; see also http://www.sanger.ac.uk/Software/ACT/). Pseudogenes had one or more mutations that would ablate expression, and were identified by direct comparison with S. typhimurium; each of the inactivating mutations was subsequently checked against the original sequencing data.
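The idea of calling a pseudogene by direct comparison with an intact orthologue can be illustrated with a deliberately simplified Python sketch. It is not the procedure used for the annotation, which relied on manual comparison and checking against the sequencing data. The function names, the length-based frameshift test and the toy sequences are all hypothetical: the test flags a coding sequence whose length differs from the reference by a non-multiple of three, or that carries an in-frame stop codon before its final codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(cds: str):
    """Codon index of the first in-frame stop codon, or None if there is none."""
    cds = cds.upper()
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def classify_cds(query: str, reference: str) -> str:
    """Crude label for a query CDS relative to an intact reference orthologue."""
    # A length difference that is not a multiple of 3 implies a shifted frame.
    if (len(query) - len(reference)) % 3 != 0:
        return "frameshift"
    stop = first_in_frame_stop(query)
    last_codon = len(query) // 3 - 1
    # A stop codon before the final codon truncates the protein.
    if stop is not None and stop < last_codon:
        return "premature stop"
    return "intact"


# Toy sequences (hypothetical), reference reads ATG AAA CCC GGG TAA.
reference = "ATGAAACCCGGGTAA"
print(classify_cds(reference, reference))
print(classify_cds("ATGAAATAAGGGTAA", reference))  # in-frame TAA at codon 2
print(classify_cds("ATGAAACCGGGTAA", reference))   # one base deleted
```

A length comparison misses compensating insertions and deletions and cannot detect disrupted start codons or promoters, so any real use would need an alignment-based check.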