There are estimated to be 16 million cases of typhoid and more than 600,000 resulting deaths annually1,2. Typhoid is a systemic infection caused primarily by S. enterica serovar Typhi. Serovar Paratyphi A is the second most prevalent cause of typhoid, responsible for one third of cases or more in southern and eastern Asia3,4,5. Paratyphi A and Typhi cause a similar illness, with relapsing fever6. Paratyphi A generally causes a milder disease but has been particularly virulent in some outbreaks. The increasingly prevalent antibiotic-resistant forms of Paratyphi A are more difficult and expensive to treat4,5,7.

Because the only known host for Typhi and Paratyphi A is humans, the disease could be eradicated, as smallpox was and polio soon might be8. Typhoid is rare in industrialized countries as improved sanitation has interrupted transmission through contaminated water. While the developing world awaits improved sanitation, there are moderately effective vaccines for Typhi, with weak cross-protection against Paratyphi A. A vaccine for Paratyphi A is under development9.

Typhi and Paratyphi A are members of different O antigen serogroups, D1 and A, based on a minor difference in lipopolysaccharide sugars. Despite this difference, microarray data on shared gene content10,11,12 indicate that Paratyphi A and Typhi are closely related, with Paratyphi A more distant from other members of serogroup A12. Lateral transfer of serogroup genes has been observed13, and so it is possible that Paratyphi A arose by transfer of serogroup A genes into a strain very similar to Typhi. Typhi strains show little DNA sequence diversity, indicating that this serovar arose as a human-specific pathogen only 15,000–150,000 years ago14. Multilocus enzyme electrophoresis data indicate that housekeeping proteins are homogeneous in Typhi and in Paratyphi A15, suggesting that Paratyphi A is also a clone of recent origin.

To facilitate genomic comparisons between Paratyphi A and Typhi to further address their phylogeny and the basis of their recent specialization, we determined the complete nucleotide sequence of S. enterica serovar Paratyphi A strain ATCC 9150 (SARB42). This strain was chosen because we have a physical map of its genome16, multilocus enzyme electrophoresis data15 and microarray-based comparative genomics with other serovars10,11,12. This strain will be abbreviated SPA.


We annotated 4,263 protein coding sequences in SPA. Table 1 and Figure 1 outline the basic features of the genome. An annotation of this genome and a gene-by-gene comparison with related genomes is given in Supplementary Table 1 and Supplementary Figure 1 online.

Table 1 Genome essentials
Figure 1: The genome of Paratyphi A.

First circle: Distribution of regions unique to SPA or shared with Typhi or Typhimurium. Three phage are annotated on the outside of the circle. Second circle: Location of pseudogenes found in Paratyphi A and Typhi. Third circle: G+C content. Blue, high G+C; red, low G+C. Fourth circle: Violet and dark yellow represent GC skew. A more detailed linear map can be found in Supplementary Figure 1 online.
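The third and fourth circles plot two standard sequence statistics. As a minimal sketch, G+C content and GC skew can be computed over sliding windows as below. The skew formula (G − C)/(G + C) is the conventional definition and is assumed here; the window and step sizes are illustrative defaults, not values taken from the figure.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq):
    """Conventional GC skew, (G - C) / (G + C); 0.0 if there is no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def profile(seq, window=1000, step=500):
    """Return (start, gc_content, gc_skew) for each full window along seq.

    window and step are assumed values for illustration only.
    """
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        rows.append((start, gc_content(chunk), gc_skew(chunk)))
    return rows
```

For a circular chromosome, a sign change in the skew profile typically marks the replication origin and terminus, which is why skew is drawn as two colours.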

Differences in gene content between Paratyphi A and Typhi

The SPA genome is 4,585 kb, about 200 kb smaller than the two Typhi genomes that have been sequenced (4,809 kb and 4,792 kb)17,18 and the next nearest sequenced relative, Typhimurium LT2 (4,857 kb)19, primarily because SPA has fewer prophage or other mobile elements. The extra genes in Typhi include 24 regions of two or more genes, the largest of which is SPI-7, a 134-kb island including ten genes involved in synthesis of the Vi exopolysaccharide in Typhi20. Typhi has at least three large putative phage or phage remnants that are missing from SPA21,22. The order of orthologous genes is conserved relative to Typhimurium, except that recombination between the rrnH and rrnG operons inverted half the chromosome that contains the TER region16. Table 2 shows the number of genes shared by each pair of genomes among SPA, Typhi, Typhimurium and Escherichia coli and absent in the two others. Consistent with gene content measured using microarrays10,12, SPA and Typhi have more of these shared derived genes than any other pair.

Table 2 Genes unique to pairs of genomes

Most of the genes present in SPA and absent from Typhi occur in two prophage regions, and an additional prophage is located at a different site in Typhi. A lambdoid prophage of 41 kb, designated SPA-1, is encoded in the 47 genes between SPA2385 (gtrC) and SPA2431 (int), inserted next to SPA2432 (thrW for tRNAThr), where P22 phage can insert. Most genes of SPA-1 have 90–100% identity to those of phage ST64T23 and a lesser identity to phage P22. A P2-type phage designated SPA-2-SopE is inserted into the ssrA tm-RNA gene; it contains genes SPA2554 to SPA2600 (46 genes, 34 kb). Typhimurium has Fels2 (85–100% amino acid identity) inserted at the same site, whereas Typhi has a more closely related phage (95–100% amino acid identity) inserted as part of SPI-7, in a different part of the chromosome. The SPA and Typhi phages, but not Fels2, carry sopE, encoding an invasion-associated secreted protein. A third prophage region of 25 kb, designated SPA-3-P2, encoded in SPA2601 to SPA2625, has genes with protein identities of 40% to the P2-type phages HP1 and HP2 of Haemophilus influenzae. At least one phage is active: we treated a culture of SPA with mitomycin C and collected phage from the supernatant after 5 h using polyethylene glycol precipitation. Phage SPA-3-P2 was present in sufficient quantities for DNA to be detected by microarray hybridization (Supplementary Fig. 2 online).

The genome sequence comparisons showed only 12 other insertions of two or a few genes in SPA compared with Typhi. Most are also missing from Typhimurium and have no close homologs with known function in the GenBank database. These identifiable clusters include the stk and ste fimbrial clusters; the former is present only in SPA, and the latter is present in SPA and Typhimurium but not Typhi.

We used a gene microarray, constructed for Paratyphi A by supplementing a Typhimurium/Typhi microarray24 with PCR products from genes unique to SPA, to monitor the presence of genes in twelve strains of Paratyphi A isolated at diverse times and places (Supplementary Table 2 online). Each of the three putative phage regions described above was missing from at least one Paratyphi A strain (Table 3 and Fig. 2). Only two other small regions of two or three genes each were missing from some strains; thus, diversity in gene content in Paratyphi A is less than the 13 polymorphic regions measured in nine Typhi strains21, suggesting that Paratyphi A is a more recent clone.

Table 3 Gene content differences among strains of Paratyphi A
Figure 2: Visualization of gene content differences among strains of Paratyphi A.

Blue, present; red, absent; purple, intermediate or no data. Predictions were established as described12. Each region is described in Table 3 and the strains are described in Supplementary Table 2 online.

Lack of newly acquired functions

The evolution from host generalists, such as Typhimurium, to human-restricted variants, such as Paratyphi A and Typhi, might occur through the addition of genes by lateral transfer. We identified a few regions that are present in SPA and Typhi but for which close homologs rarely occur in other S. enterica serovars (129 genes, found in fewer than 30 of 78 strains in 27 serovars12, are annotated in Supplementary Table 1 online). These regions include the SPA-2-SopE phage, a restriction cluster, a fimbrial cluster and the rfb cluster. In addition, there are six clusters of a few genes of mostly unknown function, four of them exceptionally A+T rich. These genes added by lateral transfer might contribute to human-specific enteric disease and warrant further investigation. In particular, genes may have been acquired for 'chronic carriage', in which bacteria can be carried asymptomatically for years, particularly in the gall bladder, and spread, through the stool, to new victims. But Typhimurium, the most prevalent serovar of S. enterica, infects humans (where it causes gastroenteritis) and causes a typhoid-like systemic disease in mice. Thus, many, though not all, of the features needed to make a human-specific typhoid already exist in widespread serovars of S. enterica. Although lateral transfer probably has a role, changing from a phenotype similar to that of Typhimurium to that of a human-restricted typhoid-causing variant could proceed primarily by genome degradation through pseudogene formation, as has recently been proposed based on comparative genomics in Bordetella species25.


The most notable feature of the SPA genome is the extent of genome degradation. There are 173 pseudogenes (4.1% of annotated SPA coding sequences). These are usually caused by single-base mutations that lead to a frameshift or a stop codon in genes that are otherwise >98% identical to their orthologs in Typhi or Typhimurium (Supplementary Table 3 online). Both Typhi genomes have more than 200 pseudogenes (5% of annotated coding sequences)17,18. Only a few of the more than 100 eubacterial genomes sequenced to date have abundant pseudogenes, including those of Mycobacterium leprae26, Rickettsia prowazekii27, human-specific variants of Bordetella25, Bartonella28, Yersinia pestis29 and Shigella flexneri30,31. Among these, Y. pestis, S. flexneri, Typhi and Paratyphi A are all in the family Enterobacteriaceae and all seem to have evolved in the last few thousand years from gastrointestinal pathogens to human-specific systemic pathogens30,31,32,33. Perhaps genome degradation through pseudogene formation sheds genes that are no longer needed in their new niche. In contrast, the host-generalists, such as S. enterica serovar Typhimurium and most E. coli strains, have 40 or fewer easily recognizable pseudogenes (only 1% of their coding sequences).

Of 154 genes found in both SPA and Typhi but absent from Typhimurium, 14 (9%) are pseudogenes in SPA and 14 (9%) are pseudogenes in Typhi. This is more than twice the frequency of pseudogenes for all coding sequences (4% and 5%, respectively; P < 0.001). Pseudogenes may be concentrated in coding sequences that are not shared with Typhimurium because they seldom encode core life functions.

Paratyphi A and Typhi are closely related, but their pseudogenes, postulated no longer to be needed in their specialization in human enteric disease, evolved independently. We found that 166 of the 173 pseudogenes in SPA can be aligned as probable orthologs in Typhi, and only 28 are also pseudogenes in Typhi (Table 4). Twenty-seven pseudogenes have lesions in different bases, and only one gene is mutated at the same site, a hypermutable poly(A), in both genomes, which may have occurred independently. Two other genes are probably recently deleted in both lineages. This affords a unique opportunity to observe the genomes of two almost identical pathogens as they degrade and specialize in an almost identical niche. Some of the functions that they both lose could be involved in pathways they no longer need, such as gastroenteritis or infecting other hosts, and others might be partly responsible for the newly acquired phenotypes that they share, such as chronic carriage. We describe below how many of the genes and pathways that are disrupted in both serovars have important roles in pathogenesis in other Salmonella species; therefore, their loss must have altered and narrowed the phenotype of both Paratyphi A and Typhi. Unless otherwise mentioned, these genes do not have close homologs or known functional paralogs that would compensate for their loss.

Table 4 Pseudogenes found in both Paratyphi A and Typhi

Degradation of certain pathogenicity functions

Three of the pseudogenes in both lineages, sivH, ratB and shdA, are located in the same 25-kb pathogenicity island, called CS54 (ref. 19). This island is present only in subspecies 1 of S. enterica, which contains most mammal- and bird-pathogenic Salmonella strains. Mutations in these genes in Typhimurium reduce early-stage infection in mice34: sivH and shdA reduce colonization of Peyer's patches of the terminal ileum; shdA and ratB reduce colonization of the cecum; shdA reduces colonization of the mesenteric lymph nodes and spleen; and shdA and ratB reduce fecal shedding. Mutations in ratA and sivI in the same island, however, have no effect on the infection phenotype in Typhimurium34, and these genes are not pseudogenes in SPA or Typhi. Thus, in this locus, pseudogenes in SPA and Typhi occur only in genes involved in intestinal colonization and persistence, consistent with the phenotype of Paratyphi A and Typhi, which generally do not colonize the intestines.

There are two type-III secretion systems, encoded by Salmonella pathogenicity island 1 (SPI-1) and SPI-2, respectively, that are crucial in pathogenicity; both are present in SPA. These systems secrete Salmonella-translocated effectors (STEs) into the host. SPI-1 is mainly involved in Salmonella interaction with the host cell, and SPI-2 is involved in secreting proteins inside a vacuole in the host cell. These proteins subvert host functions. Approximately 20 STEs are exported by these type-III secretion systems. Four STEs, sopA, slrP, sopD2 and sseJ, are pseudogenes in both SPA and Typhi. In addition, the STEs sifB and sspH2 are mutated in SPA and sopE2 is mutated in Typhi. SopA protein, secreted through SPI-1, is involved in diarrhea during Typhimurium infection of calves35; its loss in SPA and Typhi may be permitted because these strains seldom cause diarrhea. Some STEs that are mutated seem to be involved in systemic disease. SlrP protein is secreted using both SPI-1 and SPI-2 in Typhimurium36. SlrP has a role in the 'typhoid' that afflicts mice infected with Typhimurium37 but not in the gastrointestinal disease caused by Typhimurium in cattle. Mutation of the SopD2 protein, which is exported by SPI-2 in Typhimurium into late endocytic compartments38, prolongs the survival of mice infected with Typhimurium. The SseJ protein is secreted through SPI-2 in Typhimurium and targeted to the bacterium-containing vacuole; deletion of the gene attenuates Typhimurium replication in mice after intraperitoneal inoculation39. Perhaps, rather than increase virulence, pseudogenes in some of these STEs prolong and moderate infection by SPA and Typhi, as the same mutations seem to do in Typhimurium-induced murine typhoid. Survival of a substantial fraction of hosts may be important, because typhoid transmission partly relies on up to 4% of survivors of infection becoming long-term carriers, as made famous in the case of “Typhoid Mary”40,41,42.

Genome degradation is presumably continuing in these two serovars. Genes degraded in one serovar may be degraded in the future in the other serovar: SPA has pseudogenes in the STE sifB, which is secreted through SPI-2 and targeted to the bacterium-containing vacuole but whose role in infection is unknown, and in sspH2, which interacts with the actin-binding protein filamin43. Typhi has a pseudogene in sopE2, whose protein is secreted through SPI-1 and is a guanine-nucleotide-exchange factor for the small GTPases Cdc42 and Rac5. A mutation in sopE2 significantly reduced fluid accumulation in bovine ligated ileal loops. Thus, SopE2, like SopA, is a key virulence factor responsible for diarrhea in calves and also contributes to colitis in mice35.

Typhimurium contains five chemotaxis-specific transmembrane receptors that mediate responses to specific attractant and repellent stimuli and control the direction of flagella rotation and, thus, smooth swimming or tumbling: Tar, Tsr, Trg, Aer and Tcp44. A knockout mutation can lead to smooth swimming. Tsr, the chemotaxis receptor for serine, is mutated in Typhi by a frameshift caused by a change in the length of a simple repeat. Trg is deleted in both SPA and Typhi, with different deletion ends in each genome, and is also mutated in two strains of Y. pestis29 and in uropathogenic E. coli45. Tar, which is the receptor for chemotaxis induced by aspartate and maltose, has a 98–amino acid in-frame internal deletion in SPA that removes some cytoplasmic regions but may allow the retention of some functions. The deleted sequence is exactly flanked by a direct repeat. A Tar mutant of Typhimurium is hyperinvasive for human epithelial cells and in mice46. In Vibrio cholerae, smooth-swimming nonchemotactic mutants occur during infection and out-compete the wild type47. Thus, mutations in chemotaxis receptors may increase smooth swimming and have a role in infection of Paratyphi A, Typhi and other pathogens.

Efficient iron scavenging is an important factor in pathogenesis. Conversely, toxicity due to excess iron is a concern in other environments. A large number of genes involved directly or indirectly in iron metabolism are mutated in SPA and Typhi. These include the ttr operon and the surface receptors fhuA and fhuE. The latter is also a pseudogene in S. flexneri, another recently evolved host-specialist invasive pathogen30,31. rhlB, sufS and sufD are mutated in SPA. A discussion of the possible phenotypes of other pseudogenes, based on the phenotype of mutations in other Salmonella species, can be found in the Supplementary Note online.

Disruption of different genes in the same pathway

The influence of pseudogenes on the physiological similarity in SPA and Typhi would be underestimated by considering only mutations in the same gene rather than looking at the pathway level. Not all pathways are fully understood, and so estimating this effect is difficult. Nevertheless, many known and putative operons are mutated in different genes in both SPA and Typhi. Supplementary Table 4 online lists all the putatively orthologous pseudogenes in both serovars, as well as in S. flexneri and Y. pestis. Some pathways have multiple mutations in both serovars. Examples include the cbi gene cluster, involved in cobalamin synthesis (for vitamin B12), of which cbiM, cbiK, cbiJ and cbiC are mutated in Typhi, and only cbiA is mutated in SPA. The adjacent pdu cluster, involved in propanediol degradation by a B12-dependent pathway, is presumably inactivated by the cbi mutations but is also inactivated by mutations in pduN in Typhi and in pduF in SPA.

Adhesion to host cells is mediated by fimbriae on the bacterial cell surface. SPA and Typhi share 11 clusters of genes involved in fimbrial synthesis. Of these, two (bcf and saf) have pseudogenes for different genes in SPA and Typhi; five (fim, sef, ste, stb and sth) have pseudogenes only in Typhi; and one (csg) has a pseudogene only in SPA. stf, a fimbrial cluster that SPA shares with Typhimurium, has a pseudogene. Thus, most fimbrial clusters are mutated in one or both serovars, resulting in few clusters being fully functional in both serovars. This probably has important implications for adhesion.

Loss of phase variation

Most strains of S. enterica subspecies 1 have a switching mechanism leading to diphasic flagella. This 'phase variation' occurs through the Hin protein, which causes the hin region to flip orientation48. Both Typhi and Paratyphi A are monophasic for phase 1 flagella, but their mutations differ: in Typhi the phase 2 and hin genes are deleted, whereas in SPA a more subtle frameshift mutation in hin prevents switching, locking the cell into phase 1 expression.

Other possible losses of function

The deletion of the phase 2 genes in Typhi is probably quite recent, as it does not occur in other relatives, including Paratyphi A. To identify other potential recent deletions, we inspected an extensive data set of gene distributions among the Salmonella species12 for genes that are generally present in Salmonella species but absent in Paratyphi A or Typhi. Fifty genes are present in 64 or more of 71 other strains (90%) in 14 other serovars, but absent in Paratyphi A or Typhi. The named genes include sseJ, encoding an STE, and the Trg chemotaxis receptor for ribose and galactose. The latter gene is present in all of the 71 strains and has different deletion ends in SPA and Typhi, consistent with separate and recent deletion events. Other genes that are absent from either of these genomes are discussed in the Supplementary Note online. Deletion seems to have contributed to genome degradation in Paratyphi A and Typhi, but the extent is hard to assess because the timing and even the existence of a deletion is sometimes hard to determine. In the absence of evidence for an unusually high rate of gene loss by deletion in these genomes, it seems that point mutation may be the fastest route to gene inactivation, with deletion of genes (and pseudogenes) and operons being a slower process.

No doubt, some pseudogenes are unrecognized in the annotation. Pseudogenes in Typhi that are apparent only upon sequencing of Paratyphi A are listed in Supplementary Table 5 online. For example, some short open reading frames that we have not annotated could encode proteins, which may have been mutated in Paratyphi A. In addition, missense amino acid changes could destroy function even though the full open reading frame is preserved. One example has already been serendipitously discovered: a restriction endonuclease that is inactive in SPA is identical to an active enzyme in another Salmonella species, except for a single amino acid change at a dimerization interface49.

At a minimum, both genomes can be expected to continue to shed those genes that currently remain intact but are part of now-nonfunctional systems.

Divergence of gene degradation in Paratyphi A and Typhi

The functions that have been disrupted independently in both SPA and Typhi are emphasized here because some seem to be particularly important in other Salmonella species. But this does not mean that these two serovars are converging on an identical phenotype. On the contrary, such an outcome is highly unlikely because mutations arise in a different order in each serovar, thereby affecting the fitness of subsequent mutations. Lateral transfer to one lineage, such as the Vi antigen in Typhi20, will also change the course of evolution. An estimate of the divergence during genome streamlining can be derived from the overlap among the 166 pseudogenes in SPA and 219 in Typhi (among probable orthologs). If all permissible orthologous genes mutated in one genome were equally likely to be mutated in both genomes (a null hypothesis where there is perfect potential for future convergence), then an overlap of 30 genes mutated in both genomes would be observed if the total population size for permissible pseudogenes in both genomes is 166 × 219 / 30 = 1,212 coding sequences (31%). Of 135 orthologous coding sequences that contain a simple repeat of (X)8, however, 16 are pseudogenes in Paratyphi A and 18 in Typhi, most of these by a frameshift at the simple repeat, indicating that when a coding sequence is hypermutable, a pseudogene is permissible for up to 15% of coding sequences. By extrapolation, given enough time, 505 coding sequences of 4,263 would permit mutation in Paratyphi A, and a similar number in Typhi. If both genomes are converging on an identical phenotype, the expected number of shared pseudogenes in the two lineages is approximately 166 × 219 / 505 = 72, but only 30 shared pseudogenes are observed. These discrepancies indicate that these genomes are probably evolving along partly overlapping but distinct genotypic pathways, despite the notable phenotypic similarities in their specialization to the same host and enteric disease.
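The arithmetic behind this estimate can be retraced in a few lines. The sketch below assumes, as the text does, that under the null hypothesis pseudogenes fall independently in each genome within a common pool of permissible genes, so that the expected overlap is the product of the two pseudogene counts divided by the pool size. The function and variable names are illustrative only.

```python
def implied_pool_size(n_a, n_b, shared):
    """Pool size at which `shared` overlapping genes would be expected
    if n_a and n_b genes were drawn independently from the same pool."""
    return n_a * n_b / shared


def expected_shared(n_a, n_b, pool):
    """Expected overlap of two independent draws of n_a and n_b genes
    from a pool of `pool` genes."""
    return n_a * n_b / pool


# Counts quoted in the text.
SPA_PSEUDOGENES = 166       # SPA pseudogenes with probable Typhi orthologs
TYPHI_PSEUDOGENES = 219     # Typhi pseudogenes among probable orthologs
SHARED_OBSERVED = 30        # genes disrupted in both serovars
REPEAT_CDS = 135            # orthologous coding sequences with an (X)8 repeat
REPEAT_PSEUDO_SPA = 16      # of those, pseudogenes in Paratyphi A
TOTAL_CDS = 4263            # annotated SPA coding sequences

# Pool size implied by the observed overlap.
pool_if_random = implied_pool_size(SPA_PSEUDOGENES, TYPHI_PSEUDOGENES,
                                   SHARED_OBSERVED)

# Extrapolate the hypermutable-repeat fraction to the whole genome.
permissible_cds = REPEAT_PSEUDO_SPA / REPEAT_CDS * TOTAL_CDS

# Overlap expected if both genomes drew from that smaller pool.
expected = expected_shared(SPA_PSEUDOGENES, TYPHI_PSEUDOGENES,
                           round(permissible_cds))

print(round(pool_if_random), round(permissible_cds), round(expected))
```

The gap between the roughly 72 shared pseudogenes expected under full convergence and the 30 observed is the discrepancy on which the divergence argument rests.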


Paratyphi A and Typhi have genomes of similar gene content although they are in different serogroups. Paratyphi A may be a more recently derived clone than Typhi, because Paratyphi A strains have less diverse gene content than Typhi. Both genomes have many pseudogenes. The smaller number of pseudogenes in Paratyphi A indicates that Paratyphi A may have arisen more recently, if the accumulation of pseudogenes has proceeded at a similar rate in both serovars. Genome degradation has occurred independently in each serovar. Thus, nature has done an experiment twice, degrading the genomes of two almost identical pathogens as they specialize in an almost identical niche. Most or all of the mutations in both serovars are probably due to neutral drift among genes that are no longer needed. Although it might be expected that the same genes would be dispensable to both serovars in their new narrow niche, fewer genes than expected are disrupted in both serovars, indicative of divergence in genome degradation.

Much can be inferred from the nature of the genes that are disrupted in one or both of these genomes. For example, pathogenesis-associated genes that have remained intact in both genomes remain candidates for a role in enteric disease, whereas those genes that are mutated in either genome may be dispensable for enteric disease in humans. These latter genes would include those that are used for functions, such as gastroenteritis or infection of other hosts, that are no longer needed in these more specialized pathogens as they converged on a similar phenotypic niche. Notably, the 30 genes that are independently mutated in both serovars include many of the known virulence genes for gastrointestinal infection, and known STEs. In addition, many pathways and functions are disrupted by mutations in different genes in each serovar, including chemotaxis, iron transport and response, and surface structures such as fimbrial clusters.



We sequenced the complete 4.6-Mb genome of S. enterica serovar Paratyphi A (American Type Culture Collection 9150) using a shotgun method, supplemented with end sequencing of a fosmid library and a comparison to the restriction map of this strain. We cloned sonicated and size-fractionated DNA into M13 (1.5-kb inserts) and plasmid vectors (2-kb to 4-kb inserts). We sequenced subclones using dye-primer and dye-terminator chemistry on ABI 3700 and 3730 sequencing machines. We assembled 92,648 sequence reads, representing 8.7-fold final coverage, using the PHRAP assembly program. We then sequenced under-represented areas and ambiguities, for a final estimated accuracy of 99.99%. The assembled genome sequence is in agreement with the restriction map16. The ordered fosmid library is available on request.
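The coverage figure can be related to the read count through the usual definition of fold coverage, total sequenced bases divided by genome length. The sketch below uses the 4,585-kb genome size reported for SPA and back-calculates the mean read length implied by 92,648 reads at 8.7-fold coverage. This is an illustrative calculation only: the mean read length is not stated in the text, and the reported coverage may have been computed after quality trimming.

```python
def fold_coverage(n_reads, mean_read_length, genome_size):
    """Fold coverage = total sequenced bases / genome length."""
    return n_reads * mean_read_length / genome_size


def implied_read_length(n_reads, coverage, genome_size):
    """Mean read length needed to reach `coverage` with `n_reads` reads."""
    return coverage * genome_size / n_reads


GENOME_SIZE_BP = 4_585_000   # SPA genome size from the text (4,585 kb)
N_READS = 92_648             # assembled sequence reads
COVERAGE = 8.7               # reported final fold coverage

# Implied mean usable read length, in base pairs (an inference, not a
# value given in the text).
mean_len = implied_read_length(N_READS, COVERAGE, GENOME_SIZE_BP)
print(f"implied mean read length: {mean_len:.0f} bp")
```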


Acedb was the primary annotation database. We predicted protein-coding genes using a three-tier approach. First, we predicted genes based on protein similarities using an unpublished method (Y. Shotland, unpublished data). Then we used GeneMark, an HMM-based ab initio gene predictor, to call genes in gaps between annotated genes. Finally, we used getorf, part of the EMBOSS sequence analysis package, to annotate open reading frames of at least 50 amino acids in length, not identified by GeneMark or protein homology. We compared the predicted genes with genes from Typhi CT18, Ty2, Typhimurium LT2 and E. coli K12. All differences in predictions of start codon, length, conservation in Enterobacteria and presence of motifs were adjusted for consistency. We searched for the annotated proteins in the protein family databases Pfam 13.0 and InterPro and the protein localization prediction package PSORT-B. We predicted transfer RNA genes with tRNAscan-SE and ribosomal RNA genes by similarity to other Enterobacteria rRNA genes.
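The third tier, scanning for open reading frames of at least 50 amino acids, can be illustrated with a small six-frame scanner. This is a simplified sketch and not the getorf implementation: it assumes ORFs run from the first ATG in a frame to the next in-frame stop codon, handles only unambiguous A/C/G/T bases, and ignores alternative start codons. All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (start, end) for ATG..stop ORFs in the three forward frames.

    end is exclusive and includes the stop codon; min_codons counts the
    encoded amino acids (the stop codon is not counted).
    """
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    yield start, pos + 3
                start = None


def find_orfs(seq, min_codons=50):
    """Return (strand, start, end) for ORFs on both strands.

    Coordinates are 0-based, end-exclusive, on the forward strand.
    """
    seq = seq.upper()
    n = len(seq)
    found = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    rc = reverse_complement(seq)
    found += [("-", n - e, n - s) for s, e in orfs_in_strand(rc, min_codons)]
    return found
```

In the pipeline described above, candidates of this kind would be kept only where they do not overlap genes already called by similarity or by GeneMark; that filtering step is not shown.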


We constructed and probed SPA microarrays and analyzed the data as previously described12,24.


Acedb, GeneMark, the EMBOSS sequence analysis package, Pfam 13.0, InterPro, the protein localization prediction package PSORT-B, tRNAscan-SE and the PHRAP assembly program are all publicly available.

Accession numbers.

GenBank: Typhi CT18, NC_003198; Ty2, NC_004631; Typhimurium LT2, NC_003197; E. coli K12, NC_000913; Paratyphi A ATCC 9150 sequence, CP000026; Y. pestis biovar Medievalis str. 91001 complete genome, NC_005810. GEO series number: GSE1500.

Note: Supplementary information is available on the Nature Genetics website.