Main

Francisella tularensis is one of the most infectious pathogens known and is the etiological agent of tularemia, a disease of humans and animals1. The vector-borne form of the disease (glandular or ulceroglandular tularemia) is usually contracted from the bite of an arthropod vector that previously fed on an infected animal1. Respiratory tularemia is less frequent and is usually contracted during farming activities that generate dust from sites where infected animals have resided. The mortality rate of respiratory tularemia may be as high as 5–30% without antibiotic therapy; even if not fatal, the disease may be severely incapacitating for a period of weeks or even months1. One outbreak of respiratory tularemia occurred in Martha's Vineyard in the US, probably triggered by the mechanical disruption of a rabbit carcass during lawn mowing. The aerosols that were generated infected two individuals, underscoring the highly infectious nature of the organism by the airborne route.

The infectious dose of F. tularensis in humans by the airborne route is as low as 10 cells1. Although the bacterium is nutritionally fastidious, it was developed as a weapon by Japanese Germ Warfare units during the 1930s and 1940s and later by the former Soviet Union and the US2. There is concern that bioweapons containing this bacterium still exist elsewhere in the world.

The high level of interest in F. tularensis and concerns over possible misuse contrast with the paucity of knowledge on virulence mechanisms. The bacterium infects macrophages1, and a few virulence determinants have been proposed, including an ill-defined capsule3 and a 23-kDa protein that seems to have a role in downregulating proinflammatory cytokines4. Some other genes required for growth in macrophages have been identified5,6,7, but their roles are uncertain. There is currently no licensed vaccine for the prevention of tularemia.

We report here the complete genome sequence and a phylogenetic analysis of a fully virulent human isolate of F. tularensis subspecies tularensis (strain SCHU S4). In the long term, this study will support work to devise improved countermeasures against tularemia.

Results

General features

The genome of F. tularensis strain SCHU S4 consists of a 1,892,819-bp circular chromosome, with an overall G+C content of 32.9% and 1,804 predicted coding sequences (CDSs; including pseudogenes). The low G+C content is typical of that found in small (0.9–2.0 Mb) bacterial genomes (range 25–40%). The overall features of the genome are given in Table 1. The origin of replication (ori) was identified with the aid of the strand-specific mutation bias (Fig. 1) and was flanked by genes also present at this position in other species, such as dnaA and rng.
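As an illustration of how a strand-specific composition bias can point to an origin of replication, the following minimal Python sketch computes G+C content and a windowed GC deviation, (G−C)/(G+C), and reports where the cumulative deviation reaches its minimum. It assumes the common convention that the leading strand is G-rich, so that the cumulative minimum lies near the origin. The function names, the window size and the toy sequence are hypothetical, and this is not the procedure used to produce Figure 1.

```python
from itertools import accumulate


def gc_content(seq: str) -> float:
    """Fraction of A/C/G/T bases that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def gc_skew_windows(seq: str, window: int) -> list:
    """GC deviation (G-C)/(G+C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


def cumulative_skew_minimum(seq: str, window: int) -> int:
    """0-based coordinate just after the window where cumulative skew is lowest.

    Assumption: the leading strand is G-rich, so this minimum is a
    candidate position for the origin of replication.
    """
    cumulative = list(accumulate(gc_skew_windows(seq, window)))
    idx = min(range(len(cumulative)), key=cumulative.__getitem__)
    return (idx + 1) * window


# Toy sequence (not real data): C-rich first half, G-rich second half.
toy = "ACCC" * 50 + "AGGG" * 50
print(gc_content(toy))                   # 0.75
print(cumulative_skew_minimum(toy, 20))  # 200
```

On a real chromosome the window would be far larger (thousands of bases), and the candidate position would still need supporting evidence such as the flanking genes mentioned above.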

Table 1 Overall features of the genome of F. tularensis strain SCHU S4
Figure 1: Circular map of the genome of F. tularensis strain SCHU S4.

The outer scale is marked in base pairs. Circles 1 and 2 (numbering from the outside in) show genes color-coded by function. Circles 3 and 4 show pseudogenes. Circles 5 and 6 show IS elements (ISFtu1, red; ISFtu2, cyan; ISFtu3, orange; ISFtu4, green; ISFtu5, gray; fragments of IS elements, black). The next 16 circles show the locations of genes with matches to L. pneumophila (unfinished genome ver. 12 December 2003), P. aeruginosa, V. cholerae, C. burnetii RSA 493, B. anthracis Ames, S. oneidensis, E. coli K12, H. influenzae, P. multocida, S. enterica serovar Typhi, S. enterica serovar Typhimurium LT2, X. axonopodis, X. campestris, Y. pestis, S. flexneri 2a and X. fastidiosa, respectively. Red marks the top hit, green shows the second-best hit and gray shows genes with sequence similarity less than 10⁻¹⁰. The innermost circles show G+C content (%; black) and GC deviation (G−C)/(G+C).

In total, 1,281 genes in F. tularensis SCHU S4 had homologs (E < 1 × 10⁻¹⁰) in one or more γ-proteobacterial genomes (Fig. 1). These were randomly distributed around the genome, with the exception of a duplicated region of 33.9 kb (nucleotides 1,374,371–1,408,281 and 1,767,715–1,801,625), which lacked homologs in 16 other γ-proteobacterial genomes (Fig. 1). In F. tularensis strain LVS, duplication of one of the genes in this region (iglC) has been reported, suggesting that this region is also duplicated in this strain7. The origin of the duplicated regions is not clear, because these genes do not show significant sequence homology with any other genes in GenBank. The genes encoding hypothetical proteins in these duplicated regions (Fig. 2) have a low G+C content (27.5%). But the G+C content of genes in the iglABCD operon and their codon usage are similar to those of other F. tularensis genes. In contrast to the genomic islands of other species, there are no flanking insertion elements or tRNA genes on both sides, although both copies are flanked on one side by rRNA operons and on the other by an ISFtu1 element. Mutation of some genes within the duplicated regions can be attenuating5,6,7; therefore, we believe that these regions are pathogenicity islands.
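The coordinates quoted for the two copies can be checked with simple arithmetic. The short Python sketch below treats both ranges as 1-based and inclusive (an assumption about the coordinate convention) and confirms that the copies have the same length, consistent with the stated 33.9 kb. The variable and function names are illustrative only.

```python
def inclusive_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive coordinate range."""
    return end - start + 1


# Coordinates of the two copies as quoted in the text.
copy_1 = (1_374_371, 1_408_281)
copy_2 = (1_767_715, 1_801_625)

len_1 = inclusive_length(*copy_1)
len_2 = inclusive_length(*copy_2)
offset = copy_2[0] - copy_1[0]  # distance between the two start positions

print(len_1, len_2)  # 33911 33911
print(offset)        # 393344
```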

Figure 2: The organization of the duplicated region in F. tularensis strain SCHU S4.

The leftmost scale shows G+C content. Blue indicates RNA coding regions; green indicates open reading frames encoding hypothetical proteins; brown indicates pseudogenes; and pink indicates IS elements. Open reading frame labels refer to the corresponding annotated gene or FTT number in the genome sequence of SCHU S4.

We clustered the proteins predicted to be encoded by the SCHU S4 genome into protein families using the TribeMCL method8 and identified 61 clusters with more than two members. The largest cluster with exclusively hypothetical proteins that we identified contained five members (FTT0025, FTT0267, FTT0602, FTT0918 and FTT0919). A Hidden Markov Model constructed using HMMER9 failed to identify any distant homologs when searched against the SwissProt or TrEmbl databases. BLAST searches against the NCBI nr or nt databases and the Sanger Centre Pfam protein family database also did not identify any significant hits (E < 1 × 10⁻⁶). Therefore, this cluster represents a new protein family. Three of the proteins (FTT0025, FTT0918 and FTT0919) were predicted to contain both signal peptides and coiled-coil domains, whereas a signal peptide only was predicted for FTT0267 and a coiled-coil domain only was predicted for FTT0602. Our analysis for motifs, which might indicate a possible function, and a range of bioinformatics tools did not identify any significant associations. Therefore, the functional importance of these proteins remains to be elucidated experimentally.

Two types of IS elements (ISFtu1 and ISFtu2) were previously identified in F. tularensis10. Our analysis identified 50 copies of ISFtu1, a transposon that belongs to the IS630 Tc-1 mariner family. Tc-1 mariner elements are generally found in eukaryotes and have been reported in a range of invertebrates such as nematodes and insects. The presence of this element in a bacterium is unusual. F. tularensis is often transmitted by infected insect vectors; the IS630 element that we identified may have been acquired originally from an insect. One copy of ISFtu1 is located in the O-antigen cluster, like the IS630 element found in the O-antigen cluster of Shigella sonnei. In S. sonnei, this element has a key role in the stable expression of form 1 of the O antigen, which is essential for virulence11. The IS630 element in the F. tularensis O-antigen cluster may have a similar function.

Sixteen copies of ISFtu2, an IS5 family element, were present. The genome also contained three types of IS element previously unreported in F. tularensis: ISFtu3 (two complete copies and one fragment), ISFtu4 (one copy) and ISFtu5 (one copy). ISFtu3 has homology to ISHpaI-IS1016 elements, and ISFtu4 and ISFtu5 belong to the IS families IS982 and IS4, respectively. Also present are three IS element fragments, which share homology with ISHpaI-IS1016 elements. These fragments possess terminal inverted repeat sequences not previously reported and therefore represent a new type of IS element.

Most members of the IS630 and Tc-1 mariner family of insertion sequences possess a single open reading frame12, but translation of the ISFtu1 CDS requires ribosomal frameshifting. In ISFtu1, the first aspartic acid residue of the DDE triad, which is essential for transposition, is generated only after a frameshift13. The programmed ribosomal frameshifting motif in ISFtu1 may be used to control the transposition rate of this element.

More than 10% of the CDSs in the SCHU S4 genome are pseudogenes or gene fragments that may have become fixed as the result of a recent evolutionary bottleneck (Supplementary Table 1 online). The proportion of pseudogenes due to disruption by IS elements (14%) is broadly similar to the proportion in other pathogens such as Yersinia pestis (34%), Leifsonia xyli (5.5%) and Bordetella pertussis (20.6%). Most of the pseudogenes were found among genes uniquely present in F. tularensis, hypothetically conserved or encoding proteins involved in transport, DNA metabolism or amino acid biosynthesis (Fig. 3). In agreement with this observation, pseudogenes in these categories were also overrepresented in F. tularensis compared with other bacteria having more than 50 pseudogenes (Supplementary Fig. 1 online).

Figure 3: Percentage of total F. tularensis strain SCHU S4 CDSs (black) and pseudogenes or fragments (gray) attributed by predicted biological function.

Phylogeny

Francisella is the only genus of the family Francisellaceae, which belongs to the γ subclass of proteobacteria. The Francisellaceae have no close pathogenic relatives, as inferred from sequence similarity or 16S rRNA phylogenies14. Instead, 16S rRNA data suggest that F. tularensis is a sister clade to arthropod endosymbionts such as Wolbachia persica14 and is only distantly related to the human pathogens Coxiella burnetii and Legionella. This suggestion is supported by a phylogenomic analysis of more than 200 genes with homologs in F. tularensis and 15 other γ-proteobacterial genomes, as previously undertaken for a smaller set of γ-proteobacterial genomes15. More than 40% of the single gene trees suggest with strong support (>75%) that F. tularensis is the most deeply diverging lineage among the 16 γ-proteobacterial species examined here. This is also shown in a tree derived from a concatenated alignment of ten proteins (Fig. 4). The C. burnetii genome is 1.9 Mbp, with 2,134 predicted CDSs16. Coxiella, Legionella and Francisella are γ-proteobacterial pathogens with many lifestyle similarities. But they are not sister clades, and their deep divergences explain the lack of overall gene-order conservation among the three genomes. Therefore, although these pathogens have similar lifestyles and their similar genome sizes seem to reflect this similarity, their positions in the phylogenetic tree suggest that they experienced independent, convergent evolution.

Figure 4: Phylogenetic relationship of 16 γ-proteobacterial species inferred from a concatenated alignment of the proteins encoded by dnaA, ftsA, mfd, mraY, murB, murC, parC, recA, recG and rpoC.

B. anthracis was used as the outgroup. The topology, branch lengths and bootstrap support are according to the reconstruction with the neighbor-joining method. Values at nodes are bootstrap support values for the neighbor-joining and maximum parsimony methods (in that order).

Predicted metabolic pathways and growth requirements

We identified genes encoding 350 enzymes involved in small-molecule metabolism in the annotated genome. We inferred 429 distinct enzymatic reactions to be catalyzed by these enzymes and predicted 155 small-molecule metabolic pathways to be present (Supplementary Table 2 online). Pathway predictions and information parsed from the annotated genome were output as a pathway-genome database called FrantCyc. Each predicted pathway, P, was assigned a score X/Y/Z: P consists of X reactions; enzymes for Y reactions were identified in the genome; and Z of the Y reactions are used in other predicted pathways. We inserted all pathways for which Y was nonzero into FrantCyc. In total 1,105 operons were predicted using the Pathway Tools operon predictor and inserted into FrantCyc.
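The X/Y/Z score described above can be made concrete with a small Python sketch. It assumes that a pathway is represented as a tuple of reaction identifiers and that the set of reactions with an identified enzyme is already known. The class, function and reaction names are hypothetical and do not reflect the internal data model of Pathway Tools.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Pathway:
    name: str
    reactions: Tuple[str, ...]  # reaction identifiers, unique within a pathway


def score_pathway(
    pathway: Pathway,
    all_pathways: Iterable[Pathway],
    reactions_with_enzyme: Set[str],
) -> Tuple[int, int, int]:
    """Return (X, Y, Z) for one predicted pathway.

    X: number of reactions in the pathway.
    Y: reactions in the pathway with an enzyme identified in the genome.
    Z: how many of those Y reactions also occur in another pathway.
    """
    x = len(pathway.reactions)
    found = [r for r in pathway.reactions if r in reactions_with_enzyme]
    y = len(found)
    used_elsewhere = {
        r
        for other in all_pathways
        if other.name != pathway.name
        for r in other.reactions
    }
    z = sum(1 for r in found if r in used_elsewhere)
    return x, y, z


# Toy example with made-up reaction identifiers.
p_a = Pathway("A", ("r1", "r2", "r3"))
p_b = Pathway("B", ("r3", "r4"))
enzymes = {"r1", "r3"}

print(score_pathway(p_a, [p_a, p_b], enzymes))  # (3, 2, 1)
print(score_pathway(p_b, [p_a, p_b], enzymes))  # (2, 1, 1)
```

In this toy case, pathway A has one reaction without an assigned enzyme (r2), and one of its two supported reactions (r3) is shared with pathway B, so it provides weaker evidence that A itself is present.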

Overall, we identified 390 pathway holes in 137 predicted pathways, corresponding to 54% of the reactions involved in the predicted metabolic pathway network of F. tularensis. This percentage is higher than we have observed in other bacteria17 and is consistent with the proposal that the F. tularensis genome is in an advanced state of decay. But we cannot exclude the possibility that because of the relative phylogenetic isolation of this bacterium, some of the pathway holes are filled by divergent orthologs that have not been identified. Pathway holes were input to a program that identifies candidate genes with functions corresponding to each pathway hole17. We evaluated each candidate gene manually, and for those deemed sufficiently reliable, we assigned new gene functions to reflect the function postulated by the program (Supplementary Note online). Application of this algorithm to a complete genome has not been published previously to our knowledge; in this case, it resulted in the identification of high-probability putative functions for 74 genes whose functions were not identified by classical sequence analysis. We then rescored pathways and removed from the pathway-genome database any pathways for which Y/X was less than 1/3. Figure 5 shows one part of the entire predicted metabolic map of F. tularensis, with the unfilled pathway holes highlighted.
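The pruning rule stated above (remove pathways for which Y/X is less than 1/3) and the notion of a pathway hole (a reaction with no assigned enzyme, that is, X − Y per pathway) can be sketched in a few lines of Python. Exact fractions are used so that a pathway sitting exactly at 1/3 is retained, as the wording implies. The pathway names and scores are invented for illustration.

```python
from fractions import Fraction
from typing import Dict, Tuple

THRESHOLD = Fraction(1, 3)


def keep_pathway(x: int, y: int, threshold: Fraction = THRESHOLD) -> bool:
    """Keep a pathway unless Y/X is strictly below the threshold."""
    return x > 0 and Fraction(y, x) >= threshold


def prune(scores: Dict[str, Tuple[int, int]]) -> Dict[str, Tuple[int, int]]:
    """Drop pathways whose supported fraction Y/X is below 1/3."""
    return {name: (x, y) for name, (x, y) in scores.items() if keep_pathway(x, y)}


def count_holes(scores: Dict[str, Tuple[int, int]]) -> int:
    """Total pathway holes: reactions without an assigned enzyme."""
    return sum(x - y for x, y in scores.values())


# Made-up (X, Y) scores for three hypothetical pathways.
scores = {"P1": (6, 2), "P2": (7, 2), "P3": (4, 4)}
retained = prune(scores)

print(sorted(retained))       # ['P1', 'P3']
print(count_holes(retained))  # 4
```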

Figure 5: A portion of the FrantCyc cellular overview for F. tularensis, showing a region of the predicted metabolic map of the organism.

Each line is a single metabolic reaction, and each node is a single metabolite. Green lines indicate reactions that have no enzyme assigned in the genome (pathway holes); blue lines indicate reactions that do have an assigned enzyme. Upward triangle, amino acid; square, carbohydrate; diamond, protein; vertical oval, purine; horizontal oval, pyrimidine; downward triangle, cofactor; T, tRNA; open circle, other; shaded symbol, phosphorylated.

A growth medium consisting of 14 essential compounds (Table 2) was developed to support the growth of avirulent F. tularensis strain 176 (ref. 18) and was also reported to support growth of SCHU S4 (ref. 19). F. tularensis strain SCHU S4 also has a requirement for cysteine20, which seems to be due to a nonfunctional pathway for sulfate assimilation resulting from a pseudogene (missing start codon) encoding adenylylsulfate kinase (EC 2.7.1.25). It remains to be determined experimentally how many of the other 13 essential compounds18 are absolutely required for growth. Our analysis indicated that biosynthetic pathways were present for 7 of these 13 compounds. The pathways for sulfate assimilation, threonine biosynthesis, valine biosynthesis and isoleucine biosynthesis seemed to be incomplete (i.e., they contained pathway holes). The available evidence does not indicate whether the enzymes for these pathway holes are truly missing from the genome, thus inactivating the pathway, or whether the enzymes are present but the activity of these pathways is too low to support growth. We found genomic evidence for loss of biosynthetic capabilities for valine, isoleucine and threonine, indicating that these amino acids are required for growth. Specifically, we identified a pseudogene encoding homoserine kinase that mapped to the one step missing from the predicted pathway for threonine biosynthesis and a pseudogene encoding the large subunit of acetolactate synthase that mapped to the one step missing from the biosynthetic pathways for both valine and isoleucine. This loss of biosynthetic capacity may have followed a change of evolutionary niche (such as a move from a free-living organism to one or more specific host cells) that resulted in these amino acids being readily available in the environment. 
It remains to be determined whether one or more of the other four compounds for which pathways were predicted to be present (i.e., serine, aspartic acid, leucine and proline) is absolutely required for growth. Conversely, the other compounds for which we found little or no evidence of a biosynthetic pathway have probably always been readily available to the organism across a diverse range of evolutionary niches, such that the corresponding pathways were never required by the organism. These compounds are probably required for growth.

Table 2 A pathway-genome view of the 14 compounds supporting growth of F. tularensis

We correctly predicted functional biosynthetic pathways for all seven nonessential amino acids (alanine, asparagine, glutamate, glutamine, glycine, phenylalanine and tryptophan; Supplementary Table 3 online). Because the genomic evidence for these pathways (fraction of enzymes present) was comparable to that observed for the 'false positive' pathways above, we infer that the predominant mechanism of gene-function inactivation rendering biosynthetic pathways inactive or insufficiently active involved relatively small sequence changes, such as one or more point mutations. Biosynthetic pathways for several polyamines, including putrescine and spermine, were also disrupted. This is consistent with the observations that F. tularensis is unable to survive under hypotonic conditions21 and that osmotolerance can be attained by addition of micromolar amounts of putrescine and spermine21.

Candidate mechanisms of virulence

Little is known about the virulence mechanisms of F. tularensis, but growth in macrophages is central to the ability of F. tularensis to cause disease. Mutation of the genes iglA, iglC or pdpD in the 33.9-kb duplicated region reduces the ability of F. tularensis to survive in amoebae or macrophages and is attenuating5,6,7. These genes, and others in this region, are regulated by the transcriptional regulator MglA6. The precise functions of these genes are not known; the gene products do not show sufficient homology with any other genes in GenBank to infer their functions. Therefore, undiscovered mechanisms of virulence are probably encoded in the 33.9-kb pathogenicity island in F. tularensis. Within the macrophage, the bacterium can degrade the phagosomal membrane and escape into the cytosol22. We identified genes encoding a phospholipase C acpA (FTT0221) and a phospholipase D family protein (FTT0490), which may have a role in this process. FTT1043 encodes a macrophage infectivity potentiator protein previously found to confer virulence for several pathogens, including Legionella pneumophila23. We also identified a homolog of mce, involved in entry of Mycobacterium tuberculosis into host cells24, in the SCHU S4 genome sequence.

When F. tularensis is cultured in acidified medium, the pH of the medium increases25, reportedly owing to the generation of ammonia18. The generation of ammonia, and subsequent buffering of the endosomal compartment, may allow pathogens to survive in macrophages26. Deaminases such as L-glutaminase, L-asparaginase and citrulline ureidase, which could be responsible for ammonia generation, have previously been reported in F. tularensis27. In addition, citrulline ureidase activity is used to differentiate strains with high virulence (subspecies tularensis) from strains with low virulence (subspecies holarctica)20, and low levels of glutaminase activity have been associated with low virulence27. We identified several genes in the SCHU S4 genome that could have a role in ammonia production. In addition to genes potentially encoding an L-asparaginase (FTT0591) and an L-glutaminase (FTT0195), we also identified an operon predicted to encode a peptidyl-arginine deaminase (FTT0434) and a candidate gene encoding citrulline ureidase (FTT0435). The latter seems to encode a carbon-nitrogen hydrolase family protein and possesses a Pfam (PF00795) motif, indicating that it is an enzyme capable of reducing organic nitrogen compounds and producing ammonia28.

Type I secretion systems transport substances across the bacterial envelope using transporters containing ATP-binding cassettes. The genome of SCHU S4 is predicted to contain 15 potentially functional ATP-binding cassette systems (H. Garmory, personal communication). We did not identify gene clusters encoding type III, type IV or type V export systems, but we did identify some candidate cell surface–located virulence factors. The presence of pili on the surface of F. tularensis has been reported29, and we identified all currently known genes necessary for type IV pili biosynthesis. The exact role of type IV pili in Francisella is not yet known, but in other bacteria, they contribute to virulence. The makeup of the poorly characterized capsule surrounding F. tularensis is not known, but we identified a gene cluster (FTT0789–FTT0801) that could encode a polysaccharide additional to the lipopolysaccharide O antigen. We also identified homologs of the genes capB (FTT0805) and capC (FTT0806) required for capsule biosynthesis in Bacillus anthracis. Therefore, the capsule of F. tularensis might contain poly-D-glutamic acid.

Virulence and iron acquisition

The ability of the bacterium to acquire iron in the phagosome seems to be crucial for virulence of F. tularensis30, and growth under iron-limited conditions results in changes in the composition of the cell envelope31. For many microorganisms, the ferric uptake regulator (Fur) has a key role in modulating iron uptake, and the genome of F. tularensis strain SCHU S4 is predicted to encode a Fur protein (FTT0030). We also identified a number of genes that may be regulated by Fur, including ftnA, fumB, acnA, sodB and an ortholog of iraB (FTT0651), which is associated with iron uptake in L. pneumophila32. A gene (frgA; FTT0029) belonging to a family of hydroxamate-siderophore synthetic genes33 was located downstream of fur, and a putative iron-box was found in the promoter region. Recent papers have described the ability of F. tularensis to escape the phagosome34,35. In the cytoplasmic environment, iron is highly insoluble, and a TonB-dependent system for complex-bound iron uptake would be expected. Although a low–molecular weight iron-binding compound, growth-initiating substance, has been reported to be secreted by F. tularensis36, the evidence does not indicate growth-initiating substance to be a hydroxamate siderophore37. No genes encoding TonB; outer membrane uptake receptors for ferric siderophore-complexes; or receptors for transferrin, lactoferrin, heme, hemoglobin or hemopexin were found in the genome.

Discussion

Pathogens are frequently thought to evolve by acquiring DNA fragments encoding virulence determinants. But an emerging theme in genome biology is that several pathogens that cause severe disease have evolved by losing genetic information instead. Different genome sequences provide different snapshots of this process of evolution. For example, Y. pestis seems to be at the very early stages of evolution and has both lost and acquired genes during this process38. Mycobacterium leprae39 and Rickettsia prowazekii40 seem to have evolved solely by gene loss from a progenitor species39. The genome sequence of F. tularensis SCHU S4 shows extensive inactivation of genes and a duplicated region that is strongly implicated in virulence and may be a pathogenicity island. The origins of the pathogenicity islands are not known, and the function of the genes in this region cannot be inferred on the basis of sequence homology with gene products of known functions. This finding raises the possibility that new mechanisms of virulence operate in F. tularensis. SCHU S4 lacks coding potential for several expected features. The ability to import complexed ferric (Fe3+) iron should be important for Francisella, because it can escape from the phagosome22,35, thereby losing its access to soluble iron present in the acidic milieu. But no previously known ferric iron uptake systems were found in the genome sequence.

F. tularensis is considered one of the microorganisms most likely to be used as a biological warfare or bioterrorism agent, but there is a paucity of information on the biochemical makeup of the organism and mechanisms of virulence. In part, this lack of information is a consequence of the difficulties associated with working with highly virulent strains. The complete genome sequence of F. tularensis strain SCHU S4 is a key advance in our understanding of this pathogen and will fuel future work to devise defensive countermeasures against this potential biological warfare and bioterrorism agent.

Methods

F. tularensis subspecies tularensis strain SCHU S4 was derived from an isolate from a case of human tularemia in the US41. A clonal seedstock of the bacterium has a median lethal dose in the murine model of disease of less than 1 colony-forming unit42. We isolated DNA from a culture derived from this seedstock. We constructed plasmid libraries from randomly sheared DNA in pUC18 or pUC19 with insert sizes of 1–2 kb or 2–4 kb, respectively. We also constructed five libraries with insert sizes of 1–4 kb using the TOPO Shotgun subcloning kit (Invitrogen), from nebulized DNA, and one M13 library with insert sizes of 1–2 kb using the double adaptor method43 from nebulized DNA. We carried out DNA sequencing and assembly as previously described44. We produced a total of 32,743 sequence reads, resulting in an overall genomic coverage of 12.9×. For finishing and gap closure, we used PCR, multiplexed combinatorial PCR, single primed PCR and pulsed-field gel electrophoresis.
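The relationship between read count, genome size and coverage can be written as coverage = (number of reads × mean read length) / genome length. The Python sketch below rearranges this to show the mean usable read length implied by the figures quoted here and in the Results (32,743 reads, 12.9-fold coverage, 1,892,819 bp). The implied read length is a derived estimate under that simple formula, not a value reported by the authors.

```python
GENOME_BP = 1_892_819   # chromosome length from the Results
READS = 32_743          # total sequence reads
COVERAGE = 12.9         # reported fold coverage


def implied_mean_read_length(genome_bp: int, reads: int, coverage: float) -> float:
    """Mean read length (bp) implied by coverage = reads * length / genome."""
    return coverage * genome_bp / reads


def coverage_from_reads(genome_bp: int, reads: int, mean_read_bp: float) -> float:
    """Fold coverage given read count and mean read length."""
    return reads * mean_read_bp / genome_bp


mean_len = implied_mean_read_length(GENOME_BP, READS, COVERAGE)
print(round(mean_len, 1))  # 745.7
print(round(coverage_from_reads(GENOME_BP, READS, mean_len), 1))  # 12.9
```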

Gene prediction was done using Glimmer45. We carried out annotation and curation, facilitated by Artemis, as described previously39 and checked them manually. We identified protein motifs using CONSENSUS and SMART. We used Pathway Tools software to determine metabolic pathways46, operons47 and pathway hole fillers17. Pathway holes are reactions in a pathway for which no catalyzing enzymes were identified in the genome annotation. The algorithm for identifying candidate genes for each pathway hole involves querying the public protein sequence databases for proteins in other organisms that are known to catalyze the reaction associated with each pathway hole, BLAST searching these sequences against all F. tularensis open reading frames and then scoring each matching gene using a Bayesian network that integrates several types of evidence48. For example, the Bayesian network will score a given candidate gene higher if multiple query sequences show similarity to it and if the candidate is adjacent to, or in the same directon as, another gene in the same pathway (a directon is a contiguous group of genes transcribed in the same direction). This algorithm differs from previous work49,50 in that it is completely automated, it applies a reverse BLAST search with increased sensitivity compared with other methods and it computes a probability value for each candidate (validated through cross-validation studies) that allows ranking of the candidates.
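To give a feel for how several kinds of evidence can be combined into a single probability for ranking candidates, the following Python sketch uses a simple naive-Bayes style odds update. This is a deliberately simplified stand-in and not the Bayesian network used by the pathway hole filler. The prior odds, the likelihood ratios, the field names and the gene names are all hypothetical values chosen for illustration. Here a directon means a contiguous run of genes transcribed in the same direction.

```python
import math
from dataclasses import dataclass
from typing import List

# Hypothetical parameters; real values would be learned from training data.
PRIOR_ODDS = 0.05         # odds that a random BLAST match fills the hole
LR_PER_EXTRA_HIT = 2.0    # each additional matching query sequence
LR_ADJACENT = 4.0         # candidate is adjacent to a gene in the pathway
LR_SAME_DIRECTON = 3.0    # candidate shares a directon with a pathway gene


@dataclass
class Candidate:
    gene: str
    n_query_hits: int
    adjacent_to_pathway_gene: bool
    same_directon_as_pathway_gene: bool


def posterior_probability(c: Candidate) -> float:
    """Combine independent evidence as likelihood ratios on the prior odds."""
    log_odds = math.log(PRIOR_ODDS)
    log_odds += max(c.n_query_hits - 1, 0) * math.log(LR_PER_EXTRA_HIT)
    if c.adjacent_to_pathway_gene:
        log_odds += math.log(LR_ADJACENT)
    if c.same_directon_as_pathway_gene:
        log_odds += math.log(LR_SAME_DIRECTON)
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates from most to least probable."""
    return sorted(candidates, key=posterior_probability, reverse=True)


# Made-up candidates for a single pathway hole.
candidates = [
    Candidate("geneA", 1, False, False),
    Candidate("geneB", 3, True, True),
    Candidate("geneC", 2, False, True),
]
for c in rank_candidates(candidates):
    print(c.gene, round(posterior_probability(c), 3))
```

The independence assumption behind multiplying likelihood ratios is exactly what a full Bayesian network relaxes, so the numbers here illustrate only the direction of each effect: more supporting query sequences and a favorable genomic context raise a candidate's rank.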

URLs.

CONSENSUS is available at http://npsa-pbil.ibcp.fr/. SMART is available at http://smart.embl-heidelberg.de/. A version of Figure 5 that can be explored interactively is available at http://biocyc.org/FRANT/new-image?type=OVERVIEW. A pathway-genome database that describes the F. tularensis chromosome; its genes and their predicted operons; the product of each gene; the biochemical reaction(s), if any, catalyzed by each gene product; the substrates of each reaction; and the predicted organization of those reactions into small-molecule metabolic pathways is available at http://biocyc.org/server.html. Also available at this site (http://biocyc.org/FRANT/pathologic-index.html) is a complete listing of all predicted F. tularensis pathways and their corresponding evidence scores.

GenBank accession numbers.

Genome sequences of F. tularensis subspecies tularensis strain SCHU S4, AJ749949; Pseudomonas aeruginosa, NC_002516; Vibrio cholerae, NC_002505 and NC_002506; C. burnetii RSA 493, NC_002971; B. anthracis Ames, NC_003997; Shewanella oneidensis, NC_004347 and NC_004349; Escherichia coli K12, NC_000913; Haemophilus influenzae, NC_000907; Pasteurella multocida, NC_002663; Salmonella enterica serovar Typhi, NC_003198, NC_003384 and NC_003385; Salmonella enterica serovar Typhimurium LT2, NC_003197 and NC_003277; Xanthomonas axonopodis, NC_003919; Xanthomonas campestris, NC_003902; Y. pestis, NC_003131, NC_003134 and NC_003143; Shigella flexneri 2a, NC_004337; and Xylella fastidiosa, NC_002488, NC_002489 and NC_002490.

Note: Supplementary information is available on the Nature Genetics website.