Spirochaetes, morphologically unique in their coiled, slender and flexuous shape and related form of motility, form a major phylogenetic lineage (phylum) of eubacteria. Leptospira, an obligately aerobic, tightly coiled spirochaete, is the only genus other than Borrelia, Treponema and Brachyspira that is able to cause significant infection in mammals. The leptospires are physiologically chemoheterotrophic. They include the saprophytic L. biflexa and the pathogenic L. interrogans. The latter is known worldwide to be responsible for the water-borne zoonosis leptospirosis. Although antibiotic therapy is effective against the disease, it remains a serious threat in tropical and subtropical countries, as well as in cities where substandard sanitation and poor sewage disposal allow wild rats to serve as reservoirs1.

Molecular and cellular studies on leptospires5 have focused on their dynamics of motility, biosynthesis of amino acids and lipopolysaccharide (LPS), outer-membrane proteins and other potential virulence factors. In contrast to L. biflexa, little in the way of genetic analysis has been reported for L. interrogans, owing to their fastidious cultivation requirements and the lack of genetic systems5. Previously, the genomic sequences of two pathogenic spirochaetes—T. pallidum, responsible for syphilis3, and B. burgdorferi, responsible for Lyme disease4—have been determined. We employed the whole-genome random sequencing method3,4,6 to sequence and analyse the genomic DNA of a representative virulent serovar type strain (Lai)2 of L. interrogans serogroup Icterohaemorrhagiae (see Methods).

The L. interrogans genome (4,691,184 base pairs (bp); Fig. 1, Table 1) is much larger than the genomes of the other two sequenced spirochaetes (1,138,006 bp for T. pallidum and 1,519,857 bp for B. burgdorferi, including plasmids). It consists of two circular chromosomes, a large one of 4,332,241 bp (CI) and a small one of 358,943 bp (CII), in good agreement with previous estimates5. More than 30 copies of repetitive DNA elements, including members of the IS1500 and IS1501 families, were distributed throughout the genome, but few phage-related sequences were identified.

Figure 1: Circular representation of the L. interrogans strain Lai genome, with predicted CDSs.

a, Large chromosome (CI); b, small chromosome (CII). The outer scale is shown in kilobases. Circles range from 1 (outer circle) to 6 (inner circle) for CI and from I (outer circle) to IV (inner circle) for CII. Circles 1/I and 2/II, genes on forward and reverse strand; circle 3, tRNA genes; circle 4, rRNA genes; circles 5/III, GC bias ((G - C)/(G + C); red indicates values > 0; green indicates values < 0); circles 6/IV, G + C content. All genes are colour-coded according to functions: orange for amino acid biosynthesis, green for purines, pyrimidines, nucleosides and nucleotides, blue for fatty acid and phospholipid metabolism, magenta for biosynthesis of cofactors, prosthetic groups and carriers, khaki for central intermediary metabolism, cyan for energy metabolism, orchid for transport and binding proteins, yellow for DNA metabolism, dark green for transcription, brown for protein synthesis, red for protein fate, green–yellow for regulatory functions, pink for cell envelope, salmon for cellular processes, navy for other categories, light grey for conserved, dim grey for hypothetical, slate grey for unknown function protein, and black for tRNA and rRNA.

Table 1 General features of the L. interrogans chromosomes

Both GC nucleotide skew ((G - C)/(G + C)) analysis and comparisons with the ori sequences of other bacteria were employed to locate the replication origin of CI, whereas only GC nucleotide skew analysis was used to identify a putative replication origin on CII, as with Vibrio cholerae6. DnaA boxes were identified on the anticlockwise side of oris for both CI and CII. In addition, parAB operons were identified on each side of the putative replication origins of both chromosomes (Supplementary Information 1).
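The GC skew statistic used above, (G - C)/(G + C), can be illustrated with a minimal Python sketch. The window size, step and the use of the cumulative-skew minimum as an origin estimate are assumptions made here for illustration (the text does not state the parameters or the exact procedure used), and the function names are hypothetical. The sketch also treats the sequence as linear, whereas a real analysis of a circular chromosome would need to handle wrap-around.

```python
from typing import List, Tuple


def gc_skew_windows(seq: str, window: int, step: int) -> List[Tuple[int, float]]:
    """Return (window_start, (G - C)/(G + C)) for each window along seq."""
    seq = seq.upper()
    out = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Windows without any G or C get a skew of 0 to avoid division by zero.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        out.append((start, skew))
    return out


def cumulative_skew_minimum(seq: str, window: int, step: int) -> int:
    """Return the end coordinate of the window at which cumulative skew is lowest.

    Assumption: in many bacterial chromosomes the cumulative GC skew reaches
    its minimum near the replication origin; this is a heuristic, not a proof.
    """
    total = 0.0
    best_pos, best_val = 0, float("inf")
    for start, skew in gc_skew_windows(seq, window, step):
        total += skew
        if total < best_val:
            best_val, best_pos = total, start + window
    return best_pos


if __name__ == "__main__":
    # Toy sequence: a C-rich half followed by a G-rich half.
    toy = "C" * 50 + "G" * 50
    print(cumulative_skew_minimum(toy, window=10, step=10))
```

On the toy sequence the skew switches sign at position 50, which is where the cumulative minimum falls. A candidate origin found this way would still need supporting evidence, such as the DnaA boxes and parAB operons mentioned above.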

In all, 4,768 putative genes were predicted, among them 37 genes for transfer RNAs (Supplementary Information 2-1). Previous reports5 indicated that in strains Ictero No. 1, Verdun and RZ11 of L. interrogans, there were two sets each of genes encoding 16S ribosomal RNA (rrs) and 23S rRNA (rrl). However, besides the two rrs genes, we identified only one gene each encoding 5S (rrf) and 23S rRNAs respectively. The extraordinarily low number of tRNA and rRNA genes might well account for the fastidious growth of L. interrogans.

Among the 4,727 protein-coding sequences (CDSs), 4,360 lie on CI and 367 lie on CII, whereas all of the rRNA and tRNA genes were found on CI (Table 1). Although most of the genes required for growth and viability are located on CI, some essential genes lie on CII. Besides the previously recognized metF7 (LB002) and asd5 (LB355), CII notably carries an ndh gene (LB036), encoding NADH dehydrogenase, and clusters of genes involved in a nearly complete pathway for the de novo biosynthesis of haem. These data, therefore, tend to support the view that CII is an authentic part of the genome that did not originate by lateral transfer.

On the basis of amino acid sequence similarity searches and/or domain analysis, biological functions have been assigned to about 44% of the CDSs (2,060), whereas 15% of the CDSs (715) either encode proteins of unknown function or are similar to unassigned CDSs predicted in other organisms. A total of 1,952 predicted CDSs (41%) failed to exhibit obvious similarity to any protein-coding genes of other organisms (Table 1). In particular, only 315 orthologues were shared by L. interrogans, T. pallidum and B. burgdorferi (Supplementary Information 3).
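The gene counts quoted in the text are mutually consistent, which the following short Python check makes explicit. All numbers are taken from the text; the variable names are hypothetical, and rounding to whole percentages is an assumption about how the reported figures were obtained.

```python
# Counts as reported in the text.
cds_ci, cds_cii = 4360, 367          # protein-coding sequences on CI and CII
trna = 37                            # tRNA genes
rrs, rrl, rrf = 2, 1, 1              # 16S, 23S and 5S rRNA genes

total_cds = cds_ci + cds_cii
total_genes = total_cds + trna + rrs + rrl + rrf

# Functional categories of the CDSs.
categories = {
    "assigned function": 2060,
    "unknown function or similar to unassigned CDSs": 715,
    "no obvious similarity": 1952,
}
percentages = {name: round(100 * n / total_cds) for name, n in categories.items()}

if __name__ == "__main__":
    print(total_cds, total_genes)
    for name, pct in percentages.items():
        print(f"{name}: {pct}%")
```

The three categories sum to the 4,727 CDSs, and adding the tRNA and rRNA genes gives the 4,768 putative genes reported.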

Some of the previously identified metabolic characteristics of leptospires, such as the absence of hexokinase1, were confirmed by genomic analysis. A complete set of genes for a system of long-chain fatty-acid utilization, a tricarboxylic acid cycle and a respiratory electron transport chain was identified in L. interrogans; this was consistent with the notion that the organism generates ATP by oxidative phosphorylation (Fig. 2). In contrast, none of the aforementioned genes are present in T. pallidum or B. burgdorferi, in which ATP can be generated only by sugar fermentation by means of the Embden–Meyerhof pathway5. Because L. interrogans cannot utilize sugars as carbon sources, anaplerotic reactions are essential for gluconeogenesis. We failed to identify genes encoding glucose-6-phosphate dehydrogenase, one of the key enzymes of the phosphogluconate pathway. Neither of the two key enzymes of the glyoxylate pathway, isocitrate lyase and malate synthase, was present, although these two enzymes were detected in L. biflexa1. However, we did identify all the genes encoding enzymes for gluconeogenesis from glycerol (Fig. 2), including phosphoglucose isomerase, as previously reported1. In addition, genes encoding enzymes likely to be involved in the oxidative carboxylation of acetyl-CoA to succinyl-CoA through the 3-hydroxypropionate pathway8 were recognized (Fig. 2). Intermediates of carbohydrate metabolism are therefore likely to be synthesized by means of the tricarboxylic acid cycle and the non-oxidative pentose phosphate pathway (Fig. 2). Genes encoding transhydrogenase (pntA and pntB) were identified. These enzymes could catalyse the formation of sufficient NADPH for anabolic processes at the cost of protonmotive force generated by an NADH dehydrogenase complex (Fig. 2). In this connection, one should emphasize that glycerol, together with the long-chain fatty acids, is present in EMJH medium (Johnson and Harris modification of the Ellinghausen and McCullough medium)1 for better growth of L. interrogans.

Figure 2: Overview of selected metabolic pathways and morphological components in L. interrogans strain Lai.

Only pathways related to energy production, biosynthesis of carbon skeletons and certain amino acids (methionine and isoleucine) are shown in detail. In addition, important components of transport systems as well as chemotaxis and motility systems are illustrated. The only two endoflagella are located between the outer membrane sheath (blue) and the cell wall (yellow)/cytoplasmic membrane (blue). At each end of the protoplasmic cell cylinder, a single periplasmic flagellum extends towards the centre of the cell with no overlap between them. Key metabolic enzymes and other related functional proteins are labelled according to their corresponding genes with their CDS numbers listed in Supplementary Information 2-2.

In contrast to B. burgdorferi and T. pallidum, L. interrogans encodes complete metabolic systems for amino acid and nucleotide biosynthesis, which is in agreement with previous work1. Methionine biosynthesis in leptospires is similar to that in yeast1, whereas the final step seems to be catalysed by a B12-dependent homocysteine-N5-methyltetrahydrofolate transmethylase, encoded by metH, rather than by a cobalamin-independent methionine synthase encoded by metE (Fig. 2). In this connection, the absence of several genes of B12 biosynthesis from the L. interrogans genome accounts for the fact that this compound is an essential component of the EMJH semi-synthetic medium1. It was proposed that a pyruvate pathway might be used by leptospires for isoleucine biosynthesis, either alone or together with the conventional threonine deaminase pathway1. Because we failed to identify a gene encoding threonine deaminase but did find three putative leuA genes, we experimentally determined the substrate specificity of these enzymes (see Methods). The enzyme encoded by LA2202 is an isopropylmalate synthase (leuA1), whereas LA2350 encodes citramalate synthase (cimA). Although the enzyme encoded by LA0469 has some citramalate synthase activity, it is primarily an isopropylmalate synthase (leuA2).

The genomic information enhances our understanding of the mechanisms of virulence and pathogenesis in leptospirosis. As with most other pathogenic bacteria, L. interrogans possesses several genes related to the attachment and invasion of eukaryotic cells (mce, invA, atsE and mviN; Supplementary Information 2-2). The unique cellular shape and motility apparatus of spirochaetes provide these organisms with an additional method of achieving effective infection5,9. We found at least 50 genes (not including chemotaxis genes) related to motility, accounting for more than 1% of the deduced CDSs (Fig. 2). Like B. burgdorferi and T. pallidum, L. interrogans uses FlaA sheath protein and FlaB core protein as the essential components of its endoflagellar filament5. Other bacteria5,10 employ FliC for this purpose. L. interrogans also has a complete set of genes (Supplementary Information 2-2) for shape determination. In contrast to B. burgdorferi11, the finely coiled spiral shape of leptospires is likely to be mainly attributable to the murein layer rather than the flagella12.

Chemotaxis is generally acknowledged to be an important virulence factor for pathogenic bacteria. The chemotaxis system of L. interrogans (Fig. 2) is more complex than that of either T. pallidum or B. burgdorferi. The recognition of many genes (12 CDSs) encoding methyl-accepting chemotaxis proteins (MCPs) presumably reflects the extremely diverse environmental situations that a facultatively parasitic zoonotic bacterium can encounter. Using secondary-structure prediction methods, we designated 5 of the 15 CDSs with clear CheY-like response domains as cheY genes (Supplementary Information 4-1). However, only one such gene was located in a putative chemotaxis operon (cheWABY, LA1250-1253).

Leptospirosis virulence has been attributed in part to the effect of the leptospiral LPS1. The nucleotide sequence of the locus encoding a set of enzymes for the biosynthesis of the O-antigen component of Leptospira LPS (rfb locus) is known for four serovars of two species5. We identified an rfb locus of 103 kilobases (kb) (Supplementary Information 5) in L. interrogans serovar lai. In agreement with findings in other rfb loci of leptospires, almost all of the 97 CDSs (LA1576 to LA1672), except three short ones, are encoded on the same strand (forward). About 30 kb of nucleotide sequence located at the 3′-proximal end of the locus is almost identical to its counterpart in serovar copenhageni (GB: U61226). Unlike the situation in L. borgpetersenii serovar hardjo, subtype hardjobovis, no IS elements were found within or flanking the rfb locus. We tentatively assigned a series of genes encoding O-antigen-processing enzymes within and outside the rfb locus by comparisons of predicted transmembrane patterns with genes characterized in other Gram-negative bacteria (Supplementary Information 4-2). This is a strong indication that the biosynthesis of LPS in L. interrogans proceeds through the Rfc (Wzy)-dependent pathway.

In contrast to T. pallidum and B. burgdorferi5, genes encoding enzymes involved in the biosynthesis of the Lipid A backbone and its KDO (2-keto-3-deoxyoctonoic acid) core (Supplementary Information 6) are present in L. interrogans. The LPS of L. interrogans is a structurally unique molecule of relatively low toxicity5 that activates macrophages in a distinct manner13. These characteristics can be rationalized on the basis of structural comparisons between LpxA proteins of different bacterial origins (Supplementary Information 4-3).

Although it is not clear whether the extensively studied sphingomyelin-specific phospholipases have significant roles in the pathogenesis of leptospirosis1, we identified four genes encoding other kinds of haemolysin in addition to five genes coding for sphingomyelinase-like proteins (Supplementary Information 7). All these proteins have been expressed in Escherichia coli, and their haemolytic activities have been demonstrated (Y.X.-Z. and G.-P.Z., unpublished results).

The genome of L. interrogans encodes several proteins bearing homology to animal proteins important in haemostasis (Supplementary Information 8). These include a protein that resembles the mammalian platelet-activating factor (PAF) acetylhydrolase14 (LA2144, pafAH) and another that is similar to von Willebrand factor15 type A domains (LB054 and LB055, vwa). No bacterial genomes have hitherto been shown to encode both of these proteins, although they have been separately identified in several bacterial species (Supplementary Information 8). A third gene relevant to haemostasis, so far found only in Leptospira, seems to specify an orthologue of paraoxonase (LA0399, pon). This protein might hydrolyse PAF through its arylesterase activity16. Because a colA17 gene (LA0872) encoding microbial collagenase has been identified, it is reasonable to propose that collagenase-mediated injury to the vascular epithelium during infection and the subsequent combined effects of the Vwa, PafAH and Pon proteins could lead to a loss of haemostasis, in addition to the proposed effects of LPS1,13. This model is consistent with the clinical manifestations of leptospirosis, namely damage to the endothelial cell membranes of small blood vessels1. It also might explain the observed sequelae of severe infections by serovar lai, such as massive pulmonary haemorrhage and fatal sudden haemoptysis1.

Among eubacteria, spirochaetes are evolutionarily primitive9,18. However, the fact that leptospires can survive either as saprophytes or as facultative parasites has presumably afforded them significant growth opportunities, although not without pressure for co-evolution in response to their environment or hosts. A BLAST analysis was performed to compare the best-hit distribution of protein homologues in representative eubacteria with the predicted proteomes of bacteria, virus (phage), archaea and eukarya. The result (Fig. 3) suggests that the genome of L. interrogans surpasses those of other bacteria in terms of the number of proteins with structural similarity to eukaryal and archaeal proteins that it encodes. In this respect, L. interrogans resembles B. burgdorferi and Mycoplasma genitalium. This raises several important evolutionary questions, including the possibility that lateral gene transfer, operating in parallel with standard gene evolution events, contributed to the emergence of an important human pathogen from an environmental bacterium.

Figure 3: Distribution of the best hits for BLAST protein homologues by representative eubacteria against predicted proteomes of bacteria (brown squares), virus (phage) (blue bars), archaea (yellow bars) and eukarya (red bars).

See the Methods section for details of analysis. The symbols used are: lint, L. interrogans; bbur, B. burgdorferi; tpal, T. pallidum; mgen, M. genitalium; vcho, V. cholerae; ecok, E. coli K12; ecoo, E. coli O157 Sakai; styp, Salmonella typhimurium LT2; ypes, Yersinia pestis CO92; atum, Agrobacterium tumefaciens C58 Cereon; syne, Synechocystis PCC6803; bsub, B. subtilis; saur, Staphylococcus aureus MW2; spne, Streptococcus pneumoniae TIGR4; blon, Bifidobacterium longum; mlep, Mycobacterium leprae; scoe, Streptomyces coelicolor. The percentage distributions of CDSs similar to their counterparts are depicted as coloured histograms. Scales used: 0–100% for bacteria, and 0–20% for virus (phage), archaea and eukarya.


Source and culturing of study organism

The Leptospira interrogans serogroup Icterohaemorrhagiae serovar lai type strain 56601 used in this study is maintained by the National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention (ICDC, China CDC), Beijing, China2. For sequencing purposes, a single colony was picked from EMJH1 soft agar and cultured in the same medium. The culture thus obtained was then subjected to morphological, serological, genetic and virulence analysis. The properties of the strain were in accordance with those of pathogenic Leptospira. For functional analysis, growth curves for L. interrogans in EMJH or Korthof1 medium were measured turbidimetrically, and viable bacterial counts were determined by dark-field microscopy. Culture conditions were then developed to ensure that only mid-exponential-phase bacterial cultures were used for further experimentation.

Genome sequencing and analysis

The genome of strain Lai of L. interrogans was sequenced by a whole-genome random sequencing method previously applied to other microbial genomes3,4,6. Three different libraries were used in this project. The first two, in pUC18, had inserts of either 1.5–3 kb or 8–10 kb. The third was a 40-kb cosmid library. Altogether, 111,402 sequence reads (Phred value >Q20 (refs 19, 20)) were generated, giving an overall genome coverage of 8.5-fold; of these reads, 1,600 were end sequences of large-insert plasmid (8–10-kb) clones and 1,000 were end sequences of cosmid clones. The Phred/Phrap/Consed software package19,20,21 was used for quality assessment and sequence assembly. The initial assembly yielded 805 contigs, which were clustered into 145 groups based on linking information from forward and reverse sequence reads. Some contigs were also located on the physical map by Southern analysis. Sequence and/or physical gaps of the chromosomes were closed by primer walking and PCR. The final assembly was checked against the physical map of restriction sites, mapped genes and end sequences of large plasmid and cosmid clones.
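As a rough plausibility check on the sequencing figures, the following Python sketch relates read count, genome size and coverage. It assumes the common definition coverage = total read bases / genome size; the text does not report a mean read length, so the value computed here is only what that definition implies, not a figure from the study. The function name is hypothetical.

```python
GENOME_SIZE_BP = 4_691_184   # CI + CII, from the text
N_READS = 111_402            # sequence reads, from the text
COVERAGE = 8.5               # reported fold coverage


def implied_mean_read_length(genome_size: int, n_reads: int, coverage: float) -> float:
    """Mean read length implied by coverage = n_reads * mean_length / genome_size."""
    return coverage * genome_size / n_reads


if __name__ == "__main__":
    mean_len = implied_mean_read_length(GENOME_SIZE_BP, N_READS, COVERAGE)
    print(f"Implied mean usable read length: {mean_len:.0f} bp")
```

Under this assumption the implied mean usable read length is a few hundred base pairs.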

Assignment of CDSs

CDSs were determined with Glimmer 2.0 (ref. 22) and the Z-curve method23, and the results were subjected to further manual inspection. A few CDSs were found by manual curation guided by BLAST results. BLAST searches against the NCBI non-redundant protein database (or SwissProt, PIR and COG) were performed to determine similarity. The BLAST search criteria were as follows: (1) an e-value threshold of 10⁻⁵ and (2) alignment of at least 60% of the subject sequence. If there was no database hit, domain analysis was performed by searching the Pfam, PRINTS, PROSITE, ProDom, Blocks and SMART databases. Transfer RNAs were predicted with tRNAscan-SE24. TopPred25 was used to identify potential membrane-spanning domains in proteins. The presence of signal peptides and the probable position of a cleavage site in secreted proteins were detected with SignalP. Lipoproteins were identified by scanning for a lipobox ([LV][ASTVI][GAS][C]) in the first 30 amino acids of every protein. Possible metabolic pathways were examined using the KEGG database10. Transmembrane helices in proteins were predicted by the TMHMM method (Supplementary Information 4). Predicted biological roles were assigned by the classification scheme in ref. 26. In cases in which tertiary structures of hypothetical proteins were predicted, sequences of CDSs were submitted to the SWISS-MODEL server and the illustrations were prepared with Rasmol 2.6.
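The lipobox scan described above maps directly onto a regular expression. The following Python sketch is one possible reading of that step: it assumes that the whole four-residue motif must lie within the first 30 amino acids, which the text does not specify, and the function name and example sequences are hypothetical.

```python
import re
from typing import Optional

# Lipobox pattern from the text: [LV][ASTVI][GAS][C]
LIPOBOX = re.compile(r"[LV][ASTVI][GAS]C")


def find_lipobox(protein: str, n_terminal: int = 30) -> Optional[int]:
    """Return the 0-based start of the first lipobox in the N-terminal region.

    Only the first `n_terminal` residues are searched, so a match must lie
    entirely within that region (an assumption; see the lead-in). Returns
    None when no lipobox is found.
    """
    match = LIPOBOX.search(protein[:n_terminal].upper())
    return match.start() if match else None


if __name__ == "__main__":
    # Hypothetical sequences, for illustration only.
    print(find_lipobox("MKRLLALSGCSSD"))          # motif LSGC near the N terminus
    print(find_lipobox("MKRDDDEEE"))              # no motif
    print(find_lipobox("A" * 40 + "LSGC"))        # motif outside the first 30 residues
```

A pattern match alone is a weak predictor, which is presumably why the text combines it with signal-peptide prediction; candidates found this way would normally be checked against those results.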

Deposition of data

In addition to the data deposited at the NCBI database (GB: AE010300 for CI and GB: AE010301 for CII), the L. interrogans genome database is also available at and at

BLAST analysis

The BLAST analysis for comparing the best-hit distribution of protein homologues in representative eubacteria with the predicted proteomes of bacteria, virus (phage), archaea and eukarya was based, with modifications, on the approach of ref. 27 for studying horizontal gene transfer. The data were retrieved from NCBI TaxMap. The CDSs of each bacterium were used in a BLAST search against the database. Only those hits that scored at least 95 bits were collected and ranked. The ‘most’ similar organism was the one whose homologous protein showed the strongest similarity to the query CDS.
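The best-hit tally described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: the input is taken to be pre-parsed hits with a taxonomic domain already attached, hits to the query organism itself are assumed to have been removed beforehand, and percentages are computed relative to the queries that have at least one qualifying hit. The function name, tuple layout and example identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (query CDS, subject organism, subject domain, bit score)
Hit = Tuple[str, str, str, float]


def best_hit_distribution(hits: Iterable[Hit], min_bits: float = 95.0) -> Dict[str, float]:
    """Percentage of queries whose best hit (>= min_bits) falls in each domain."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, _organism, domain, bits in hits:
        if bits < min_bits:
            continue  # discard hits below the 95-bit threshold from the text
        if query not in best or bits > best[query][1]:
            best[query] = (domain, bits)
    counts = Counter(domain for domain, _ in best.values())
    total = sum(counts.values())
    return {d: 100 * n / total for d, n in counts.items()} if total else {}


if __name__ == "__main__":
    # Hypothetical hits, for illustration only.
    example = [
        ("cds1", "organism A", "bacteria", 200.0),
        ("cds1", "organism B", "eukarya", 250.0),
        ("cds2", "organism C", "bacteria", 120.0),
        ("cds3", "organism D", "archaea", 90.0),   # below threshold, ignored
    ]
    print(best_hit_distribution(example))
```

Using the number of queries with a qualifying hit as the denominator is only one option; dividing by the total number of CDSs in the genome would give lower percentages.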

Enzyme assays

Citramalate synthase activity was assayed as described in ref. 28, with minor modifications. Isopropylmalate synthase activity was assayed as described in ref. 29, with minor modifications.