Main

The first bacterial genome to be completely sequenced, annotated, and published was that of Haemophilus influenzae, reported in 1995 by Fleischmann et al. (1) at The Institute for Genomic Research. Today, 73 prokaryotic (archaeal and bacterial) genomes have been completed and at least 120 others are in various stages of completion (http://www.tigr.org/tdb/mdb/mdbcomplete.html; http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html; http://www.sanger.ac.uk/Projects/Microbes/). Among the completed genomes, 35 are from human bacterial pathogens (2–35) (see Table 1). Table 1 presents six key features of these genomes: size, topology, guanine plus cytosine (G+C) content, total number of predicted open reading frames (ORFs), percentage of the ORFs with unknown function, and percentage of the ORFs that are unique to the species (or strain). The major impetus for obtaining the complete genome sequence of an organism is to gain a better understanding of the biology and evolution of the microbe and, for pathogens, to identify new vaccine candidates, putative virulence genes, and targets for antibiotics. This review highlights some of the major findings about bacterial genomes and their impact on strategies and approaches for investigating the mechanisms of pathogenesis, prevention, and treatment of infectious diseases.

Table 1 Genomes of pathogenic bacteria

DIVERSITY OF BACTERIAL GENOMES

Bacteria generally have a double-stranded circular DNA genome. However, a number of species have been shown to have a linear chromosome (36). The best known example is Borrelia burgdorferi, which also has the distinction of having the largest number of extrachromosomal elements, 12 linear and nine circular plasmids (2, 3). Unlike other bacteria, Vibrio cholerae (34) and Deinococcus radiodurans (37) have two circular chromosomes per cell.

The size of a bacterial genome can range from 0.58 Mbp to 10.0 Mbp, and the G+C content can vary from 25% to 75% (36). Bacteria of the same species and closely related species have very similar G+C content and generally have genomes of similar size. However, for some species, the size of the genome can vary markedly. For example, Escherichia coli K-12 has a genome size of 4.6 Mbp, whereas that of E. coli O157:H7 is 5.6 Mbp (the additional 1 Mbp can encode approximately 1000 genes). In contrast, different isolates of Chlamydia pneumoniae have similar genomic size, overall organization, gene order of orthologous genes, and predicted proteomes (see Table 1) (8).
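As an illustration of how the G+C content quoted above is computed, the following is a minimal sketch. The function name `gc_content` and the choice to ignore ambiguous bases such as N are assumptions made for this example, not a method described in the reviewed studies.

```python
def gc_content(seq: str) -> float:
    """Return the G+C content of a DNA sequence as a percentage."""
    seq = seq.upper()
    # Count only unambiguous bases so that N or gap characters do not skew the result.
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


if __name__ == "__main__":
    # Toy sequence: 6 of 10 bases are G or C.
    print(gc_content("ATGCGCGCTA"))
```

Applied to a whole chromosome, the same calculation yields the genome-wide values of the kind listed in Table 1.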

Diversity of codon usage has also been observed among bacteria. Significant deviation from the standard “universal” code has been observed in bacteria with small genomes and/or extreme G+C content. UGA, a stop codon in the standard genetic code, encodes tryptophan in mycoplasmas (e.g. M. genitalium and Ureaplasma urealyticum). The CGG codon for arginine is an unassigned (of no function) codon in M. capricolum (25% G+C), whereas in Micrococcus luteus (74% G+C), AGA (arginine) and AUA (isoleucine) are unassigned (38).
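The effect of reassigning UGA from stop to tryptophan can be shown with a small sketch. The codon table below is the standard genetic code written in DNA letters (TGA for UGA); the names `STANDARD_CODE`, `MYCOPLASMA_CODE`, and `translate` are illustrative choices for this example only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; amino acids are listed in TCAG codon order (TTT, TTC, TTA, ...).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}

# Mycoplasma-type code: identical except that TGA (UGA in mRNA) encodes tryptophan.
MYCOPLASMA_CODE = dict(STANDARD_CODE, TGA="W")


def translate(dna: str, code: dict) -> str:
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = code[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    orf = "ATGTGGTGAAAATAA"  # toy ORF with an internal TGA codon
    print(translate(orf, STANDARD_CODE))    # truncated at TGA
    print(translate(orf, MYCOPLASMA_CODE))  # reads through TGA as tryptophan
```

An ORF predicted with the wrong table would appear truncated at every internal UGA, which is why the correct genetic code must be chosen when annotating these genomes.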

BACTERIAL EVOLUTION

Deletions, pseudogenes, and reductive evolution.

Bacterial evolution is dependent on natural selection operating on mixed populations of parental forms and genetic variants. Genetic variations are generated through spontaneous mutations, intrachromosomal recombinations, and acquisition of DNA from other organisms. Spontaneous mutations produce variant genes, including pseudogenes (nonfunctional genes, usually with one or more nonsense or frame-shift mutations). Intrachromosomal recombinations and spontaneous mutations can generate major deletions, duplications, and translocations, leading to chromosomal and gene reorganization. With the exception of a few gene clusters (e.g. ribosomal protein genes), organization of genes is not conserved between distantly related bacteria (39). Even between closely related species, such as M. genitalium and U. urealyticum, there is little conservation of gene order across large fractions of the genome (19, 33).

The genomes of obligate intracellular parasitic bacteria seem to have undergone reductive evolution (40). Evidence of this was observed in the genomes of Rickettsia prowazekii (24, 41) and Mycobacterium leprae (17). In the M. leprae (17) and R. prowazekii (24) genomes, the protein-encoding regions represent only 76.5% and 75.4% of the genome, respectively. Thus, large regions of the genome contain noncoding sequences, which may represent regulatory sequences or residual genes mutated beyond recognition. The average coding content of free-living bacteria is approximately 90%. The Campylobacter jejuni genome is the densest, with a coding content of 94.3% (5). Among all of the genomes sequenced, the M. leprae genome has the highest level of pseudogenes, representing 23.5% of the genome (17). Many pseudogenes eventually will be lost from the genome through deletions.

Mollicutes (commonly named mycoplasmas) as a group have the smallest genomes. The genome of M. genitalium is only 0.58 Mbp, which is the smallest of the bacterial genomes sequenced. It is predicted to have 470 ORFs with an average size of 1040 bp, making up 88% of the genome, a value similar to that of most free-living bacteria (19). The reduction in genome size is therefore not due to an increase in gene density or a decrease in gene size. The genome seems to have evolved from a much larger one through successive deletion of nonessential genes and the acquisition of a small number of new genes that are largely involved in transport. There is an almost complete lack of genes involved in amino acid biosynthesis, de novo nucleotide biosynthesis, and fatty acid biosynthesis. The genomes of two other Mollicutes, M. pneumoniae (20) and U. urealyticum (33), have also been sequenced (see Table 1). Comparative analysis of these three genomes and those of other bacteria provides information for defining the essential functions of a minimal self-replicating bacterial cell.

Recent comparative sequence analyses and genetic studies of commensal E. coli, enteroinvasive E. coli, and Shigella species showed that deletion of certain large genomic fragments, or of genes that inhibit virulence, can lead to an increase in virulence. These large genomic sequences that are absent from pathogenic strains but present in nonpathogenic isolates are known as “black holes” (42). Formation of black holes is a complementary but inverse pathway to the acquisition of pathogenicity islands (see below) in the evolution of pathogenic lineages.

Horizontally transferred genes.

Phylogenetics and comparative sequence analysis of orthologous genes (homologous genes with the same function) from distantly and closely related species provide strong indications that some genes in many bacterial genomes were acquired through horizontal (lateral) transfer. A phylogenetic tree generated from orthologous protein or gene sequences from various species should reflect the evolutionary relationship and should be congruent with trees based on 16S rRNA sequences. Noncongruence of the trees, or contradiction of established evolutionary relationships, suggests horizontally transferred genes (HTGs). Horizontal transfer of genes can be mediated by DNA transformation, phage-mediated transduction, or plasmid-mediated conjugation. The G+C content and codon usage of HTGs should resemble those of the donor genome, which allows easy identification when the donor DNA differs significantly in G+C content from that of the recipient. HTGs often exist in clusters, ranging in size from a few kb to 200 kb. These clusters are generally known as genomic islands. Islands that contain genes involved in pathogenesis are called pathogenicity islands (PAIs) (43). PAIs are often absent from nonvirulent strains, flanked by small direct repeat sequences, and associated with a tRNA gene, and they often carry insertion and mobile elements that encode integrases and transposases. Koonin et al. (44) estimated the levels of HTG in various sequenced genomes. The estimates vary from a modest 1.6% of the genes for M. genitalium to 32.6% for Treponema pallidum. M. pneumoniae is also low, at 2.1%, and Pseudomonas aeruginosa, a proteobacterium found in diverse environments, has an intermediate value of 13.1%. E. coli O157:H7 strain EDL933 (enterohemorrhagic) has 177 strain-specific genomic islands scattered over the whole genome that are not found in E. coli K-12 strain MG1655 (nonpathogenic), and MG1655 has 234 genomic islands that are absent from EDL933 (12).
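The idea that HTGs can be spotted by their atypical G+C content can be sketched as a sliding-window scan. This is a simple heuristic written for illustration; it is not the procedure used by Koonin et al. (44), and the function name, window size, step, and 10-percentage-point threshold are all assumed values.

```python
def atypical_gc_windows(genome: str, window: int = 1000, step: int = 500,
                        threshold: float = 10.0):
    """Return (start, end, gc_percent) for each window whose G+C content differs
    from the whole-sequence value by more than `threshold` percentage points."""
    def gc(s: str) -> float:
        return 100.0 * (s.count("G") + s.count("C")) / len(s)

    genome = genome.upper()
    if not genome:
        return []
    genome_gc = gc(genome)
    hits = []
    # Slide a fixed-size window along the sequence (linear, no wrap-around).
    for start in range(0, len(genome) - window + 1, step):
        window_gc = gc(genome[start:start + window])
        if abs(window_gc - genome_gc) > threshold:
            hits.append((start, start + window, window_gc))
    return hits


if __name__ == "__main__":
    # Toy "genome": an AT-rich background with a 500-bp GC-rich insert.
    toy = "AT" * 5000 + "GC" * 250 + "AT" * 5000
    for start, end, value in atypical_gc_windows(toy, window=500, step=250):
        print(start, end, round(value, 1))
```

Flagged windows are only candidates. Composition alone cannot detect transfers from donors with similar G+C content, which is one reason phylogenetic incongruence is used as complementary evidence.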

Slipped-strand mispairing.

Slipped-strand mispairing at homopolymeric (primarily G and C) tracts during replication can generate a high level of mutations at these sites and form hypervariable sequences. Hypervariable homopolymeric sequences were identified in the genome of C. jejuni (5) and Helicobacter pylori (14, 15). In C. jejuni, most of the hypervariable sequences are coincident with the clusters of genes involved in lipooligosaccharide and surface polysaccharide biosynthesis and flagella modification (5). Tract length variation results in translational frameshifting and has been shown to be responsible for phase variation of lipooligosaccharide structure of C. jejuni (45, 46) and surface structure of other bacteria, and such variations have been proposed to play a role in adaptive evolution (47).
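A short sketch can make the link between tract length and frameshifting concrete. It locates homopolymeric G or C tracts and reports the reading-frame offset that a given tract length imposes on downstream codons. The function names and the minimum tract length of 8 are assumptions chosen for illustration, not values taken from the cited studies.

```python
import re


def homopolymer_tracts(seq: str, min_len: int = 8, bases: str = "GC"):
    """Return (start, base, length) for each run of one base from `bases`
    that is at least `min_len` long."""
    pattern = re.compile(r"([%s])\1{%d,}" % (re.escape(bases), min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]


def frame_after_tract(tract_len: int) -> int:
    """Reading-frame offset (0, 1, or 2) of the sequence downstream of a tract,
    relative to a codon boundary at the start of the tract."""
    return tract_len % 3


if __name__ == "__main__":
    seq = "AT" + "G" * 9 + "TAC"
    print(homopolymer_tracts(seq))
    # Gaining or losing one base in the tract changes the downstream frame.
    for length in (8, 9, 10):
        print(length, frame_after_tract(length))
```

Because any change in tract length that is not a multiple of three shifts the frame, a gene containing such a tract can switch between in-frame and out-of-frame states from one generation to the next, which is the basis of the phase variation described above.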

Bacterial speciation and intraspecies variation.

Complete genomic sequences of multiple strains of the same species have been determined for Chlamydia pneumoniae, C. trachomatis, E. coli, H. pylori, Neisseria meningitidis, Staphylococcus aureus, Streptococcus pneumoniae, and Streptococcus pyogenes. Comparative genomic analyses between strains of the same species have generated valuable information on intraspecies variation and mechanisms of speciation. Intraspecies variation is an inherent feature of bacteria. The magnitude of variation among isolates can differ significantly from species to species and seems to be largely determined by the lifestyle of the organism and the niches it occupies. C. pneumoniae, an obligate parasite of human cells, seems to have a very low level of strain variation, presumably because it occupies a stable environment and has limited opportunity to acquire DNA from other bacterial species (6–8). In contrast, E. coli strains, which occupy diverse environments and often reside in the presence of numerous and diverse populations of bacteria, have genomes that can differ by as much as 20% in size. Lan and Reeves (48) suggested that there is a need for a species genome concept. They proposed that genes found in 95% or more of isolates form the core set of genes for the species and that genes found in 1% to 95% of isolates be considered auxiliary genes.
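The core and auxiliary categories proposed by Lan and Reeves (48) can be expressed as a simple classification over gene presence data. The sketch below assumes a hypothetical input format (gene name mapped to the set of isolates carrying it); the label "rare" for genes found in fewer than 1% of isolates is an addition made for this example and is not part of the proposal.

```python
def classify_genes(presence: dict, n_isolates: int,
                   core_cutoff: float = 0.95, aux_floor: float = 0.01) -> dict:
    """Label each gene as 'core', 'auxiliary', or 'rare' from the fraction of
    isolates that carry it."""
    classes = {}
    for gene, isolates in presence.items():
        freq = len(isolates) / n_isolates
        if freq >= core_cutoff:
            classes[gene] = "core"        # found in 95% or more of isolates
        elif freq >= aux_floor:
            classes[gene] = "auxiliary"   # found in 1% to 95% of isolates
        else:
            classes[gene] = "rare"        # below 1%: label assumed for this sketch
    return classes


if __name__ == "__main__":
    # Hypothetical survey of 200 isolates; gene names are placeholders.
    survey = {
        "geneA": set(range(196)),
        "geneB": set(range(50)),
        "geneC": set(range(1)),
    }
    print(classify_genes(survey, n_isolates=200))
```

In practice, the result depends strongly on how isolates are sampled, so the thresholds are best read as a working definition rather than a sharp biological boundary.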

VIRULENCE GENES

Comparative sequence analyses enable identification of known virulence proteins with conserved sequences or motifs, as well as novel putative virulence proteins and PAIs. The value of genomic sequence is best illustrated by pathogens that are difficult to grow in vitro and have poorly characterized genetic systems. Chlamydiae, obligate intracellular eubacteria, are prime examples of such organisms. Complete genome sequence analyses identified species-specific genes and putative virulence genes.

Comparative genomic analyses identified type III protein secretion system genes, which are conserved among bacteria. The type III system translocates effectors and toxins directly into the cytosol of the host cells or into the extracellular milieu. Putative genes encoding type III effector proteins and a Chlamydia-specific protein that may have a role in virulence were also identified in the C. trachomatis MoPn genome (7). The role of the type III secretion system and type III effector proteins in pathogenesis is now well established in pathogenic E. coli, Salmonella, Yersinia, and Shigella. In the M. leprae genome, a single gene encoding a laminin-binding protein that may be an important virulence factor was identified (17).

NEW VACCINE CANDIDATES, ANTIBIOTICS TARGETS, AND DNA PROBES

Pathogenic bacteria are becoming resistant to commonly used antibiotics at an alarming rate. Accordingly, there is an urgent need for the development of new antibiotics and vaccines. With the aim of identifying new vaccine candidate genes of N. meningitidis and S. pneumoniae, the whole-genome sequences of these two pathogens were scanned to identify ORFs encoding proteins with secretion motifs or similarity to predicted virulence factors. A total of 130 ORFs of S. pneumoniae were identified and then cloned into an expression vector. The products were purified and tested in a mouse model for induction of protective antibodies against pneumococcal challenge. Six novel antigens encoded by five separate genes of S. pneumoniae conferred protection against disseminated infection in the mouse model. These proteins, shown to be widely conserved among different isolates and immunogenic in human infection, are currently being evaluated as a vaccine for the prevention of mucosal infection and invasive disease caused by pneumococci (49). Similar whole-genome scanning identified 570 ORFs of N. meningitidis, which were cloned into expression vectors, and the purified recombinant proteins were then used to immunize mice to generate specific antisera. Seven of these proteins were localized to the cell surface of N. meningitidis using the antisera, induced bactericidal antibodies, and were conserved among different isolates, all characteristics of an effective vaccine candidate (50).

Another major benefit of the bacterial genome sequences is in antibacterial drug development (51). Comparative analyses of the encoded proteins of the completed genomes have shown that a significant fraction of the ORFs are unique to the species (see Table 1). Among the sequenced bacterial pathogens, the percentage of genes that are unique to the species ranges from 7% to 32%. Some of the proteins encoded by species-specific genes are essential for growth or survival in the infected host and should serve well as novel targets for the development of highly species-specific antibacterial drugs. Species-specific antibiotics have the potential to reduce the imminent problem of interspecies transfer of drug-resistance genes and to reduce nonspecific toxicity to the beneficial commensal microflora in the gut.

An important and immediate benefit of having sequenced the complete genome of bacteria is the potential for developing rapid and highly species-specific DNA probes and immunoprobes for the identification of pathogens. Using PCR technology, it should be possible to develop multitarget probes based on conserved species-specific genes and known virulence genes. Rapid and reliable specific tests will improve treatment of infectious diseases and reduce the levels of misuse of antibiotics. Genomic sequences have also had an impact on the development of typing schemes of infectious agents. A typing scheme based on multilocus sequences has been developed for various pathogens and may well become the gold standard for typing bacterial pathogens (52).

BACTERIA-HOST INTERACTIONS

The availability of a complete genomic sequence makes it possible to examine the global transcription profile of a cell. Both bacterial and mammalian (mouse, human) genome sequences can be used in microarray technology to define the expression profiles of pathogens and host cells. The global transcriptional effects on host cells of various bacterial pathogens, including Listeria monocytogenes, Salmonella, Pseudomonas aeruginosa, and Bordetella pertussis, have been analyzed by using microarray technology (53). Rosenberger et al. (54) identified novel macrophage genes whose levels of expression are altered during S. typhimurium infection or upon treatment with lipopolysaccharide. Similarly, Cohen et al. (55) identified 74 up-regulated and 23 down-regulated host RNAs in L. monocytogenes-infected human promyelocytic THP1 cells. Infection of human bronchial epithelial cells (BEAS-2B) by B. pertussis results in an increase in the transcriptional levels of 33 genes and a decrease in the transcriptional levels of 65 genes (56). Many of the up-regulated genes encode proinflammatory cytokines (e.g. IL-8, IL-6, and growth-related oncogene-1), and many of the down-regulated genes encode transcription factors and cellular adhesion molecules. Understanding the molecular basis of the host response to bacterial infections is critical for preventing disease and tissue damage resulting from that response. Furthermore, an understanding of the host transcriptional changes induced by the microbes can be used to identify specific protein targets for drug development.

POSTGENOME RESEARCH

Analysis of the 35 completed bacterial pathogen genomes makes clear how little we comprehend of the biology of these pathogens: on average, approximately 31% of the predicted ORFs have unknown functions (Table 1). Understanding the functions of these genes and how they and their products are regulated is among the major tasks confronting us. New and better algorithms and programs for structure prediction and for identification of new conserved motifs in proteins are needed. Putative virulence genes identified from the sequenced genomes by bioinformatics must be verified experimentally by constructing isogenic mutants and testing them in appropriate animal models. There is an urgent demand for good animal models for some pathogen-induced diseases (e.g. campylobacteriosis). Inexpensive animal models with a large repertoire of knock-out mutants defective in innate immune response or signal transduction will also be in great demand. Gene expression profiling technology has recently been extended to examine the expression profile of a specific cell population (gastric parietal cells) in mice with or without infection by H. pylori (57). Similar studies can be extended to other pathogens and animal model organisms. Such studies will provide a clearer picture of the molecular events that occur in human infection. This knowledge is critical for gaining a better understanding of the mechanisms of pathogenesis.

THE NEW “OMICS” ERA

Completion of a large number of microbial genomes and the human genome provides enormous impetus to develop and implement new techniques to manage and exploit this sequence information. This has led to the creation of a new generation of “omics” enterprises, which emphasize comparative and functional aspects of genomics, transcriptomics, proteomics, metabolomics, infectomics, pharmacogenomics, immunoproteomics, and many others (58–61). An essential component of the “omics” era is the development of new computational methods (bioinformatics) that aim to solve biologic problems (62). New advances in bioinformatics are major driving forces in many areas of biologic research.