Abstract
Helicobacter pylori, one of the most common bacterial pathogens of humans, colonizes the gastric mucosa, where it appears to persist throughout the host's life unless the patient is treated. Colonization induces chronic gastric inflammation which can progress to a variety of diseases, ranging in severity from superficial gastritis and peptic ulcer to gastric cancer and mucosal-associated lymphoma1. Strain-specific genetic diversity has been proposed to be involved in the organism's ability to cause different diseases or even be beneficial to the infected host2,3 and to participate in the lifelong chronicity of infection4. Here we compare the complete genomic sequences of two unrelated H. pylori isolates. This is, to our knowledge, the first such genomic comparison. H. pylori was believed to exhibit a large degree of genomic and allelic diversity, but we find that the overall genomic organization, gene order and predicted proteomes (sets of proteins encoded by the genomes) of the two strains are quite similar. Between 6 and 7% of the genes are specific to each strain, with almost half of these genes being clustered in a single hypervariable region.
H. pylori strain J99 (cagA+ vacA+), isolated in the USA in 1994 from a patient with a duodenal ulcer, was subjected to minimal subculturing before being sequenced by us in 1996. We describe this sequence below and compare it with the sequence of strain 26695, which was isolated in the UK before 1987 from a gastritis patient and which had a history of subculturing before being sequenced5. The J99 circular chromosome is 1,643,831 base pairs (bp) in size, which is 24,036 bp smaller than the 26695 chromosome. Several features, including the absence of an identifiable origin of replication, the average length of coding sequences and the relative frequency of the different initiation codons, are similar in the two strains (Table 1). We predict that there are 1,495 open reading frames (ORFs) in J99, representing 91% of the genome. Eighty-nine of these ORFs are absent from 26695. Of these J99-specific ORFs, 25 and 8 have sequence similarity to genes of predicted and unknown function, respectively, and 56 share no significant sequence similarity with any genes in public databases. J99 has 95 fewer genes than has been reported for 26695. However, 54 predicted genes of strain 26695 are less than 150 bp in size. In comparison with J99 genes, these 54 small genes either are highly conserved (16) and likely to encode proteins (note that three of these 26695 ORFs are part of larger ORFs in J99), or contain in-frame stop codons or exhibit nucleotide drift (38), as do other intergenic regions, and are therefore unlikely to encode proteins. Thus, we revised the 26695 gene complement to 1,552 genes; 117 of these are unique to 26695 and 26 of these unique genes have a predicted function. Some genes appeared to contain a frameshift in J99 or 26695: 27 J99 genes are the equivalents of 55 predicted genes in 26695, and 7 genes from 26695 are the equivalents of 15 predicted genes in J99. In addition, three single-copy genes in 26695 have complete (gene HP1365; H. pylori 26695 genes are numbers preceded by 'HP') or partial (genes HP0818 and HP0928) duplications in J99. There are 1,406 genes in J99 that have counterparts in 26695.
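The gene counts in this paragraph can be cross-checked with simple arithmetic. The following is a minimal sketch that uses only the figures quoted above; the variable names are illustrative, and the originally reported 26695 gene total is an assumption inferred from the stated difference of 95 genes rather than a number quoted in the text.

```python
# Bookkeeping with the counts quoted in the text (variable names are illustrative).
j99_orfs = 1495
j99_specific = {"predicted function": 25, "unknown function": 8, "no database match": 56}

# Assumption: the originally reported 26695 total is inferred from the stated
# difference of 95 genes; the text does not quote it directly.
reported_26695 = j99_orfs + 95
small_orfs = {"conserved, likely coding": 16, "unlikely to encode proteins": 38}

revised_26695 = reported_26695 - small_orfs["unlikely to encode proteins"]
j99_shared = j99_orfs - sum(j99_specific.values())
size_26695_bp = 1_643_831 + 24_036  # J99 size plus the stated size difference

print(revised_26695, j99_shared, size_26695_bp)
```

Under these assumptions the revised 26695 complement and the number of J99 genes with counterparts both agree with the values given in the text.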
Both genomes contain two 16S and two 23S–5S ribosomal RNA copies in the same relative locations, but strain 26695 contains a further, orphan 5S rRNA. In contrast to most other bacteria, the H. pylori rRNA loci are not contiguous, indicating that they may be regulated in a complex way. There are fewer complete insertion-sequence elements and fragments in J99 than in 26695, yet their location in both strains appears to be biased towards one half of the genome (Fig. 1a). Both genomes encode 36 transfer RNA species, each mapping to the same relative location. Neither strain contains Asn or Gln tRNA species; however, we have identified homologues for the Bacillus subtilis gatABC genes (genes JHP769/HP0830, JHP603/HP0658 and JHP909/HP0975; H. pylori J99 genes are numbers attached to the prefix 'JHP'), which amidate glutamate-charged tRNAs to make glutamine-charged tRNAs6. Such genes are also likely to be responsible for amidation of appropriate aspartate-charged tRNAs.
Figure 1: Comparison of the two sequenced H. pylori genomes based on the chromosomal organization of strain 26695.

a, Genome-wide view. Circles are numbered starting from the outermost concentric ring. Circle 1, nucleotide and circle 2, amino-acid similarity between each J99 and 26695 orthologue. The relative location and amount of each J99-specific sequence are shown immediately inside the second circle (the height of each line is proportional to the amount of unique sequence, and for larger regions the size relative to the equivalent 26695 region is indicated by a triangle proportional to the 26695 scale). The largest J99-specific region shown is composed of two segments separated by 150 bp (see b for details). Circles 3 and 4, which flank the solid reference circle, show the locations of rRNA, insertion-sequence (IS) and repeat elements, for 26695 (circle 3) and J99 (circle 4). Circles 5 and 6 represent the locations of the NotI sites in the 26695 and J99 genomes, respectively. Circle 7 represents the relative transcriptional direction of J99 genes compared to their 26695 orthologues. Regions that are not coloured and translocations are transcribed in the same relative direction in J99 and 26695, whereas inversions result in genes being transcribed in the opposite relative direction in J99. Circle 8 represents the organization of the J99 genome relative to the 26695 genome, incorporating artificial end points needed to allow the alignment. The required inversions and/or translocations are numbered consecutively for J99. b, c, An expanded view of the complex organizational differences 7–10 (b) and 3A/3B (c) shown in circle 8 of a. The 26695 ORFs are shown in the order and location that they are found (black numbers). The J99 ORFs are shown as red numbers. ORFs and other elements in parentheses are found in 26695 but not J99. The organization of J99 segments that share >90% identity to 26695 is depicted by the solid black lines, with arrows indicating the relative orientation of the J99 segments with respect to 26695 segments. The open triangles represent J99-specific DNA, drawn to scale, with the size and genes shown. The order of these regions in J99 is indicated by lower-case blue letters. The circled numbers correspond to the inversions/translocations referred to in a. Sizes of regions in a are shown in megabases (Mb).
Severity of H. pylori related disease is correlated with the presence of an island of genes (the cag pathogenicity island, cagPAI) associated with production of the CagA antigen7 and upregulation of interleukin (IL)-8 in gastric epithelial cells8. Both J99 and 26695 contain the complete cagPAI flanked by the same chromosomal genes and the previously described 31-bp repeat7 but lack the insertion-sequence 605 elements that are associated with cagPAI in strain NCTC11638 (ref. 7). Comparison of available cagPAI gene sequences showed minor differences between the J99 cag PAI genes and the other available sequences, such as apparent deletions in the cag7 gene of J99 and 26695 (JHP476, HP0527) that lead to loss of up to 114 amino acids.
Like 26695, J99 encodes many families of paralogous proteins (337 genes, 22.5% of the total, are members of 113 families). One family contains the vacA-encoded vacuolating cytotoxin and three paralogues. Two of the three orthologues differ significantly in size between J99 and 26695: JHP856 encodes a protein that is 130 amino acids shorter than the protein encoded by HP922, and JHP556 represents a fusion between HP0610 and HP0609. The three paralogues in both strains lack the cleavage signal contained within VacA and may not be secreted.
The DNA-sequence differences between orthologues from the two strains are
mainly found in the third position of coding triplets, consistent with the
variance seen between H. pylori strains using methods dependent on
the nucleotide sequence or on the sequencing of specific loci in different
strains9,10,11. However, this nucleotide variation does not
translate into a highly divergent proteome (Fig. 1a). For example, there are
only eight genes with ≥98% nucleotide identity but 310 proteins with ≥98%
amino-acid conservation, including 41 with perfect identity.
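The relationship between third-position variation and protein conservation can be illustrated with a minimal sketch. It counts mismatches by codon position in two aligned, gap-free coding sequences and compares nucleotide identity with amino-acid identity. The two toy sequences and the function names are hypothetical and are not taken from either genome; the translation table is the standard genetic code.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate complete codons; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def compare_orthologues(a, b):
    """a, b: aligned, gap-free coding sequences of equal length."""
    by_position = [0, 0, 0]  # mismatches at codon positions 1, 2 and 3
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            by_position[i % 3] += 1
    nt_identity = 1 - sum(by_position) / len(a)
    pa, pb = translate(a), translate(b)
    aa_identity = sum(x == y for x, y in zip(pa, pb)) / len(pa)
    return by_position, nt_identity, aa_identity

# Hypothetical toy sequences differing only at third codon positions.
seq_a = "ATGGCTAAAGGTCTGTAA"
seq_b = "ATGGCCAAAGGACTTTAA"
by_position, nt_identity, aa_identity = compare_orthologues(seq_a, seq_b)
print(by_position, round(nt_identity, 3), aa_identity)
```

In this toy case all three substitutions are synonymous, so nucleotide identity falls while the predicted proteins remain identical.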
To align homologous regions in the two genomes, we needed to artificially
invert and/or transpose ten segments, ranging in size from 1 kilobase (kb)
to 83 kb, of the J99 sequence. Most of the artificial end points are
in intergenic regions and most are associated with insertion elements, repeated
sequences or genes, and/or DNA-restriction/modification genes in one or both
of the genomes (Fig. 1, Table 2),
consistent with a possible role for such elements in generating these organizational
differences. Two differences between the genomes are associated with genes
encoding members of the large outer membrane protein (Omp) family. Inversion
5 in J99 could have resulted from a simple recombination across the inverted,
repeated nucleotide sequence encoding the carboxy-terminal domain of two Omp
proteins. Rearrangement 6 in J99 is the result of the equivalent of a reciprocal
exchange of the Lewis-antigen-binding adhesin genes babA and babB12;
BabA and BabB share similar C-terminal domains. The
complex rearrangements 8–10 in J99 consist of both inversions and translocations
(Fig. 1b). In both genomes, inversion 3 is associated
with a region of (G + C) percentage that is lower (35%) than in the
rest of the genome (39%). We named this region a 'plasticity zone'
because it contains 46% and 48% of the genes that are unique to 26695 and
J99, respectively (Fig. 1c). Although this region is
continuous in J99, it is split in 26695 into two domains that are separated
by approximately 600 kb. The presence of vir homologues, insertion sequences
and a lower percentage of (G + C) DNA indicates that these regions
might represent pathogenicity islands. The clustering of DNA with a lower
(G + C) percentage is suggestive of horizontal DNA transfer, and the
strain-specific sequence differences are consistent with different origins
for this DNA. H. pylori and Campylobacter spp. plasmids have
a (G + C) percentage in this lower range13. Significantly,
two copies of the insertion-sequence 605 element and neighbouring 26695-strain-specific
chromosomal DNA from the plasticity zone (genes HP0999–HP1001) are present
on the H. pylori plasmid pHPM186 (GenBank accession number AF077006).
Thus, plasmids may be responsible for the integration of new DNA into the
H. pylori chromosome and for the transfer of this DNA between strains.
Recombination across inverted repeats (repeat 7; ref. 5) in a progenitor
strain that resembles strain 26695 would yield
an arrangement similar to that of the region of inversion 3A in strain J99,
but a similar reciprocal event cannot account for the complexity of the J99
3B locus (Fig. 1c).
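A region of reduced (G + C) percentage, such as the plasticity zone described above, can be located with a simple window scan. The sketch below is illustrative only: the function names, window size, threshold and toy sequence are assumptions, and the sequence is treated as linear, whereas a scan of the real circular chromosome would also need windows that wrap around the origin.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def low_gc_windows(seq, window, step, threshold):
    """Return (start, gc) for each window whose G+C percentage is below threshold."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        gc = gc_percent(seq[start:start + window])
        if gc < threshold:
            hits.append((start, gc))
    return hits

# Hypothetical toy sequence: 40% G+C background with a 20% G+C insert.
background = "AATGC" * 20
island = "AAATTAATGC" * 10
toy = background + island + background

hits = low_gc_windows(toy, window=100, step=100, threshold=37.0)
print(hits)
```

With real data the window would be on the order of kilobases, and the threshold would be chosen between the two values reported in the text (35% and 39%).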
Table 2: Elements associated with the artificial end points required to align the two H. pylori chromosomes
To confirm the assembled sequence, we studied the J99 genome by pulsed-field gel electrophoresis (PFGE) and hybridization with specific probes (for J99 genes JHP117, 312, 548, 663, 733 and 1133–1136). Each observed NotI fragment was consistent in size with that predicted by the sequence. Hybridization of the 26695 genome in silico with these same probes yielded NotI fragments that were different in size from those observed with J99 DNA. Differences similar to those which we observed in restriction-fragment sizes and in probe hybridization patterns have been interpreted to mean that H. pylori strains are highly diverse in their genomic organization and gene order14. The differences in sizes of NotI fragments in strains J99 and 26695 are due mainly to silent nucleotide variation within genes. J99 contains twice as many NotI sites as 26695 (Fig. 1a), and silent nucleotide changes in 26695 are responsible for the absence of six of the seven NotI sites unique to J99; differences at the seventh site result in the alteration of a single amino acid. Similar minor sequence differences account for the variability in the NruI-site content between the strains. Thus, results obtained with lower-resolution techniques such as PFGE and polymerase chain reaction (PCR)–restriction-fragment length polymorphism (RFLP) have probably led to an overestimation of the true extent of genetic diversity in H. pylori9,10,14. However, these techniques will continue to be useful for epidemiology and strain discrimination.
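The effect of a single silent change on a restriction pattern can be shown with an in silico digest of a circular sequence. This is a minimal sketch and not the procedure used in this study: the function names and toy sequences are hypothetical, and the only external fact used is the NotI recognition sequence, GCGGCCGC.

```python
import re

NOTI = "GCGGCCGC"  # NotI recognition sequence

def site_positions(genome, site=NOTI):
    """Start positions of all sites on a circular sequence."""
    # Append the first len(site) - 1 bases so sites spanning the origin are found.
    extended = genome + genome[:len(site) - 1]
    return [m.start() for m in re.finditer(f"(?={site})", extended)]

def fragment_sizes(genome, site=NOTI):
    """Sorted fragment sizes after complete digestion of a circular sequence."""
    pos = site_positions(genome, site)
    n = len(genome)
    if not pos:
        return [n]  # uncut circle
    # Distance from each site to the next one around the circle; a single
    # site linearizes the circle into one fragment of full length.
    return sorted((pos[(i + 1) % len(pos)] - p) % n or n for i, p in enumerate(pos))

# Hypothetical toy circle with two NotI sites.
genome_a = "A" * 20 + NOTI + "A" * 40 + NOTI + "A" * 10
# Same circle with GCC changed to GCT inside the second site; read as a codon,
# both triplets encode alanine, so the change can be silent.
genome_b = genome_a[:68] + "GCGGCTGC" + genome_a[76:]

print(fragment_sizes(genome_a), fragment_sizes(genome_b))
```

One synonymous substitution merges two fragments into one, which is how silent variation can give very different PFGE patterns for genomes with the same gene order.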
To estimate the degree of conservation of gene order between J99 and 26695, we studied the immediate neighbours of each J99 gene and its 26695 orthologue, if present. Of the 1,495 genes in J99, 1,267 (84.7%) have the same neighbour on each side in both genomes; 161 (10.8%) are flanked by one common neighbour and one strain-specific gene; and 40 (2.7%) are flanked by strain-specific genes on both sides. Only 27 (1.8%) have the same neighbour on one side and a common gene that appears in a different position on the other side as the result of an organizational difference. There are 9 conserved gene strings that are more than 50 genes long, representing 46% of the genes common to both strains, with the longest string containing 133 genes. This highly conserved gene order indicates that physical linkage of a few genes (topA/flaB15 and ftsH/pss/copA (D.E.T., unpublished observations)) in several strains is the rule rather than the exception. The absence of extensive gene shuffling between J99 and 26695 is consistent with a low level of evolutionary divergence16.
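The neighbour comparison can be sketched as follows. This is a simplified illustration, not the analysis performed here: it only counts, for each gene that has an orthologue, how many of its two immediate neighbours are conserved, and it does not separate the strain-specific and rearranged cases distinguished above. Gene names, the orthologue map and the function names are hypothetical. The last lines recompute the quoted percentages from the quoted counts.

```python
from collections import Counter

def neighbours(order, i):
    """Left and right neighbour of position i on a circular gene order."""
    return order[i - 1], order[(i + 1) % len(order)]

def conserved_neighbour_counts(order_a, order_b, orthologue):
    """Tally genes of A by how many of their two neighbours are conserved in B."""
    index_b = {gene: i for i, gene in enumerate(order_b)}
    tally = Counter()
    for i, gene in enumerate(order_a):
        if gene not in orthologue:
            continue  # strain-specific gene: no counterpart to compare against
        mapped = {orthologue[g] for g in neighbours(order_a, i) if g in orthologue}
        in_b = set(neighbours(order_b, index_b[orthologue[gene]]))
        tally[len(mapped & in_b)] += 1  # unordered sets tolerate inversions
    return tally

# Hypothetical toy genomes: aX is specific to genome A.
order_a = ["a1", "a2", "a3", "a4", "aX", "a5"]
order_b = ["b1", "b2", "b3", "b4", "b5"]
orthologue = {f"a{i}": f"b{i}" for i in range(1, 6)}
tally = conserved_neighbour_counts(order_a, order_b, orthologue)
print(dict(tally))

# Percentages recomputed from the counts quoted in the text.
quoted = [1267, 161, 40, 27]
shares = [round(100 * n / 1495, 1) for n in quoted]
print(sum(quoted), shares)
```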
Of the 1,495 J99 genes and the 1,552 re-annotated 26695 genes, 874 (58.5%) and 895 (57.7%) gene products, respectively, have been assigned putative functions. A total of 275 (18.4%) J99 and 290 (18.7%) 26695 gene products have orthologues of unknown function in other species, and 346 (23.1%) J99 and 367 (23.6%) 26695 genes are H. pylori specific (that is, they show no sequence similarity with genes available in public databases). Of these H. pylori specific genes, 56 and 69 are specific to strains J99 and 26695, respectively. Excluding the strain-specific ORFs in the plasticity zone, the J99-specific genes are located singly (24 times) or as clusters of two (8 clusters) or three (1 cluster); many of these clusters appear to be organized to permit co-transcription. In one case, six genes (insertion-sequence 606 element and four J99-specific genes) are linked and are flanked by a duplicated region. In 17 corresponding locations, both J99 and 26695 have strain-specific genes. This high proportion of common relative loci for strain-specific ORFs indicates that H. pylori may have limited flexibility for containing strain-specific genes. Of the total of 206 strain-specific genes (89 in J99, 117 in 26695), the plasticity zones contain 94 (42 in J99, 52 in 26695); 125 of the 206 strain-specific genes (60.7%) are also specific to H. pylori, and 30 (14.6%) share similarity with genes of unknown function. J99- or 26695-specific genes have been assigned to the following categories: DNA restriction or modification (15 and 16, respectively); cell-envelope synthesis (4 and 2); cellular processes, such as DNA transfer and competence proteins (2 and 4); DNA replication (2 and 2); energy metabolism (2 and 1); and phospholipid metabolism (1 in 26695).
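The tallies in this paragraph are internally consistent, as the following sketch shows. It uses only numbers quoted above. One line rests on an assumption that is labelled in the code: that the J99-specific genes outside the plasticity zone are the singletons, the pairs, the triplet and the four genes linked to the insertion-sequence 606 element.

```python
totals = {"J99": 1495, "26695": 1552}
annotation = {
    "putative function": {"J99": 874, "26695": 895},
    "orthologue of unknown function": {"J99": 275, "26695": 290},
    "H. pylori specific": {"J99": 346, "26695": 367},
}
percent = {
    (label, strain): round(100 * n / totals[strain], 1)
    for label, counts in annotation.items()
    for strain, n in counts.items()
}

# Functional categories of strain-specific genes, as (J99, 26695) pairs.
categories = {
    "DNA restriction or modification": (15, 16),
    "cell-envelope synthesis": (4, 2),
    "cellular processes": (2, 4),
    "DNA replication": (2, 2),
    "energy metabolism": (2, 1),
    "phospholipid metabolism": (0, 1),
}
with_function = tuple(sum(pair[i] for pair in categories.values()) for i in (0, 1))

# Assumed reading: J99-specific genes outside the plasticity zone are 24 singletons,
# 8 pairs, 1 triplet and the 4 genes linked to the insertion-sequence 606 element.
j99_outside_zone = 24 + 8 * 2 + 1 * 3 + 4

for key, value in percent.items():
    print(key, value)
print(with_function, j99_outside_zone)
```

The category sums reproduce the 25 J99-specific and 26 strain-26695-specific genes with a predicted function reported earlier, and under the stated assumption the genes outside the plasticity zone account for the remaining 47 of the 89 J99-specific genes.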
The fact that strain-specific DNA-restriction/modification genes have a lower (G + C) content than the remainder of the genome and are associated with regions that are organized differently in the J99 and 26695 genomes indicates that these genes may have been acquired horizontally from other bacterial species or transferred more recently from other H. pylori strains by natural transformation. Each H. pylori strain contains its own specific complement of these restriction/modification enzymes (R.A.A., unpublished observations). Nine type II methyltransferases are conserved between the two strains but lack identifiable cognate restriction-subunit partners, indicating that H. pylori may regulate gene expression by methylation.
The strain-specific genes encoding proteins involved in cell envelope (lipopolysaccharide and outer membrane protein) biosynthesis represent members of four paralogous families. Each strain contains one unique member of the omp families (HP0317 and JHP870). J99 and 26695 contain two (JHP820 and JHP1032) and one (HP1578) unique member, respectively, as well as four common members, of the rfaI/rfaJ-like family which is involved in lipopolysaccharide biosynthesis. In addition, J99 contains a unique member (JHP562) plus three common members of the lex2B lipopolysaccharide-biosynthesis family.
One of the J99-specific genes involved in energy metabolism encodes a second homologue of alcohol dehydrogenase (JHP1429), and the other (JHP585) may be required for amino-acid degradation. The 26695-specific energy-metabolism gene (HP1045) encodes an acetyl-CoA synthase. Strain 26695 has a second, larger acyl-carrier protein (encoded by HP0962) which is involved in phospholipid metabolism. J99 and 26695 have two (JHP919 and JHP931) and one (HP0440) unique genes encoding topoisomerase homologues, respectively, in their plasticity zones, which also contain the strain-specific genes that encode proteins involved in cellular processes. J99 has two adjacent virB4 homologues (JHP917 and JHP918) which may have once represented a single complete gene, whereas 26695 contains two complete virB4 (HP0441 and HP0459) and one truncated virD4 (HP1006) homologues and a protein (encoded by HP0432) with similarity to a human protein kinase C.
The identification of homopolymeric tracts and dinucleotide repeats in
H. pylori led to the prediction that 'slipped-strand repair'
may modulate gene expression5, which could result in antigenic
variation and adaptive evolution. The J99 gene sequences do not support some
previously proposed examples of genes which are regulated in this fashion
(for example, HP0211 and HP0928)17. In other cases, the data
obtained from J99 do support this mechanism of control. Repeat lengths in
some J99 genes differ from those in 26695 genes, indicating that such genes
may be differently expressed in the two strains (Table 3).
The same five members of the large omp paralogous family contain CT
dinucleotide repeats in both strains, but the number of repeats differs without
affecting the predicted expression status. The comparative data indicate that
slipped-strand regulation may operate at two sites in some genes, including
the α-(1,3)-fucosyltransferase gene. This regulatory mechanism also
operates during laboratory passage of cell cultures: we found changes in the
lengths of specific homopolymeric tracts or dinucleotide repeats within different
populations of strain J99 (Table 3). We also detected
nucleotide substitutions, most of which were found in the third position of
coding triplets, at a low frequency.
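How a change in repeat length can switch a gene between in-frame and out-of-frame states is sketched below. The example is hypothetical throughout: the toy gene, the function names and the minimum tract lengths are assumptions chosen for illustration, and the frame test is a deliberately crude proxy for predicted expression status.

```python
import re

STOPS = {"TAA", "TAG", "TGA"}

def repeat_tracts(seq, min_homopolymer=8, min_dinucleotide=5):
    """Return (start, unit, copies) for homopolymeric tracts and dinucleotide repeats."""
    tracts = []
    for m in re.finditer(r"([ACGT])\1{%d,}" % (min_homopolymer - 1), seq):
        tracts.append((m.start(), m.group(1), len(m.group(0))))
    for m in re.finditer(r"([ACGT]{2})\1{%d,}" % (min_dinucleotide - 1), seq):
        unit = m.group(1)
        if unit[0] != unit[1]:  # skip homopolymers matched as dinucleotides
            tracts.append((m.start(), unit, len(m.group(0)) // 2))
    return sorted(tracts)

def reading_frame_intact(cds):
    """True if the sequence is a whole number of codons ending at its only stop."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    return (len(cds) % 3 == 0 and codons[-1] in STOPS
            and not any(c in STOPS for c in codons[:-1]))

def toy_gene(ct_copies):
    """Hypothetical gene with a CT dinucleotide repeat after the start codon."""
    return "ATG" + "CT" * ct_copies + "GGAAAATAA"

for copies in (6, 7, 8, 9):
    gene = toy_gene(copies)
    print(copies, repeat_tracts(gene), reading_frame_intact(gene))
```

In this toy gene only repeat numbers that keep the total length a multiple of three leave the frame intact, which is one way that a difference in repeat number between strains could correspond to a difference in expression.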
Several factors could influence the pathophysiology and severity of disease associated with infection by different cagA+ H. pylori strains. First, strain-specific genes, such as those associated with the plasticity zone, could play a role. Second, differences in gene expression, perhaps mediated by slipped-strand repair, may be important and may affect the ability of the organism to colonize. Third, a human host factor(s) may play a significant, and perhaps unappreciated, part in susceptibility to, and severity of, H. pylori infection. In any host–parasite relationship, bacterial, host and environmental factors influence the host's susceptibility to and the clinical outcome of infection. For example, different mouse strains exhibit markedly different susceptibilities to H. pylori colonization and clinical outcome18. Different human populations also show differences in susceptibility to H. pylori infection and incidence rates for gastric cancer19. Our identification of the minimal genetic diversity between two virulent strains, genes that are conserved between the two strains, and the strain-specific plasticity zone allows a better understanding of the biology of H. pylori. Our results suggest the need, and provide a unique opportunity, for a reassessment of the respective roles of bacterial and host factors in diseases associated with H. pylori.
Methods
H. pylori strain J99 was sequenced, assembled and analysed essentially as described14,20. The sequences of regions that differ significantly between strains J99 and 26695, including putative frameshifts, were all confirmed by sequencing PCR products of J99 and, where relevant, by diagnostic PCR of 26695. The nucleotide and amino-acid alignments used to determine the identity between orthologues were generated by ALIGN from version 2.0 of the FASTA program package. Paralogues were identified using BLASTP and TBLASTX algorithms. The output was initially grouped such that all members of a family exhibited similarity to at least one other member, using a cut-off of P < 10^-10, and then checked manually for validity.
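The initial grouping step, in which every member of a family is similar to at least one other member, corresponds to single-linkage clustering. The sketch below shows one way to implement it with a union-find structure. It is an assumption about how such grouping could be done, not a description of the software used here; the gene names and P values are hypothetical, and the manual validity check is not modelled.

```python
def group_families(hits, cutoff=1e-10):
    """Single-linkage grouping of (gene_a, gene_b, p_value) similarity hits."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, p in hits:
        if a != b and p < cutoff:  # ignore self-hits and weak hits
            parent[find(a)] = find(b)

    families = {}
    for gene in parent:
        families.setdefault(find(gene), set()).add(gene)
    return sorted((sorted(f) for f in families.values()), key=lambda f: f[0])

# Hypothetical hits: g3 and g4 are similar only above the cut-off.
hits = [
    ("g1", "g2", 1e-30),
    ("g2", "g3", 1e-12),
    ("g4", "g5", 1e-50),
    ("g3", "g4", 1e-5),
    ("g6", "g6", 0.0),
]
families = group_families(hits)
print(families)
```

Because linkage is transitive, g1 and g3 fall in the same family even though no direct hit between them is listed, which is why a manual check of the resulting groups remains useful.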


